Big Data Engineer Resume
Piscataway, NJ
SUMMARY:
- Highly motivated and data-oriented Big Data Engineer and Data Analyst with working experience in domains including healthcare, social media, and e-commerce.
- Solid understanding of Hadoop Ecosystem components such as HDFS, YARN, ZooKeeper, Spark, Kafka, HBase, Hive, Pig, Sqoop, Flume, Storm, Flink, Oozie, NiFi, Kudu, Impala, etc.
- In-depth knowledge of algorithms, data structures, object-oriented design with Core Java, and the functional programming paradigm with Scala.
- Worked on migrating on-premise technologies into Google Cloud Platform (GCP) using Cloud Storage, Cloud Spanner, DataStore, DataFlow, DataProc, Pub/Sub, BigTable, BigQuery, etc.
- Familiarity with Amazon Web Service (AWS) services and components such as S3, RDS, RedShift, DynamoDB, EC2, EMR, Kinesis, Lambda, Glue, etc.
- Thorough understanding of SDLC (agile, scrum), data modeling (database normalization, ERD, OLTP), dimensional modeling (Star schema, Snowflake schema, OLAP/Data Warehouse), and Data Lake.
- Deep knowledge of traditional RDBMSs including Microsoft SQL Server, MySQL, Oracle, PostgreSQL, etc.
- Hands-on experience in developing consistent/highly available solutions using NoSQL databases such as MongoDB, Cassandra, and HBase, and search engines such as Elasticsearch and Solr.
- Extensive experience in converting Java MapReduce code into Spark Core and Spark DataFrame/Dataset code to boost performance (a minimal sketch follows this summary).
- Demonstrated ability to create visually appealing and interactive reports/dashboards using software such as PowerBI, Tableau, Jupyter Notebook, Zeppelin, R Shiny, etc.
- Converted and implemented conceptual business rules into T-SQL stored procedures and functions in conjunction with objects such as CTE, temporary tables, subqueries, etc.
- Competence with Pig Latin scripts and HiveQL queries for preprocessing and analyzing large volumes of data.
- Well-versed in collecting, processing and aggregating large amounts of real-time streaming data using Kafka and Spark Streaming.
- Proficient in writing distributed and scalable code using Spark components including Spark Core, Spark SQL, Spark MLlib.
- Competence with building scalable and durable data pipelines (including extract, transform, and load (ETL) process) using tools such as Spark, Kafka, Flume, HDFS, MongoDB, Cassandra, etc.
- Strong ability to implement statistical methods (A/B testing, ANOVA, GLM, PCA, time series analysis, spatial statistics, factor analysis, Bayesian statistics, MCMC) using R and Python for exploring hidden patterns in the data.
- Adept at applying machine learning techniques (linear/logistic regression, clustering, classification, association analysis, decision trees, random forests, XGBoost, NLP, SVM, neural networks, deep learning) using R, Python, and Spark MLlib to fulfill business needs.
- Expert in managing the full life cycle of data science projects, including data collection, data cleansing, exploratory data analysis, and predictive modeling using structured and unstructured data sources.
- Good team player but can also work independently in a fast-moving environment (agile or scrum).
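A minimal sketch of the MapReduce-to-Spark conversion mentioned above: the classic word count rewritten as Spark DataFrame code in Scala. The input path and application name are placeholders, not project-specific details.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split, length, desc}

// Minimal sketch: the classic MapReduce word count expressed as Spark DataFrame code.
// The HDFS path below is a placeholder, not an actual project path.
object WordCountDF {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("WordCountDF").getOrCreate()

    val lines = spark.read.textFile("hdfs:///data/sample.txt")

    // explode(split(...)) replaces the hand-written mapper; groupBy/count replaces the reducer,
    // letting Catalyst plan the shuffle and generate code instead of Java MapReduce boilerplate.
    val counts = lines
      .select(explode(split(col("value"), "\\s+")).as("word"))
      .filter(length(col("word")) > 0)
      .groupBy("word")
      .count()
      .orderBy(desc("count"))

    counts.show(20, truncate = false)
    spark.stop()
  }
}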
TECHNICAL SKILLS:
Hadoop/Spark Ecosystem: Hadoop 2.x, MapReduce, Spark 2.x, Pig 0.12, Hive 0.14, Sqoop 1.4.6, Flume 1.6.0, Kafka 0.9.x, YARN, Mesos, ZooKeeper 3.4.x
Database: Oracle 11g, MySQL 5.x, HBase 0.98, Cassandra 2.1.x, MongoDB 3.2, Microsoft SQL Server 12.0
Programming Language: Java 10.0, Scala 2.12, Spark, Python, SQL, Unix/Bash shell, R 3.6.0, Stata, SAS 9.4
Operating System: Linux, Mac OS, Windows
Cloud Platform: Amazon Web Services (EC2, EMR, S3, DynamoDB, RedShift), Google Cloud Platform (BigQuery, BigTable, DataProc, Pub/Sub, DataFlow, DataStore)
Environment & Tools: Git/GitHub, Agile/Scrum, IntelliJ IDEA, Eclipse, Cloudera (CDH), Hortonworks (HDP)
Packages: Python - Pandas, NumPy, scikit-learn, statsmodels, SciPy, Scrapy, Beautiful Soup, Seaborn, Matplotlib; R - rpart, e1071, dplyr, tidyr, reshape2, stats, caret, ggplot2, shiny
Machine Learning Algorithms: Naïve Bayes, Decision Trees, Linear and Logistic Regression, SVM, KNN, Random Forest, k-means, LDA, Bagging, Gradient Boosting, XGBoost, Time Series
PROFESSIONAL EXPERIENCE:
Confidential, Piscataway, NJ
Big Data Engineer
Responsibilities:
- Worked with business analysts and front-end engineers to translate business requirements into building scalable and reliable data pipelines.
- Collected logs (user clicks, user search, user orders, etc.) from multiple web servers using Flume and directed web server logs to Kafka.
- Configured HDFS and Cassandra as Kafka consumers to ingest collected web server logs.
- Calculated offline page conversion rate and top 10 popular products using Spark Core and Spark SQL.
- Tuned YARN to manage resources inside clusters and set up ZooKeeper to coordinate Kafka's internal mechanisms.
- Integrated Spark Streaming and Kafka to compute real-time statistics on advertisement traffic and the most popular products in the last hour (see the sketch at the end of this section).
- Designed stratified random sampling rules to extract subsets from web server logs for targeted user sessions using Spark Core.
- Populated previously calculated offline and real-time statistics into AWS RedShift for the BI team.
- Utilized Spark tuning techniques such as RDD persistence, broadcast variable, accumulator, data serialization, etc. to boost performance.
- Converted raw data into Parquet and Avro to reduce network traffic and enable faster data processing.
Environment: Hadoop, HDFS, Kafka 0.9.x, Spark 2.x, Spark Streaming, Spark SQL, Zookeeper 3.4.x, YARN, Cassandra, AWS RedShift, Git, Linux
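A minimal sketch of the Kafka and Spark Streaming integration described in this section, using the direct (receiver-less) approach and a one-hour sliding window to surface the most popular products. It assumes the spark-streaming-kafka-0-10 connector; the broker address, topic name, checkpoint directory, and record layout ("userId,productId,action") are illustrative assumptions.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Sketch of the direct (receiver-less) Kafka + Spark Streaming integration; broker, topic,
// checkpoint directory, and record layout are placeholders for illustration only.
object PopularProducts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("PopularProducts")
    val ssc = new StreamingContext(conf, Seconds(60))          // 60-second batches
    ssc.checkpoint("hdfs:///checkpoints/popular-products")     // placeholder checkpoint dir

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",                   // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "ad-traffic-consumers",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("web-logs"), kafkaParams))

    // Count product views over a sliding one-hour window, refreshed every minute,
    // and print the 10 most popular products per window.
    stream.map(_.value.split(","))
      .filter(f => f.length == 3 && f(2) == "view")
      .map(f => (f(1), 1L))
      .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(3600), Seconds(60))
      .foreachRDD { rdd =>
        rdd.sortBy(_._2, ascending = false).take(10).foreach(println)
      }

    ssc.start()
    ssc.awaitTermination()
  }
}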
Confidential
Big Data Engineer
Responsibilities:
- Collaborated with software engineers to ingest customer data from the website and periodically reported project progress to the team lead.
- Published web server logs from Flume sinks into Kafka brokers and set up HDFS and MongoDB as Kafka consumers to collect them.
- Applied the direct approach to integrate Kafka and Spark Streaming and ran online recommendation algorithms using Spark Streaming.
- Developed and implemented offline recommendation algorithms such as Alternating Least Squares (ALS) using Spark MLlib (see the sketch at the end of this section).
- Retrieved customers’ data from MongoDB, calculated product similarity and average product rating using Spark SQL.
- Stored previously calculated algorithm results and statistics in Kudu and queried them with Impala (fast data for fast analytics) for the analytics team.
- Wrote highly optimized Spark SQL code to perform extraction, transformation, and loading on a daily basis.
- Utilized compression format such as Snappy and GZIP to reduce network overhead and enhance throughput.
- Used Git for version control and JIRA for project tracking.
- Participated in the software development lifecycle, including scoping, design, implementation, testing, and code reviews.
Environment: Hadoop, HDFS, Kafka 0.9.x, Spark 2.x, Spark Streaming, Spark SQL, Spark MLlib, MongoDB, Zookeeper 3.4.x, YARN, Git, JIRA, Linux
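A minimal sketch of the offline ALS recommender described in this section, using the DataFrame-based Spark MLlib API in Scala. The ratings path and the column names userId, productId, and rating are placeholders rather than the project's actual schema.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Sketch of the offline ALS recommender; input path and column names are placeholders.
object OfflineRecommender {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("OfflineRecommender").getOrCreate()

    // Ratings previously extracted from the source store (here assumed to be in Parquet).
    val ratings = spark.read.parquet("hdfs:///warehouse/ratings")
    val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2), seed = 42L)

    val als = new ALS()
      .setRank(10)
      .setMaxIter(10)
      .setRegParam(0.1)
      .setUserCol("userId")
      .setItemCol("productId")
      .setRatingCol("rating")
      .setColdStartStrategy("drop")   // drop NaN predictions for unseen users/products

    val model = als.fit(training)

    // Standard RMSE check on the held-out split, then top-10 product recommendations per user.
    val rmse = new RegressionEvaluator()
      .setMetricName("rmse")
      .setLabelCol("rating")
      .setPredictionCol("prediction")
      .evaluate(model.transform(test))
    println(s"Test RMSE = $rmse")

    model.recommendForAllUsers(10).show(5, truncate = false)
    spark.stop()
  }
}

Setting the cold-start strategy to "drop" removes predictions for users or products unseen during training, which keeps the RMSE evaluation meaningful.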
Confidential
Data Science Engineer
Responsibilities:
- Integrated data from various sources including customer behavior data, transactional data, portfolio, etc. by querying and processing large volumes of data using Hive on HDFS.
- Collected product information from a Cassandra database.
- Performed exploratory data analysis and data preprocessing on order history using R packages such as dplyr and tidyr in SparkR environment.
- Extracted patterns in the structured and unstructured data sets and displayed them with interactive charts using ggplot2 and ggiraph packages in R.
- Built initial models using supervised classification techniques such as K-Nearest Neighbor (KNN), Logistic Regression, Random Forests and Majority Voting Algorithm.
- Measured feature correlation using the Pearson Correlation Coefficient (PCC) to identify highly correlated features for dimensionality reduction.
- Used K-Fold cross validation to overcome the problem of overfitting.
- Built models using K-means clustering to create user groups (see the sketch at the end of this section).
- Used item-based Collaborative Filtering Algorithm to improve the prediction accuracy.
- Created a hybrid model to support recommendations for new users as well as existing users with changing trends.
- Used RMSE and Mean Average Precision to evaluate recommender’s performance.
- Participated in deploying the model in production and monitored user activity and add-on sales from items that were recommended without being searched.
- Used the results to tune the model parameters and rebuild the model.
- Created visualizations to convey results and analyze data using Tableau.
Environment: R 3.3, AWS S3, AWS EC2, Apache Hadoop 2.0, Apache Spark 2.x, Apache Hive, Apache HBase 1.1, SparkR, Cassandra 3.1, Tableau, Linux
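The user-grouping step in this section was implemented in the R/SparkR environment; purely as an illustration, the sketch below shows an equivalent K-means segmentation using the Spark MLlib API in Scala, with a hypothetical input path and hypothetical feature columns.

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

// Illustrative Scala/Spark MLlib version of the K-means user-grouping step (the original
// work used R/SparkR). Input path and feature column names are hypothetical.
object UserSegments {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("UserSegments").getOrCreate()

    val users = spark.read.parquet("hdfs:///warehouse/user_features")

    // Assemble numeric behavior features into a single vector column for clustering.
    val featureDF = new VectorAssembler()
      .setInputCols(Array("orderCount", "avgOrderValue", "daysSinceLastOrder"))
      .setOutputCol("features")
      .transform(users)

    val model = new KMeans().setK(5).setSeed(1L).setFeaturesCol("features").fit(featureDF)

    // Attach the cluster id to each user and inspect group sizes.
    model.transform(featureDF).groupBy("prediction").count().show()
    spark.stop()
  }
}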
Confidential
Research Data Analyst
Responsibilities:
- Developed the R package EL2Surv for survival analysis, debugged code relevant to the package, tested it with alcoholic hepatitis data, and conducted a literature review on non-parametric Kaplan-Meier-type statistics for the research publication.
- Employed R to analyze and visualize the distribution of extracted data using descriptive statistics, histograms, density plots, QQ plots and so on.
- Regularly utilized R Markdown (LaTeX syntax) to deliver visually appealing reports to support research team for understanding the shape of data and uncovering possible anomalies.
- Applied principal component analysis in R on 19 surrogate points, each associated with a pressure pain threshold value, to identify the most sensitive point(s) representing the pain caused by fibromyalgia.
- Extended Vardi's (1982) multiplicative censorship model via the non-parametric empirical likelihood framework and applied results from Ning et al. (2013) to derive the chi-square distribution of the test statistics.
Confidential
Research Data Analyst
Responsibilities:
- Performed data cleansing (e.g., filling missing values), data transformation (e.g., creating dummy variables), data normalization (e.g., rescaling and standardizing), and more in Stata to match the formats required by distinct econometric models.
- Preprocessed a variety of datasets in labor economics concerning women’s education and fertility using Stata.
- Devised algorithms for poorly behaved survey datasets, where the variable names changed every 3 years, to enhance efficiency, cutting processing time by 50% compared with earlier work.
Confidential
Research Data Analyst
Responsibilities:
- Wrote complex T-SQL queries, user-defined functions, and user-defined stored procedures to extract information from Taiwan's National Health Insurance Research Database (covering 99.99% of the Taiwanese population, 22 million people, over 5 TB across 7 years), and loaded the transformed data into regression models to explore drug prescription behaviors among senior and junior hospital physicians and to examine socially contagious behaviors within each hospital.
- Integrated resources from both private and public databases along with the NHIRD data to decipher the encrypted hospital names, helping to identify potential focal physicians in each hospital and to understand specific hospital policies, which increased the domain knowledge needed for statistical inference.
- Coded complex formulas in Stata to generate variables used in statistical models, such as converting ICD-9 codes (each record contained three ICD-9 columns) into meaningful rates and calculating hospital physicians' work experience by accumulating the days each physician stayed at each hospital (many physicians switched jobs from one hospital to another).
- Analyzed social factors influencing medical center physicians' initial adoption (prescription) of duloxetine hydrochloride via Cox proportional hazards regression using Stata; the results were published in a journal.
- Conducted statistical analyses such as ANOVA, hypothesis testing, and survival analysis using R to examine the clinical efficacy of a new antidepressant drug release.