Data Engineer (Spark Oriented) Resume
SUMMARY
- 7 years of IT experience, including 3 years in the Hadoop ecosystem and 4 years as a data analyst.
- Expertise in the Hadoop ecosystem, including Spark, HDFS, MapReduce, YARN, Hive, Pig, Kafka, and ZooKeeper.
- Hands-on experience developing applications on Spark using Spark Core, Spark SQL, and Spark Streaming.
- Used various Spark transformations, including mapToPair, filter, flatMap, groupByKey, sortByKey, join, cogroup, union, repartition, coalesce, takeSample, distinct, intersection, mapPartitions, and mapPartitionsWithIndex, along with actions, to cleanse input data (see the sketch after this list).
- Tuned Spark clusters to process large data sets.
- Experience developing Pig scripts and Hive Query Language (HiveQL) queries.
- Hands-on experience in application development using Java, RDBMS, and Linux shell scripting.
- Involved in collecting and aggregating large amounts of log data using Apache Flume.
- Hands-on experience using Sqoop to move data between RDBMS and HDFS in both directions.
- Extensive experience with version control tools such as Git.
- Experienced in Scala and Java development using IDEs such as Eclipse.
- Good understanding of HDFS design, daemons, NameNode federation, and HDFS high availability (HA).
- Experience working with Hadoop/Big Data storage and analytics frameworks on the Amazon AWS cloud, accessed with tools such as SSH and PuTTY.
- Experience with the scikit-learn, pandas, NumPy, Matplotlib, and Seaborn Python libraries throughout the development life cycle.
- Experience in modeling with machine learning algorithms: supervised learning (logistic regression, random forest, gradient boosting, neural networks) and unsupervised learning (clustering).
- Generated new datasets from imported raw data files and modified existing datasets using the SAS SET, MERGE, MODIFY, UPDATE, and APPEND statements as well as PROC SQL.
- Hands-on experience in SAS programming: merging SAS data sets, developing SAS procedures, preparing data, producing reports, and using SAS formats and functions to store and manage data in SAS files.
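As an illustration of the RDD cleansing work referenced above, here is a minimal PySpark sketch (the Spark work described in this resume was done in Scala/Java; the input path, the "user_id,action,bytes" record layout, and the aggregation are hypothetical, a sketch of the technique rather than the original code):

```python
from pyspark import SparkContext

sc = SparkContext(appName="cleansing-sketch")

# Hypothetical raw input: "user_id,action,bytes" records, one per line.
raw = sc.textFile("hdfs:///data/events/*.log")

cleansed = (raw
    .map(lambda line: line.split(","))                 # tokenize each record
    .filter(lambda f: len(f) == 3 and f[2].isdigit())  # drop malformed records
    .map(lambda f: (f[0], int(f[2])))                  # pair RDD: (user_id, bytes)
    .distinct()                                        # remove exact duplicates
    .reduceByKey(lambda a, b: a + b)                   # aggregate bytes per user
    .sortByKey()                                       # order by user_id
    .coalesce(8))                                      # fewer, larger output files

# Action: materialize the pipeline and write the cleansed data back to HDFS.
cleansed.saveAsTextFile("hdfs:///data/events_cleansed")
```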
TECHNICAL SKILLS
Programming Languages: Java, Scala, SAS, R, Python
Big Data technologies: Spark, Hive, Pig, HBase, Sqoop, Flume, Kafka, ZooKeeper, HDFS, MapReduce
Databases: MySQL
Cloud platform: AWS
IDEs: Eclipse, RStudio, Jupyter Notebook
Operating Systems: Windows, Linux (CentOS, Ubuntu, Red Hat)
Build Tools: Maven
Version Control: Git
Machine learning: Random Forest, Logistic Regression, Linear Regression
Statistics: ANOVA, two-sample t-test
PROFESSIONAL EXPERIENCE
Confidential
Data Engineer (Spark Oriented)
Responsibilities:
- Performed Spark performance tuning, including JVM, shuffle, and transformation-level tuning.
- Troubleshot cluster issues, including data skew.
- Worked with Hadoop ecosystem components such as HDFS, Spark, Hive, Pig, and ZooKeeper, plus shell scripting.
- Created Hive tables and loaded and analyzed data using Hive queries.
- Implemented performance tuning using partitioning, broadcast variables, efficient joins, and pair RDDs (see the broadcast-join sketch after this list).
- Optimized algorithms on Hadoop using SparkContext, Spark SQL, and DataFrames.
- Implemented Spark RDD transformations and actions to carry out business analysis.
- Migrated required data from MySQL into HDFS using Sqoop and imported flat files of various formats into HDFS.
- Good working knowledge of Linux shell environments and command-line utilities.
- Hands-on experience with Spark Streaming consuming from Kafka.
- Validated incoming DStreams, derived new DStreams from them, and saved the data into HDFS (see the streaming sketch below).
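A minimal PySpark sketch of the broadcast-join tuning pattern mentioned above (the file paths and the sales/category schema are hypothetical; the original work was done in Scala/Java against Spark 1.5):

```python
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-join-sketch")

# Large pair RDD: (product_id, sale_amount) -- hypothetical schema.
sales = (sc.textFile("hdfs:///data/sales.csv")
           .map(lambda line: line.split(","))
           .map(lambda f: (f[0], float(f[1]))))

# Small lookup table (product_id -> category), collected to the driver and
# broadcast to every executor so the join needs no shuffle of the large RDD.
categories = dict(sc.textFile("hdfs:///data/categories.csv")
                    .map(lambda line: tuple(line.split(",")[:2]))
                    .collect())
bcast = sc.broadcast(categories)

# Map-side join: each task reads the broadcast dict locally.
by_category = (sales
    .map(lambda kv: (bcast.value.get(kv[0], "unknown"), kv[1]))
    .reduceByKey(lambda a, b: a + b))

print(by_category.take(10))
```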
Environment: Hadoop 2.0, Spark 1.5 (Core, SQL, Streaming), Java 1.7, Scala 2.11, MySQL, Hive 0.13, CDH 5, Flume 1.5, Zookeeper 3.4.5, Kafka 2.9, Eclipse
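And a sketch of the Kafka-to-HDFS DStream flow from the last two bullets, using the receiver-based KafkaUtils.createStream API that shipped with Spark 1.5 (the ZooKeeper quorum, consumer group, topic name, and output path are hypothetical):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-dstream-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Receiver-based Kafka consumer (Spark 1.x); records arrive as (key, message) pairs.
stream = KafkaUtils.createStream(ssc, "zk-host:2181", "consumer-group", {"events": 1})

# Validate the incoming DStream, then derive a new DStream from it.
valid = stream.map(lambda kv: kv[1]).filter(lambda msg: msg and "," in msg)

# Persist each micro-batch to HDFS.
valid.saveAsTextFiles("hdfs:///data/streams/events")

ssc.start()
ssc.awaitTermination()
```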
Confidential, Plano, TX
Integration Engineer
Responsibilities:
- Collected data from four data sources, including Confidential probes, EBM, BS, and Intel CPE.
- Maintained health checks for all five layers of EEA.
- Generated health check reports for all data sources.
- Reworked the hard-coded values in the shell scripts.
- Wrote shell scripts and SQL, and created test data.
- Troubleshot script issues and performed design and code reviews.
- Implemented data integrity and data quality checks using shell scripts (see the sketch after this list).
- Worked with Hadoop ecosystem components such as HDFS, Hive, Pig, HBase, and ZooKeeper, plus shell scripting.
- Experience working with Hadoop/Big Data storage and analytics tooling, accessed with tools such as SSH and PuTTY.
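The integrity and quality checks above were implemented in shell; as an illustration, here is a minimal equivalent in Python (the pipe-delimited layout, column count, and null-rate threshold are hypothetical):

```python
import csv
import sys

EXPECTED_COLUMNS = 5   # hypothetical schema width
MAX_NULL_RATE = 0.05   # fail a column if more than 5% of its values are empty

def check_file(path):
    rows, bad_width = 0, 0
    null_counts = [0] * EXPECTED_COLUMNS
    with open(path, newline="") as fh:
        for record in csv.reader(fh, delimiter="|"):
            rows += 1
            if len(record) != EXPECTED_COLUMNS:
                bad_width += 1
                continue
            for i, field in enumerate(record):
                if not field.strip():
                    null_counts[i] += 1
    assert rows > 0, "empty file"
    # Integrity check: every record has the expected width.
    assert bad_width == 0, f"{bad_width}/{rows} records have a wrong column count"
    # Quality check: no column exceeds the allowed null rate.
    for i, n in enumerate(null_counts):
        assert n / rows <= MAX_NULL_RATE, f"column {i} null rate is {n / rows:.1%}"
    print(f"{path}: {rows} records passed")

if __name__ == "__main__":
    check_file(sys.argv[1])
```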
Environment: Hadoop 2.0, Flume 1.7, Zookeeper 3.4.11, PuTTY 0.7, WinSCP 5.11.1
Confidential, Dallas, TX
Data Programmer
Responsibilities:
- Building Predictive Models for University Rankings using R
- Established predictive models, association rules, and cluster analyses based on real-world university rankings, and delivered the important features along with prediction results.
- Predictive models: random forest, artificial neural network, and gradient-boosted decision trees.
- Using Random Forest and Logistic Regression to Predict Student Romance in Python
- Built random forest and logistic regression models on a multivariate dataset of students' romance status (yes/no) and delivered the important variables along with prediction results on test data (see the sketch after this list).
- ANOVA and SLR Analysis of Dementia Data using SAS
- Established a linear model based on a multivariate dataset of clinical patients (non-demented, converted, demented) profiled by OASIS (Open Access Series of Imaging Studies) using SAS, and provided potential clues for clinical prediction.
- Methods: Kruskal-Wallis, rank-sum test, post-hoc comparisons, planned comparisons, linear regression
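A minimal scikit-learn sketch of the random forest / logistic regression comparison from the student-romance project (the students.csv file, the "romantic" label column, and the assumption that all features are numeric are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical multivariate dataset with numeric features and a binary label.
df = pd.read_csv("students.csv")
X = df.drop(columns=["romantic"])
y = df["romantic"]  # "yes" / "no"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
lr = LogisticRegression(max_iter=1000)
for model in (rf, lr):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, f"test accuracy: {acc:.3f}")

# Important variables, as reported from the random forest.
ranked = sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1])
for name, importance in ranked[:5]:
    print(name, round(importance, 3))
```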
Confidential, Dallas, TX
Data Programmer
Responsibilities:
- Modeled with machine learning algorithms: supervised learning (logistic regression, random forest, gradient boosting, neural networks) and unsupervised learning (clustering).
- Developed code in R, Python, SAS, SPSS, and Excel.
- Used the scikit-learn, pandas, NumPy, Matplotlib, and Seaborn Python libraries throughout the development life cycle (see the sketch after this list).
- Developed tools to support strategic business decision-making and forecasting.
- Worked with large datasets using advanced SQL-based data analysis.
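A small pandas/NumPy sketch of the kind of data preparation and visualization implied above (the sales.csv file and its order_date, region, and revenue columns are hypothetical):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical sales dataset.
df = pd.read_csv("sales.csv", parse_dates=["order_date"])

# Basic cleaning: drop duplicates, fill numeric gaps with column medians.
df = df.drop_duplicates()
numeric = df.select_dtypes(include=[np.number]).columns
df[numeric] = df[numeric].fillna(df[numeric].median())

# Aggregate for reporting: monthly revenue per region.
monthly = (df.assign(month=df["order_date"].dt.to_period("M"))
             .groupby(["month", "region"])["revenue"].sum()
             .reset_index())
print(monthly.head())

# Quick visual check of the revenue distribution.
sns.histplot(df["revenue"], bins=30)
plt.title("Revenue distribution")
plt.savefig("revenue_hist.png")
```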
Environment: RStudio 0.99, Python 3.5, SAS 9
Confidential
Data Engineer
Responsibilities:
- Used Pig as an ETL tool for transformations, joins, and pre-aggregations before storing data in HDFS.
- Architected Big Data analytics solutions across multiple platforms.
- Developed complex SQL queries, stored procedures, and functions.
- Installed, configured, and updated Linux machines running Ubuntu and CentOS; deployed and installed applications.
- Ingested streaming data into Hadoop clusters using Flume and Kafka.
- Built predictive models using logistic regression, random forest, multilayer perceptron, and SVM (see the sketch after this list).
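A minimal scikit-learn sketch covering the SVM and multilayer perceptron models from the last bullet (the customers.csv file and "churn" target are hypothetical; scaling is included because both models are sensitive to feature scale):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical dataset with numeric features and a binary "churn" target.
df = pd.read_csv("customers.csv")
X, y = df.drop(columns=["churn"]), df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize features inside a pipeline, then fit and score each model.
for clf in (SVC(kernel="rbf", C=1.0),
            MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)):
    pipe = make_pipeline(StandardScaler(), clf)
    pipe.fit(X_train, y_train)
    print(type(clf).__name__, "test accuracy:", round(pipe.score(X_test, y_test), 3))
```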
Environment: Python 3.3, Jupyter Notebook 4, Hadoop 2.0, Spark 1.0, Java 1.7, Scala 2.11, MySQL, CDH 3, Hive 0.12, Flume 1.5, Zookeeper 3.4.5, Kafka 2.9, Eclipse
Confidential
Data Analyst
Responsibilities:
- Consulted with the department director on the use and refinement of fundraising solutions for non-profit groups via a SAS macro-based finance model.
- Made good use of statistical procedures including PROC CONTENTS, PROC FREQ, PROC MEANS, PROC REPORT, PROC GPLOT, PROC BOXPLOT, PROC CORR, PROC GLM, PROC ANOVA, PROC LOGISTIC, and other SAS/STAT and SAS/GRAPH procedures.
- Experienced in producing RTF, HTML, and PDF files using SAS ODS; well versed in creating HTML reports for financial data with the SAS ODS facility.
Environment: SAS 9, SPSS 19
