Data Engineer (Spark Oriented) Resume
SUMMARY
- 7 years of IT experience, including 3 years in the Hadoop ecosystem and 4 years as a data analyst.
- Expertise in the Hadoop ecosystem, including Spark, HDFS, MapReduce, YARN, Hive, Pig, Kafka, and ZooKeeper.
- Hands-on experience developing applications on Spark using Spark Core, Spark SQL, and Spark Streaming.
- Used various Spark transformations, including mapToPair, filter, flatMap, groupByKey, sortByKey, join, cogroup, union, repartition, coalesce, takeSample, distinct, intersection, mapPartitions, and mapPartitionsWithIndex, along with actions, to cleanse input data (see the sketch after this list).
- Tuned Spark clusters to process large data sets.
- Experience developing Pig scripts and Hive Query Language (HiveQL) queries.
- Hands-on experience in application development using Java, RDBMS, and Linux shell scripting.
- Involved in collecting and aggregating large amounts of log data using Apache Flume.
- Hands-on experience using Sqoop to move data between RDBMS and HDFS in both directions.
- Extensive experience with version control tools such as Git.
- Experienced in Scala and Java development using IDEs such as Eclipse.
- Good understanding of HDFS design, daemons, NameNode federation, and HDFS high availability (HA).
- Experience working with Hadoop/Big Data storage and analytics frameworks on the Amazon AWS cloud, accessed with tools such as SSH and PuTTY.
- Experience with the scikit-learn, pandas, NumPy, Matplotlib, and Seaborn Python libraries throughout the development life cycle.
- Experience in modeling with machine learning algorithms: supervised learning (logistic regression, random forest, gradient boosting, neural networks) and unsupervised learning (clustering).
- Generated new datasets from imported raw data files and modified existing datasets using the SAS SET, MERGE, MODIFY, UPDATE, and APPEND statements as well as PROC SQL.
- Hands-on experience in SAS programming: merging SAS data sets, developing SAS procedures, preparing data, producing reports, and using SAS formats and functions to store and manage data in SAS files.
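As an illustration of the RDD cleansing work referenced above, here is a minimal PySpark sketch (the Spark work described in this resume was done in Scala/Java; the input path, the "user_id,action,bytes" record layout, and the aggregation are hypothetical, a sketch of the technique rather than the original code):

```python
from pyspark import SparkContext

sc = SparkContext(appName="cleansing-sketch")

# Hypothetical raw input: "user_id,action,bytes" records, one per line.
raw = sc.textFile("hdfs:///data/events/*.log")

cleansed = (raw
    .map(lambda line: line.split(","))                 # tokenize each record
    .filter(lambda f: len(f) == 3 and f[2].isdigit())  # drop malformed records
    .map(lambda f: (f[0], int(f[2])))                  # pair RDD: (user_id, bytes)
    .distinct()                                        # remove exact duplicates
    .reduceByKey(lambda a, b: a + b)                   # aggregate bytes per user
    .sortByKey()                                       # order by user_id
    .coalesce(8))                                      # fewer, larger output files

# Action: materialize the pipeline and write the cleansed data back to HDFS.
cleansed.saveAsTextFile("hdfs:///data/events_cleansed")
```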
TECHNICAL SKILLS
Programming Languages: Java, Scala, SAS, R, Python
Big Data technologies: Spark, Hive, Pig, HBase, Sqoop, Flume, Kafka, ZooKeeper, HDFS, MapReduce
Databases: MySQL
Cloud platform: AWS
IDEs: Eclipse, RStudio, Jupyter Notebook
Operating Systems: Windows, Linux (CentOS, Ubuntu, Red Hat)
Build Tools: Maven
Version Control: Git
Machine learning: Random Forest, Logistic Regression, Linear Regression
Statistics: ANOVA, two-sample t-test
PROFESSIONAL EXPERIENCE
Confidential
Data Engineer (Spark Oriented)
Responsibilities:
- Performed Spark performance tuning, including JVM, shuffle, and transformation-level tuning.
- Troubleshot cluster issues, including data skew.
- Worked with Hadoop ecosystem components such as HDFS, Spark, Hive, Pig, and ZooKeeper, plus shell scripting.
- Created Hive tables and loaded and analyzed data using Hive queries.
- Implemented performance tuning using partitioning, broadcast variables, efficient joins, and pair RDDs (see the broadcast-join sketch after this list).
- Optimized algorithms on Hadoop using SparkContext, Spark SQL, and DataFrames.
- Implemented Spark RDD transformations and actions to carry out business analysis.
- Migrated required data from MySQL into HDFS using Sqoop and imported flat files of various formats into HDFS.
- Good working knowledge of Linux shell environments and command-line utilities.
- Hands-on experience with Spark Streaming consuming from Kafka.
- Validated incoming DStreams, derived new DStreams from them, and saved the data into HDFS (see the streaming sketch below).
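A minimal PySpark sketch of the broadcast-join tuning pattern mentioned above (the file paths and the sales/category schema are hypothetical; the original work was done in Scala/Java against Spark 1.5):

```python
from pyspark import SparkContext

sc = SparkContext(appName="broadcast-join-sketch")

# Large pair RDD: (product_id, sale_amount) -- hypothetical schema.
sales = (sc.textFile("hdfs:///data/sales.csv")
           .map(lambda line: line.split(","))
           .map(lambda f: (f[0], float(f[1]))))

# Small lookup table (product_id -> category), collected to the driver and
# broadcast to every executor so the join needs no shuffle of the large RDD.
categories = dict(sc.textFile("hdfs:///data/categories.csv")
                    .map(lambda line: tuple(line.split(",")[:2]))
                    .collect())
bcast = sc.broadcast(categories)

# Map-side join: each task reads the broadcast dict locally.
by_category = (sales
    .map(lambda kv: (bcast.value.get(kv[0], "unknown"), kv[1]))
    .reduceByKey(lambda a, b: a + b))

print(by_category.take(10))
```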
Environment: Hadoop 2.0, Spark 1.5 (Core, SQL, Streaming), Java 1.7, Scala 2.11, MySQL, Hive 0.13, CDH 5, Flume 1.5, Zookeeper 3.4.5, Kafka 2.9, Eclipse
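And a sketch of the Kafka-to-HDFS DStream flow from the last two bullets, using the receiver-based KafkaUtils.createStream API that shipped with Spark 1.5 (the ZooKeeper quorum, consumer group, topic name, and output path are hypothetical):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-dstream-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

# Receiver-based Kafka consumer (Spark 1.x); records arrive as (key, message) pairs.
stream = KafkaUtils.createStream(ssc, "zk-host:2181", "consumer-group", {"events": 1})

# Validate the incoming DStream, then derive a new DStream from it.
valid = stream.map(lambda kv: kv[1]).filter(lambda msg: msg and "," in msg)

# Persist each micro-batch to HDFS.
valid.saveAsTextFiles("hdfs:///data/streams/events")

ssc.start()
ssc.awaitTermination()
```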
Confidential, Plano, TX
Integration Engineer
Responsibilities:
- Collected data from four data sources, including Confidential probes, EBM, BS, and Intel CPE.
- Maintained health checks for all five layers of EEA.
- Generated health check reports for all data sources.
- Reworked the hard-coded values in the shell scripts.
- Wrote shell scripts and SQL, and created test data.
- Troubleshot script issues and performed design and code reviews.
- Implemented data integrity and data quality checks using shell scripts (see the sketch after this list).
- Worked with Hadoop ecosystem components such as HDFS, Hive, Pig, HBase, and ZooKeeper, plus shell scripting.
- Experience working with Hadoop/Big Data storage and analytics tooling, accessed with tools such as SSH and PuTTY.
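The integrity and quality checks above were implemented in shell; as an illustration, here is a minimal equivalent in Python (the pipe-delimited layout, column count, and null-rate threshold are hypothetical):

```python
import csv
import sys

EXPECTED_COLUMNS = 5   # hypothetical schema width
MAX_NULL_RATE = 0.05   # fail a column if more than 5% of its values are empty

def check_file(path):
    rows, bad_width = 0, 0
    null_counts = [0] * EXPECTED_COLUMNS
    with open(path, newline="") as fh:
        for record in csv.reader(fh, delimiter="|"):
            rows += 1
            if len(record) != EXPECTED_COLUMNS:
                bad_width += 1
                continue
            for i, field in enumerate(record):
                if not field.strip():
                    null_counts[i] += 1
    assert rows > 0, "empty file"
    # Integrity check: every record has the expected width.
    assert bad_width == 0, f"{bad_width}/{rows} records have a wrong column count"
    # Quality check: no column exceeds the allowed null rate.
    for i, n in enumerate(null_counts):
        assert n / rows <= MAX_NULL_RATE, f"column {i} null rate is {n / rows:.1%}"
    print(f"{path}: {rows} records passed")

if __name__ == "__main__":
    check_file(sys.argv[1])
```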
Environment: Hadoop 2.0, Flume 1.7, Zookeeper 3.4.11, PuTTY 0.7, WinSCP 5.11.1
Confidential, Dallas, TX
Data Programmer
Responsibilities:
- Building Predictive Models for University Rankings using R
- Established predictive models, association rules, and cluster analyses based on real-world university rankings, and delivered the important features along with prediction results.
- Predictive models: random forest, artificial neural network, and gradient-boosted decision trees.
- Using Random Forest and Logistic Regression to Predict Student Romance in Python
- Built random forest and logistic regression models on a multivariate dataset of students' romance status (yes/no) and delivered the important variables along with prediction results on test data (see the sketch after this list).
- ANOVA and SLR Analysis of Dementia Data using SAS
- Established a linear model based on a multivariate dataset of clinical patients (non-demented, converted, demented) profiled by OASIS (Open Access Series of Imaging Studies) using SAS, and provided potential clues for clinical prediction.
- Methods: Kruskal-Wallis, rank-sum test, post-hoc comparisons, planned comparisons, linear regression
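A minimal scikit-learn sketch of the random forest / logistic regression comparison from the student-romance project (the students.csv file, the "romantic" label column, and the assumption that all features are numeric are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical multivariate dataset with numeric features and a binary label.
df = pd.read_csv("students.csv")
X = df.drop(columns=["romantic"])
y = df["romantic"]  # "yes" / "no"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
lr = LogisticRegression(max_iter=1000)
for model in (rf, lr):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, f"test accuracy: {acc:.3f}")

# Important variables, as reported from the random forest.
ranked = sorted(zip(X.columns, rf.feature_importances_), key=lambda t: -t[1])
for name, importance in ranked[:5]:
    print(name, round(importance, 3))
```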
Confidential, Dallas, TX
Data Programmer
Responsibilities:
- Modeled with machine learning algorithms: supervised learning (logistic regression, random forest, gradient boosting, neural networks) and unsupervised learning (clustering).
- Developed code in R, Python, SAS, SPSS, and Excel.
- Used the scikit-learn, pandas, NumPy, Matplotlib, and Seaborn Python libraries throughout the development life cycle (see the sketch after this list).
- Developed tools to support strategic business decision-making and forecasting.
- Worked with large datasets using advanced SQL-based data analysis.
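A small pandas/NumPy sketch of the kind of data preparation and visualization implied above (the sales.csv file and its order_date, region, and revenue columns are hypothetical):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical sales dataset.
df = pd.read_csv("sales.csv", parse_dates=["order_date"])

# Basic cleaning: drop duplicates, fill numeric gaps with column medians.
df = df.drop_duplicates()
numeric = df.select_dtypes(include=[np.number]).columns
df[numeric] = df[numeric].fillna(df[numeric].median())

# Aggregate for reporting: monthly revenue per region.
monthly = (df.assign(month=df["order_date"].dt.to_period("M"))
             .groupby(["month", "region"])["revenue"].sum()
             .reset_index())
print(monthly.head())

# Quick visual check of the revenue distribution.
sns.histplot(df["revenue"], bins=30)
plt.title("Revenue distribution")
plt.savefig("revenue_hist.png")
```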
Environment: RStudio 0.99, Python 3.5, SAS 9
Confidential
Data Engineer
Responsibilities:
- Used Pig as an ETL tool for transformations, joins, and pre-aggregations before storing data in HDFS.
- Architected Big Data analytics solutions across multiple platforms.
- Developed complex SQL queries, stored procedures, and functions.
- Installed, configured, and updated Linux machines running Ubuntu and CentOS; deployed and installed applications.
- Ingested streaming data into Hadoop clusters using Flume and Kafka.
- Built predictive models using logistic regression, random forest, multilayer perceptron, and SVM (see the sketch after this list).
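A minimal scikit-learn sketch covering the SVM and multilayer perceptron models from the last bullet (the customers.csv file and "churn" target are hypothetical; scaling is included because both models are sensitive to feature scale):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical dataset with numeric features and a binary "churn" target.
df = pd.read_csv("customers.csv")
X, y = df.drop(columns=["churn"]), df["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize features inside a pipeline, then fit and score each model.
for clf in (SVC(kernel="rbf", C=1.0),
            MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)):
    pipe = make_pipeline(StandardScaler(), clf)
    pipe.fit(X_train, y_train)
    print(type(clf).__name__, "test accuracy:", round(pipe.score(X_test, y_test), 3))
```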
Environment: Python 3.3, Jupyter Notebook 4, Hadoop 2.0, Spark 1.0, Java 1.7, Scala 2.11, MySQL, CDH 3, Hive 0.12, Flume 1.5, Zookeeper 3.4.5, Kafka 2.9, Eclipse
Confidential
Data Analyst
Responsibilities:
- Consulted with the department director on the use and refinement of fundraising solutions for non-profit groups via a SAS macro-based finance model.
- Made good use of statistical procedures including PROC CONTENTS, PROC FREQ, PROC MEANS, PROC REPORT, PROC GPLOT, PROC BOXPLOT, PROC CORR, PROC GLM, PROC ANOVA, PROC LOGISTIC, and other SAS/STAT and SAS/GRAPH procedures.
- Experienced in producing RTF, HTML, and PDF files using SAS ODS; well versed in creating HTML reports for financial data with the SAS ODS facility.
Environment: SAS 9, SPSS 19
