Big Data Engineer Resume

Malvern, PA

SUMMARY:

  • Extensive experience in the IT industry as a Big Data Developer.
  • Worked in the electrical, e-commerce and education industries.
  • Experienced in data modeling, programming, data mining, large-scale data acquisition, and transformation and cleaning of structured and unstructured data.
  • Extensive experience with the Big Data ecosystem, including Hadoop, HDFS, YARN, MapReduce, Spark, Hive, HBase, Sqoop, Flume and Oozie.
  • Worked on several Hadoop distribution platforms, including Cloudera CDH and Hortonworks HDP.
  • Proficient in programming languages such as Python, Scala, R and SQL.
  • Experienced with real-time data processing tools in the Big Data ecosystem, such as Apache Kafka and Spark Streaming.
  • Experienced in using the Spark Scala API and Spark Python API to transform, process and analyze data in different formats and structures.
  • Experienced in writing HiveQL and developing Hive UDFs in Scala to process and analyze data.
  • Implemented HBase for backup data storage, used Hue for Hive queries, and created date-based partitions in Hive to improve performance.
  • Transformed data in formats such as SequenceFile, Avro and Parquet.
  • Adept at using Sqoop to migrate data between RDBMS, NoSQL and HDFS.
  • Worked with RDBMS including MySQL, Oracle SQL and Netezza.
  • Worked with machine learning libraries in Python, including Scikit-learn, SciPy, NumPy, Pandas and NLTK.
  • Built models by implementing Machine Learning algorithms including Linear Regression, Logistic Regression, SVM, Decision Tree, Random Forest and K-means.
  • Performed data visualization with Matplotlib, R Shiny, ggplot2 and Tableau.
  • Familiar with Windows & Linux operating systems for development.
  • Designed and created Hive tables and applied performance optimizations such as partitioning and bucketing in Hive.
  • Proven communication and leadership skills to work with professionals at different levels.

TECHNICAL SKILLS:

Services: EC2, S3, EMR

Programming Language: Scala 2.12.0, Python 2.0, R, SAS, Java

Hadoop/Spark Ecosystem: Hadoop 2.7.3, Spark 1.6.3, MapReduce, 2.1.1, Sqoop 1.99.7, Nifi 1.1.2, Flume 1.5.2, Kafka 1.0.0, Storm 1.1.0, Oozie 4.2.0, Zookeeper 3.4.6, YARN 2.7.3

Machine Learning: Regression, Neural Network, K-Means

IDE Application: Eclipse 4.6, Visual Studio 2016, COMSOL Multiphysics 5.2, Rational Rose 7.0, Notepad++ 7.3.2

Database: MySQL 5.x, Oracle 12c, HBase 1.3.0, Impala 2.7.0, Cassandra 3.10, PL/SQL 11, 7.2.x

PROFESSIONAL EXPERIENCE:

Confidential, Malvern, PA

Big Data Engineer

Responsibilities:

  • Implemented cost-efficient data movement with Hive and PySpark, collecting data from source tables into destination tables on an AWS cluster.
  • Ran scalable batch data processing with shell scripts.
  • Scheduled workflow with Oozie.
  • Migrated MapReduce jobs and Hive queries to Spark transformations and actions to improve performance.
  • Developed PySpark scripts to improve program performance.

Environment: AWS, S3, EMR, EC2, Hadoop, PySpark, Spark SQL, Hive, MapReduce, Jupyter Notebook, Hue

Confidential, Piscataway, NJ

Big Data Developer

Responsibilities:

  • Used Sqoop and Flume to monitor and collect batch data and log data from incoming requests and load it into HDFS.
  • Wrote UDFs with Spark SQL and Spark Core for ETL processes, including data processing and storage; transformed processed data into tables and stored them in Hive for scalable storage and fast queries.
  • Focused on the OMS and WMS components: extracted and processed data from customer-submitted orders, sent order information to the TMS and returned transportation information to the OMS, scheduled staff resources from the ERP (enterprise resource planning) system, and passed data to the WMS to schedule the storage position of each order.
  • Designed and created Hive tables and applied performance optimizations such as partitioning and bucketing in Hive.
  • Implemented custom Hive UDFs and analyzed large data sets with HiveQL to achieve comprehensive data analysis.
  • Migrated MapReduce jobs and Hive queries to Spark transformations and actions to improve performance.
  • Used Sqoop to import and export data between a MySQL database and HDFS.
  • Converted raw data into serialized data formats such as Avro and Parquet to reduce data processing time and improve data transfer efficiency over the network.
  • Involved in application performance tuning and troubleshooting.

Environment: Cloudera CDH, HDFS, Python, Scala, Spark, Hive, Play Framework, HBase, Docker, Oracle, MySQL

Confidential, Baton Rouge, LA

Education Consultant Analyst

Responsibilities:

  • Imported data with Sqoop and uploaded it to HDFS; processed batch data with Spark SQL to check the test scores of students from Louisiana.
  • Wrote Spark SQL scripts in Scala for testing and transformation of data.
  • Wrote HiveQL, User Defined Functions (UDFs) and MapReduce jobs in Hive for data processing and analysis.
  • Built an ETL data pipeline with SQL queries.
  • Built machine learning models to analyze impacts on students’ performance.
  • Explored relationships among various indicators using logistic regression and t-tests to draft a research report on students’ performance behavior.
  • Connected Hive tables to Tableau and performed data visualization for reports.
  • Performed unit testing for Spark with pytest.

Environment: Hadoop, Cloudera CDH, HDFS, Python, Scala, Spark, Hive, SAS, Sqoop, Zookeeper, Oozie, ScalaCheck, pytest

Confidential, Bowling Green, OH

Data Analyst

Responsibilities:

  • Loaded and transformed large sets of structured and semi-structured data.
  • Created Hive tables, analyzed data with Hive queries, and wrote Hive UDFs.
  • Responsible for implementing various modules of the application using Spring MVC architecture.
  • Used partitioning and bucketing when creating Hive tables for performance optimization.
  • Migrated data between RDBMS and HDFS/Hive with Sqoop.
  • Defined job flows and wrote simple to complex MapReduce jobs.
  • Used Zookeeper for cluster coordination services.
  • Tuned performance and troubleshot MapReduce jobs by analyzing and reviewing Hadoop log files.
  • Created Python scripts for administration, maintenance and troubleshooting.
  • Participated in requirements gathering, design, development and testing.
  • Documented system processes and procedures for future reference.

Environment: Hadoop 2.5, MapReduce, HDFS, Spark 1.6, Hive 0.14, Sqoop 1.4.2, Flume 1.6.0, ETL, Zookeeper 3.4, Python

Confidential

Data Engineer

Responsibilities:

  • Participated in ETL processes, including data processing and storage.
  • Used Spark with Python for batch data processing and stored the output in Hive for scalable storage and fast queries.
  • Created a data lake by extracting customer data from various data sources into HDFS, including Teradata, mainframes, CSV and Excel.
  • Designed and created Hive tables and applied performance optimizations such as partitioning and bucketing in Hive.
  • Implemented custom Hive UDFs and analyzed large data sets with HiveQL to achieve comprehensive data analysis, including data transformation, cleansing and filtering.
  • Conducted clustering analysis using SQL to obtain customer segmentation and performed ELT and exploratory data analysis; worked closely with the sales team on new customer acquisition.
  • Wrote MapReduce jobs and User Defined Functions (UDF) in Hive for data aggregation.
  • Performed real-time queries using Apache Impala.
  • Worked with the data science team to build probability of default (PD) models on historical data using the Spark DataFrame API and ML library API.
  • Used Spark Streaming to consume real-time order transaction data from Kafka, processed it, and checked for recommendations using the Spark ML library with the deployed model.
  • Stored streaming data in HBase.
  • Actively participated in and provided constructive feedback during daily stand-up meetings and weekly iteration review meetings.

Environment: Hadoop, HDFS, Python, Spark, Hive, Impala, HBase, Kafka, Zookeeper, Oozie, Oracle, JUnit, MRUnit, Git, JIRA