Big Data Developer Resume

Dallas, TX


  • Extensive experience in the IT industry as a Big Data Developer.
  • Worked in the electrical and e-commerce industries.
  • Experienced in data modeling, programming, data mining, large-scale data acquisition, and transformation and cleaning of structured and unstructured data.
  • Extensive experience with the Big Data ecosystem, including Hadoop, HDFS, YARN, MapReduce, Spark, Hive, Impala, HBase, Sqoop, Flume, Kafka, Oozie and Zookeeper.
  • Worked on several Hadoop distribution platforms, including Cloudera CDH and Hortonworks HDP.
  • Proficiency in programming languages, such as Java, Python, Scala, R and SQL.
  • Experienced with real-time data processing tools in the Big Data ecosystem, such as Apache Kafka and Spark Streaming.
  • Experienced in using the Spark Scala and Python APIs to transfer, process and analyze data in different formats and structures.
  • Experienced in writing HiveQL and developing Hive UDFs in Java to process and analyze data.
  • Implemented HBase for backup data storage, used Hue for Hive queries, and created daily partitions in Hive to improve performance.
  • Conducted transformation of data in formats like Sequence File, Avro and Parquet.
  • Adept at using Sqoop to migrate data between RDBMS, NoSQL and HDFS.
  • Worked with RDBMS including MySQL, Oracle SQL and Netezza.
  • Worked with Machine Learning libraries including Scikit-learn, SciPy, NumPy, Pandas and NLTK in Python.
  • Built models by implementing Machine Learning algorithms including Linear Regression, Logistic Regression, SVM, Decision Tree, Random Forest and K-means.
  • Performed data visualization with Matplotlib, R Shiny, ggplot2 and Tableau.
  • Familiar with Windows & Linux operating systems for development.
  • Designed and created Hive tables and worked on performance optimizations such as partitioning and bucketing in Hive.
  • Proven communication and leadership skills to work with professionals at different levels.


AWS Cloud: EC2, Lambda, S3, Glacier, EMR, CloudFront

Programming Languages: Scala 2.12.0, Python 2.0, R, SAS, Java

Hadoop/Spark Ecosystem: Hadoop 2.7.3, Spark 1.6.3, MapReduce, 2.1.1, Sqoop 1.99.7, NiFi 1.1.2, Flume 1.5.2, Kafka 1.0.0, Storm 1.1.0, Oozie 4.2.0, Zookeeper 3.4.6, YARN 2.7.3

Machine Learning: Regression, Neural Network, K-Means

IDE/Application: Eclipse 4.6, Visual Studio 2016, COMSOL Multiphysics 5.2, Rational Rose 7.0, Notepad++ 7.3.2

Database: MySQL 5.x, Oracle 12c, HBase 1.3.0, Impala 2.7.0, Cassandra 3.10, PL/SQL 11, 7.2.x, DB2 11.1, MongoDB 3.2


Confidential, Dallas, TX

Big Data Developer


  • Used Sqoop to monitor and collect real-time log data from requests sent to the server, and sank the log data into a Kafka message queue.
  • Implemented Spark Streaming for real-time processing of log data from the message queue, applying both stateless and stateful transformations to different kinds of log data.
  • Wrote UDFs using Spark SQL and Spark Core for ETL processes, including data processing and storage; transformed processed data into tables and stored them in Hive for scalable storage and fast queries.
  • Focused on the OMS and WMS components: extracted and processed data from customer-submitted orders, sent order information to the TMS and returned transportation information to the OMS, scheduled personnel resources from the ERP (enterprise resource planning) system, and transferred data to the WMS to schedule the storage position of each order.
  • Designed and created Hive tables and worked on performance optimizations such as partitioning and bucketing in Hive.
  • Implemented custom Hive UDFs and analyzed large data sets by running HiveQL to achieve comprehensive data analysis.
  • Migrated MapReduce jobs and Hive queries to Spark transformations and actions to improve performance.
  • Utilized Sqoop to import and export data between MySQL and HDFS.
  • Converted raw data to serialization formats such as Avro and Parquet to reduce data processing time and improve network transfer efficiency.
  • Participated in application performance tuning and troubleshooting.

Environment: Cloudera CDH, HDFS, Java, Python, Scala, Spark, Hive, GCP, Elasticsearch, Play Framework, HTML, HBase, MongoDB, Docker, Oracle, MySQL, PostgreSQL

Confidential, Baton Rouge, LA

Consultant Analyst


  • Imported data with Sqoop and uploaded it to HDFS, then processed batch data with Spark SQL to examine the test scores of Louisiana students.
  • Wrote Spark SQL script in Scala for testing and transformation of data.
  • Wrote HiveQL, User Defined Functions (UDFs) and MapReduce jobs in Hive for data processing and analysis.
  • Built machine learning models to analyze the factors affecting students’ performance.
  • Explored relationships among various indicators using logistic regression and t-tests to draft a research report on students’ performance behavior.
  • Connected Hive tables with Tableau and performed data visualization for reports.
  • Performed unit testing for Spark and Spark Streaming with Pytest and ScalaCheck.

Environment: Hadoop, Cloudera CDH, HDFS, Python, Scala, Spark, Hive, HBase, Kafka, Sqoop, Zookeeper, Oozie, Docker, D3.js, Tableau, Oracle, Netezza, ScalaCheck, Pytest

Confidential, Bowling Green, OH

Hadoop Developer/Analyst


  • Worked on Cloudera Hadoop ecosystem with Agile Development methodology.
  • Used Flume and Kafka to transform, enrich and stream transactions to different locations.
  • Used Sqoop to transfer structured and unstructured data from different sources.
  • Performed data transformation, extraction and filtering using HiveQL/SQL and Python.
  • Used Hive and Impala for batch reporting and managed long-term HDFS storage.
  • Scheduled workflow with Oozie.
  • Worked with analytic teams to visualize tables in Tableau for reporting.
  • Used JUnit and ScalaCheck for unit testing.
  • Used Git for collaboration and version control.

Environment: Java 7, CDH 5.3, Hadoop 2.3, Unix Shell, Python, Flume 1.4, Sqoop 1.4.5, Kafka 0.8.2, Hive 0.13, HBase 0.98, Impala, Solr 4.4, Tableau, JUnit, Oozie, MapReduce, MRUnit, Git

Confidential, Bowling Green, OH

Data Analyst


  • Loaded and transformed large sets of structured and semi-structured data.
  • Created Hive tables, analyzed data with Hive queries, and wrote Hive UDFs.
  • Responsible for implementing various modules of the application using Spring MVC architecture.
  • Used partitioning and bucketing when creating Hive tables for performance optimization.
  • Migrated data between RDBMS and HDFS/Hive with Sqoop.
  • Defined job flows and wrote simple to complex MapReduce jobs.
  • Provided cluster coordination services through Zookeeper.
  • Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files.
  • Created scripts using Python for administration, maintenance and troubleshooting.
  • Participated in requirements gathering, design, development and testing.
  • Documented system processes and procedures for future reference.

Environment: Hadoop 2.5, MapReduce, HDFS, Spark 1.6, Hive 0.14, Pig 0.15, Sqoop 1.4.2, NGINX PLUS R7, Flume 1.6.0, ETL, Zookeeper 3.4, Java, JUnit 4.8, Python


Data Engineer


  • Participated in ETL processes including data processing and storage.
  • Used Spark with Scala for batch data processing and stored the output in HBase for scalable storage and fast queries.
  • Created a data lake by extracting customer data from various data sources into HDFS, including Teradata, mainframes, CSV and Excel.
  • Designed and created Hive tables and worked on performance optimizations such as partitioning and bucketing in Hive.
  • Implemented Hive custom UDFs and analyzed large data sets by running HiveQL to achieve comprehensive data analysis including data transformation, cleansing and filtering.
  • Conducted clustering analysis using SQL to obtain customer segmentation, and performed ELT and exploratory data analysis; worked closely with the sales team on new customer acquisition.
  • Wrote MapReduce jobs and User Defined Functions (UDF) in Hive for data aggregation.
  • Performed real-time queries using Apache Impala.
  • Worked with the data science team to build probability of default (PD) models on historical data using the Spark DataFrame API and ML library.
  • Delivered real-time credit card transaction data from multiple sources into the Kafka messaging system.
  • Used Spark Streaming to consume real-time order transaction data from Kafka, processed it, and checked for recommendations using the Spark ML library with the deployed model.
  • Stored streaming data into HBase.
  • Actively participated in and provided constructive feedback during daily stand-up meetings and weekly iteration review meetings.

Environment: Hadoop, Cloudera CDH, HDFS, Java, Python, Scala, Spark, Hive, Impala, HBase, Kafka, Talend, Zookeeper, Oozie, Oracle, JUnit, MRUnit, Git, JIRA