Big Data Engineer Resume


  • Big Data Developer and Data Analyst with extensive working experience across multiple industries including technology services, hospitality, and manufacturing.
  • Solid understanding of Hadoop Ecosystem components such as HDFS, YARN, MapReduce, Apache Drill, ZooKeeper, Spark, Kafka, HBase, Hive, Flume, Oozie, and Impala.
  • Solid programming skills in both Object-Oriented and Functional languages such as Python, Java, Scala, and R.
  • Knowledge of Amazon Web Services (AWS) components such as S3, RDS, Redshift, DynamoDB, Lambda, Glue, etc.
  • Well-versed in working with traditional RDBMSs including Microsoft SQL Server, MySQL, Oracle, and PostgreSQL.
  • Hands-on experience with NoSQL data platforms such as MongoDB and HBase.
  • Extensive experience in converting Java MapReduce programs into Spark Core and Spark DataFrame/Dataset code to boost performance.
  • Strong knowledge of collecting, processing and aggregating large amounts of real-time streaming data using Kafka and Spark Streaming.
  • Proficient in writing distributed and scalable code using Spark components including Spark Core, Spark SQL, and Spark MLlib.
  • Competent in building scalable and reliable data pipelines using tools such as Spark, Kafka, Flume, HDFS, etc.
  • Understanding of different Algorithms & Machine Learning techniques (Supervised & Unsupervised) and their application in batch/real-time.
  • Working experience in creating concise and interactive reports using Tableau, Jupyter Notebook, R Shiny, etc.
  • Excellent technical, communication, analytical, and problem-solving skills for working with people cross-functionally.


Hadoop/Spark Ecosystem: Hadoop 2.x, Spark 2.x, Spark 3.x, Hive 2.0, Sqoop 1.4.x, Flume 1.7.0, Kafka 2.x, YARN 1.x, Mesos 1.x, ZooKeeper 3.4.x

Databases: HBase 1.x, Cassandra 3.11.x, MySQL 5.x, MongoDB 4.0.x, Microsoft SQL Server 14.0, Oracle Database 11g, Snowflake 3.x

Programming Languages: Java 8, Scala 2.13.x, Python 3.0, R 3.6.x

Operating Systems: Linux, Mac OS, Windows

Cloud Platform: Amazon Web Services EC2/EMR/S3, DynamoDB, Google Cloud Platform, Cloudera (CDH), Hortonworks (HDP)

Tools and Methodologies: Git/GitHub, Agile/Scrum, IntelliJ IDEA, Eclipse

Python Libraries: Pandas, NumPy, Scikit-Learn, Selenium, Beautiful Soup, Matplotlib

R Packages: tidyverse, e1071, caret, ggplot2

Machine Learning: Naïve Bayes, Linear and Logistic Regression, PCA, SVM, KNN, Time-series, Decision Trees, Random Forest, Bagging, Boosting, Clustering



Big Data Engineer


  • Collaborated with the Business Intelligence and Engineering team in understanding the problems and architecture of the current monitoring system
  • Increased Spark Streaming computing power from 15 w/s to 50 w/s through both performance tuning and development tuning
  • Maximized the use of cluster resources by increasing driver-memory and executor-memory, adjusting the number of CPU cores, and using Kryo serialization
  • Reduced processing time from the original 10 seconds to 0.5 seconds by finding the optimal number of partitions within Spark
  • Increased efficiency by developing Scala code to batch-put data into HBase and return results to Kafka
  • Scaled Kafka clusters to improve fault tolerance by reassigning ZooKeeper services
  • Decreased operation and maintenance costs by modifying Scala code of a listener interface for receiving information such as scheduling delay and processing time

Environment: Apache Flume 1.8.0, Kafka 2.3.0, ZooKeeper 3.5.5, Spark 2.4.3, HBase 2.2.0, MySQL, Scala 2.12.0


Big Data Developer


  • Communicated with cross-functional stakeholders to understand the user scenarios and requirements for specifying the information and functions view of the architecture
  • Built the architecture of a Big Data Storage and Analytics Platform
  • Deployed NiFi clusters to ingest, transform, and deliver data to analytics backends, handling data mediation for both real-time and batch jobs
  • Built the core components of the Cross-Sectorial Data Lab platform based on the standard Hadoop infrastructure, which consists of HDFS and YARN
  • Adopted Spark for batch, stream processing, and SQL data analysis
  • Combined Deeplearning4j (DL4J), an open-source deep-learning library, and Keras, a high-level neural networks API, for parallel GPU/CPU computation
  • Integrated platforms using the Kafka messaging system for real-time communication and native HDFS web service interface for batch data updates
  • Secured lab platform by using Apache Knox Gateway to extend access to cross-sectorial Data Lab platforms while maintaining compliance with enterprise security policies
  • Supported the development of the predictive function with Apache Zeppelin as a basis for the development tools component
  • Monitored, configured, and managed the clusters using Apache Ambari, the HDFS NameNode, and the YARN ResourceManager

Environment: Apache NiFi 1.8.0, Hadoop 2.8.x, HDFS 2.8.x, YARN 1.3.2, Spark 2.1.0, Kafka 1.0.1, Apache Knox 0.14.0, Apache Zeppelin 0.7.3


Data Engineer


  • Worked with business analysts and engineers to translate business requirements into an updated architecture that balances resources and performance
  • Improved query time from the original 20-30 seconds to less than 5 seconds by building secondary indexes that map each TransactionID to one or more specific Elasticsearch indexes, which are then queried for detailed information
  • Further optimized the secondary index by separating hot and cold data and storing the current day's secondary index in Redis
  • Developed Spark code to build a Kafka-based ETL pipeline, including extracting data into Kafka, pulling data from Kafka topics, transforming data in KStream objects, and loading data to Amazon S3

Environment: Spark 2.1.x, Kafka 1.0.x, Elasticsearch 1.6.x, Redis 4.0, KStream, Amazon S3


MRT Retail Data Analyst


  • Developed statistical models (regression models and tree models) in R based on the past five years of retail data to forecast sales for the next season and to determine the most valuable tire models for promotion to maximize ROI
  • Applied data wrangling techniques and data visualization tools using Python packages (Pandas, NumPy, and Seaborn) on historical data (~100 GB) to visualize the seasonality and trends of sales at the national level
  • Wrote ETL scripts for profiling, cleaning, and aggregating data using MySQL
  • Designed and implemented Python code with the engineering team to automate data preprocessing, modeling, and testing process including standardizing different data types and formats
  • Performed ad-hoc analyses of sales data to identify demand trends and to shape strategic responses using Tableau

Environment: Python 3, R, Tableau, MySQL, Microsoft Excel, Microsoft PowerPoint
