
Big Data Engineer Resume

Kansas City


  • Extensive IT experience across a variety of industries and stages of software development, with hands-on exposure to the Cloudera 4 and Hortonworks 2.6 data distributions and Hadoop environments including Hadoop 2.8.3, Hive 1.2.2, Sqoop 1.4.7, MapReduce V1 & V2, HBase 2.0, Apache Spark 1.6 & 2.x, Apache Kafka 1.3.2 and Impala 2.1.0.
  • Hands-on experience with Hadoop ecosystem components such as MapReduce, Spark, Hive, Impala, HDFS, YARN, Apache HBase, Sqoop, Hue, Apache Kafka, Apache Storm, Apache NiFi and Spark SQL.
  • Extensive experience in developing MapReduce Jobs using Java and Maven as well as thorough understanding of MapReduce infrastructure framework.
  • Hands-on experience with Spark SQL, DataFrame APIs and RDDs: importing data from the Hadoop Distributed File System, performing transformations and saving the results back to HDFS.
  • Experience accessing metastore tables, performing analysis and applying wide transformations on top of Hive data tables using HiveQL/SQL with the help of SparkContext and HiveContext.
  • In-depth work experience with various file formats such as XML, JSON, Avro, Parquet, Sequence files, tab-delimited files, text files and ORC files, and with compression techniques such as Snappy, gzip, Deflate, bzip2, LZ4, LZO and Zstandard.
  • In-depth knowledge of the Big Data Architecture along with its various versions and components of Hadoop 1.X and 2.X such as HDFS, Job Tracker, Task Tracker, Data Node, Name Node and YARN, Resource Manager, Node Manager.
  • Capable of processing huge amounts of structured, semi-structured and unstructured data.
  • Strong experience in Extraction, Transformation and Loading (ETL) data from various sources into Data Warehouse and into Azure Data Lake using Talend, SSIS.
  • Experience with ETL workflow management tools such as Apache Airflow, with significant experience writing Python scripts to implement workflows.
  • Developed Shell and Python scripts to automate and provide control flow to Pig scripts; imported data from the Linux file system to HDFS.
  • Solid ability to query and optimize diverse SQL databases such as MySQL, Oracle and Postgres, as well as NoSQL databases such as Apache HBase and Cassandra.
  • Developed and customized a system to collect data from multiple portals using Kafka and process it with Spark; installed Kafka Manager to track consumer lag, monitor Kafka metrics, and add topics and partitions.
  • Optimized downstream data processing using AWS services such as S3, Kinesis Data Streams, Kinesis Data Firehose and Amazon EMR.
  • Worked on Build Management tools like SBT, Maven and Version control tools like Git.
  • Experience with Data analyzing and reporting tools like Tableau, Power BI and SSRS.
  • Used integrated development environments such as Eclipse, IntelliJ, PyCharm, Jupyter, Spyder, EditPlus and Notepad++ for application development.
  • Work experience with various Software Development Life Cycle (SDLC) methodologies like Agile, Waterfall model.
  • Effective team player with strong communication and interpersonal skills, and a strong ability to adapt to and learn new technologies and business lines rapidly.
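
As a quick illustration of the compression codecs listed above, here is a minimal, stdlib-only Python sketch comparing a few of them on a hypothetical sample payload (Snappy, LZ4, LZO and Zstandard need third-party packages, so only the stdlib-backed codecs are shown):

```python
import bz2
import gzip
import zlib

# Hypothetical sample payload standing in for a text-format HDFS block.
raw = b"timestamp,device_id,status\n" + b"2018-01-01T00:00:00,dev-42,OK\n" * 500

# Compare compressed sizes; repetitive text compresses heavily under all
# of these codecs, which is why columnar formats pair well with them.
sizes = {
    "raw": len(raw),
    "gzip": len(gzip.compress(raw)),
    "bzip2": len(bz2.compress(raw)),
    "deflate": len(zlib.compress(raw)),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```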


Hadoop Eco System: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase

Cloud Technologies: Azure, Kubernetes, AWS (S3, EC2, EMR, Kinesis, Firehose)

Programming Languages: Java, Python 3, Scala 2.12.8, PySpark, C, C++

Cluster Mgmt & Monitoring: CDH 4, CDH 5, Hortonworks Ambari 2.5

Databases: MySQL, SQL Server, Oracle, MS Access

NoSQL Databases: MongoDB, Cassandra, HBase, KairosDB

Workflow Mgmt Tools: Oozie, Apache Airflow

Visualization & ETL Tools: Tableau, BananaUI, D3.js, Informatica, Talend

IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ

Operating Systems: Unix, Linux, Windows

Version Control Systems: Git, SVN



Big Data Engineer


  • Ingested data into HDFS from different sources such as networks, relational databases and endpoints using the automated ingestion tool Apache NiFi.
  • Processed huge amounts of structured and unstructured (web logs, XML) customer data collected from various network devices and systems using Apache Hadoop.
  • Loaded data into the Hadoop clusters using the Kafka HDFS connector and integrated it with Hive.
  • Applied transformation techniques with Spark on top of HDFS data to find correlations between data from various sources.
  • Partitioned and bucketed the data in Hive to draw conclusions from the correlated data.
  • Used the Spark DataFrame API with HiveContext for fast analysis and handed the data over to the machine learning analytics team based on requirements.
  • Analyzed the data present in HDFS by importing it into Spark using HiveContext.
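
The partition-and-bucket layout mentioned above can be sketched in plain Python (hypothetical columns and bucket count; Hive uses its own hash function, modulo is a stand-in here):

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count

records = [
    {"dt": "2018-01-01", "user_id": 101, "bytes": 5120},
    {"dt": "2018-01-01", "user_id": 102, "bytes": 880},
    {"dt": "2018-01-02", "user_id": 101, "bytes": 2048},
    {"dt": "2018-01-02", "user_id": 204, "bytes": 96},
]

# Partition by date (one directory per dt value), then bucket each
# partition by hashing user_id modulo the bucket count, mirroring
# PARTITIONED BY (dt) ... CLUSTERED BY (user_id) INTO 4 BUCKETS.
layout = defaultdict(lambda: defaultdict(list))
for rec in records:
    bucket = rec["user_id"] % NUM_BUCKETS
    layout[rec["dt"]][bucket].append(rec)
```

Because the same key always hashes to the same bucket, joins and sampling on `user_id` only need to touch matching buckets, which is the point of bucketing.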

Environment: MySQL 5.X, Hadoop 2, Spark 1.6, Hive 2.X, MapReduce, HDFS, Python 3, Scala 2.12, Kafka

Confidential - Kansas City, MO

Hadoop/ Spark Developer


  • Designed and implemented ETL pipelines from various relational databases to the data warehouse using Apache Airflow.
  • Developed Python scripts to automate the ETL process using Apache Airflow, as well as cron jobs on Unix.
  • Used Apache NiFi to ingest real-time OLTP data from disparate sources into HDFS and Kafka.
  • Involved in moving huge amounts of generated structured and unstructured data and log data from various sources to Hadoop clusters for further processing.
  • Used Spark Streaming to query data periodically from Kafka topics and load it into KairosDB.
  • Used Spark batch jobs to process data from HDFS, loading results into KairosDB and other information into HDFS.
  • Used Avro and Parquet file formats to serialize the data.
  • Used Expo UI to retrieve metrics from KairosDB and loaded data from KairosDB into Cassandra.
  • Developed Spark scripts in Scala to perform data transformations with DataFrames and ran Spark SQL against Cassandra using the Spark-Cassandra connector.
  • Developed scripts to sort, join, group and filter enterprise-wide data, and performed the same operations in Spark using Spark SQL to compare latencies.
  • Provided daily production support to monitor and troubleshoot Hadoop/Hive jobs.
  • Configured core-site.xml and mapred-site.xml for the multi-node cluster environment.
  • Implemented data integrity and data quality checks in Hadoop using Hive and Linux scripts.
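
Data quality checks like those in the last bullet boil down to simple rules; a minimal Python sketch with hypothetical rows and rule names:

```python
# Hypothetical rows as they might arrive from a staging table.
rows = [
    {"id": 1, "name": "alpha", "amount": 10.0},
    {"id": 2, "name": None, "amount": 7.5},
    {"id": 2, "name": "beta", "amount": 7.5},
]

def null_check(rows, column):
    """Return indices of rows where the column is missing or NULL."""
    return [i for i, r in enumerate(rows) if r.get(column) is None]

def duplicate_key_check(rows, key):
    """Return key values that appear more than once (integrity violation
    if the key is supposed to be unique)."""
    seen, dupes = set(), set()
    for r in rows:
        if r[key] in seen:
            dupes.add(r[key])
        seen.add(r[key])
    return sorted(dupes)

print(null_check(rows, "name"))         # row index 1 has a NULL name
print(duplicate_key_check(rows, "id"))  # id 2 appears twice
```

In Hive the same rules are typically expressed as `COUNT(*)` queries with `IS NULL` filters and `GROUP BY ... HAVING COUNT(*) > 1`.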

Environment: Spark 1.6, Python 3, Scala 2.11, Cassandra, Kafka 1.3, Spark SQL, DataFrame APIs, Hive, KairosDB, Expo UI, Datastax

Confidential - Kansas City

Big Data Developer


  • Ingested data from various IoT devices into Kafka clusters using the Kafka API.
  • Monitored the Kafka clusters and configured ZooKeeper in standalone mode.
  • Installed Kafka Manager to track consumer lag and monitor Kafka metrics; also used it for adding topics and partitions.
  • Successfully generated consumer group lags from Kafka using the Kafka API.
  • Configured Spark Streaming to load data from Kafka producers and implemented the streaming context using a SparkConf object.
  • Processed IoT data for vehicles within the radius of a defined point of interest (POI) using Spark Streaming.
  • Used an IoT data producer to generate IoT data in JSON format and wrote a custom Kafka serializer class to serialize IoT data objects.
  • Developed a custom deserializer class to deserialize IoT data JSON strings into IoT data objects.
  • Obtained a DStream of IoT data objects through map transformations and created key-value pairs in Spark.
  • Transferred streaming data from Spark Streaming to Cassandra as JSON.
  • Developed Spark code in Scala and Spark SQL for faster testing and processing of data, loading the data into Spark RDDs and doing in-memory computation to generate output with less memory usage.
  • Performed analysis on top of the data present in Cassandra using Apache Drill.
  • Used Datastax's Spark-Cassandra connector library to load data into the Cassandra database and its API to save DStreams and RDDs to Cassandra.
  • Created a Tableau dashboard based on Cassandra data and used the Spark ODBC driver to integrate Cassandra and Spark.
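
The custom serializer/deserializer pair described above can be sketched as a JSON round-trip in Python (the field names are hypothetical, since the original IoT schema is not shown; the real classes were written against the Kafka serde interfaces):

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class IoTData:
    # Hypothetical fields standing in for the real vehicle payload.
    vehicle_id: str
    latitude: float
    longitude: float
    speed: float

class IoTDataSerializer:
    """Turns an IoTData object into UTF-8 JSON bytes, the shape a
    Kafka value serializer hands to the producer."""
    def serialize(self, obj: IoTData) -> bytes:
        return json.dumps(asdict(obj)).encode("utf-8")

class IoTDataDeserializer:
    """Parses UTF-8 JSON bytes back into an IoTData object on the
    consumer side."""
    def deserialize(self, data: bytes) -> IoTData:
        return IoTData(**json.loads(data.decode("utf-8")))

event = IoTData("truck-17", 39.10, -94.58, 54.2)
payload = IoTDataSerializer().serialize(event)
restored = IoTDataDeserializer().deserialize(payload)
```

The round trip must be lossless, which is exactly what lets the consumer-side DStream operate on typed objects instead of raw strings.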

Environment: Kafka 1.3.2, Apache Spark 1.6, ZooKeeper 3.4.8, Scala 2.11, Maven, Tableau, Apache Drill, Spark SQL, Spark RDDs


Big Data Developer


  • Loaded data from MySQL Server into the Hadoop clusters using the data ingestion tool Sqoop.
  • Loaded real-time unstructured data such as XML data and log files into HDFS using Apache Flume.
  • Processed large amounts of structured and unstructured data using the MapReduce framework.
  • Designed solutions to perform ETL tasks such as data acquisition, data transformation, data cleaning and efficient data storage on HDFS.
  • Involved in creating Hive tables, loading them with data and writing Hive queries on top of data present in HDFS.
  • Worked on tuning the performance of Pig queries and developed Pig scripts for processing data.
  • Wrote Hive queries to transform the data into tabular format and processed the results using Hive Query Language.
  • Developed Spark code using Scala and Spark Streaming for faster testing and processing of data.
  • Stored the resultant processed data back into the Hadoop Distributed File System.
  • Applied machine learning algorithms (k-nearest neighbors, random forest) using Spark MLlib on top of HDFS data and compared the accuracy of the models.
  • Used Tableau to visualize the outcomes of the ML algorithms.
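
The map/shuffle/reduce phases behind the MapReduce work above can be sketched in pure Python (hypothetical log lines; the real jobs were Java, per the summary section):

```python
from collections import defaultdict

# Hypothetical web-log lines; the reduce key here is the status code.
logs = [
    "GET /index 200",
    "GET /cart 500",
    "POST /cart 200",
]

# Map phase: each mapper emits (key, 1) pairs.
mapped = []
for line in logs:
    status = line.split()[-1]
    mapped.append((status, 1))

# Shuffle phase: the framework groups values by key between
# mappers and reducers.
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce phase: each reducer aggregates the values for its key.
counts = {key: sum(values) for key, values in shuffled.items()}
```

Each phase is independently parallelizable, which is what lets the same three-step pattern scale across a cluster.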

Environment: Apache Sqoop, Apache Flume, Hadoop, MapReduce, Spark, Hive, Pig, Spark MLlib, Tableau


Python/SQL Developer


  • Worked on both the back end and front end of the application, primarily the back end.
  • Developed the user interface for the register and login pages of the application using HTML, CSS and JavaScript.
  • Implemented the database to store the questions, possible answers, correct answers and user scores, and queried it using SQL.
  • Ran MySQL queries from Python using the Python MySQL connector and the MySQLdb package to retrieve information.
  • Developed a wrapper in Python for instantiating multi-threaded applications.
  • Created and optimized diverse SQL queries to validate the accuracy of data and ensure database integrity.
  • As a SQL Server developer, worked closely with application developers to ensure proper design and implementation of database systems.
Environment: Python, Anvil, HTML, CSS, JavaScript, MySQL, SSIS, SQL Server, GitHub
