
Big Data Engineer Resume

Memphis, TN


  • 8+ years of overall IT experience across a variety of industries, including hands-on experience in Big Data analytics and development.
  • Experience in collecting, processing, and aggregating large amounts of streaming data using Kafka, Spark Streaming.
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Good knowledge of Apache NiFi for automating and managing the data flow between systems.
  • Experience in designing Data Marts by following Star Schema and Snowflake Schema Methodology.
  • Experienced in working with distributed ecosystems.
  • Experience in data management and implementation of Big Data applications using Spark and Hadoop frameworks.
  • Experience in analyzing data using Spark SQL, HiveQL, and Pig Latin.
  • Familiarity with Amazon Web Services along with provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, RDS and others.
  • Hands-on experience building streaming applications using Spark Streaming and Kafka with minimal or no data loss and duplicates.
  • Excellent technical and analytical skills with clear understanding of design goals and development for OLTP and dimension modeling for OLAP.
  • Strong experience with and knowledge of HDFS, MapReduce, Hadoop ecosystem components like Hive, Pig, and Sqoop, and NoSQL databases such as MongoDB and Cassandra.
  • Extensive work in text analytics, developing statistical machine learning and data mining solutions to various business problems, and generating data visualizations using R, Python, and Tableau.
  • Hands-on experience implementing LDA and Naïve Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks, and Principal Component Analysis; good knowledge of Recommender Systems.
  • Performed statistical and graphical analytics using NumPy, Pandas, Matplotlib, and BI tools such as Tableau.
  • Experience in using visualization tools like Tableau, ggplot2 and d3.js for creating dashboards.
  • Performed statistical modelling with ML to derive insights from data under the guidance of a Principal Data Scientist.


BigData/Hadoop Technologies: MapReduce, Spark, SparkSQL, Azure, Spark Streaming, Kafka, PySpark, Pig, Hive, HBase, Flume, Flink, Yarn, Oozie, Zookeeper, Hue, Ambari Server

Languages: HTML5, DHTML, WSDL, CSS3, C, C++, XML, R/R Studio, SAS Enterprise Guide, SAS, R (Caret, Weka, ggplot), Perl, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python (NumPy, SciPy, Pandas, Gensim, Keras), JavaScript, Shell Scripting

NO SQL Databases: Cassandra, HBase, MongoDB, MariaDB

Web Design Tools: HTML, CSS, JavaScript, JSP, jQuery, XML

Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans.

Public Cloud: EC2, IAM, S3, Autoscaling, CloudWatch, Route53, EMR, RedShift, Glue, Athena, SageMaker.

Orchestration tools: Oozie, Airflow.

Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall

Build Tools: Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos.

Databases: Microsoft SQL Server 2008,2010/2012, MySQL 4.x/5.x, Oracle 11g, 12c, DB2, Teradata, Netezza

Operating Systems: All versions of Windows, UNIX, LINUX, Macintosh HD, Sun Solaris


Confidential, Memphis, TN

Big Data Engineer


  • Implemented a RESTful web service to interact with the Redis cache framework.
  • Handled data intake through Sqoop and ingestion through MapReduce and HBase.
  • Extensively worked on Spark Streaming and Apache Kafka to fetch live stream data.
  • Collaborated with other data scientists and architected custom data visualization solutions using tools like Tableau and R packages.
  • Developed predictive models using Python and R to predict customer churn and classify customers.
  • Documented best practices and the target approach for the CI/CD pipeline.
  • Constructed product-usage SDK data and data aggregations using PySpark, Scala, Spark SQL, and HiveContext in partitioned Hive external tables maintained in an AWS S3 location for reporting, data science dashboarding, and ad hoc analyses.
  • Involved in data processing using an ETL pipeline orchestrated by AWS Data Pipeline using Hive.
  • Installed Kafka Manager to track consumer lag and monitor Kafka metrics; also used it to add topics, partitions, etc.
  • Handled the importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop. sources, channels and sink by which data is ingested into HDFS
  • Responsible for performing various transformations like sort, join, aggregation, and filter in order to retrieve various datasets using Apache Spark.
  • Experience in extracting appropriate features from datasets in order to handle bad, null, and partial records using Spark SQL.
  • Developed various data loading strategies and performed various transformations for analyzing the datasets using the Hortonworks distribution of the Hadoop ecosystem.
  • Wrote Spark RDD transformations, actions, DataFrames, and case classes for the required input data, and performed the data transformations using SparkContext to convert RDDs to DataFrames.
  • Stored DataFrames into Hive as tables using Python (PySpark).
  • Experienced in ingesting data into HDFS from various relational databases like Teradata using Sqoop, and exported data back to Teradata for data storage.
  • Experience in developing various Spark applications using spark-shell (Scala).
  • Involved in creating Hive Tables, loading with data, and writing Hive queries which will invoke and run MapReduce jobs in the backend.
  • Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez.
  • Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying on the log data.
  • Extracted data from multiple sources, applied transformations, loaded data into HDFS.
  • Migrated ETL jobs to Pig scripts to perform transformations, including joins and some pre-aggregations, before storing the data in HDFS.
  • Hands-on experience in developing Apache Spark applications using Spark tools like RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL.
  • Wrote optimized Pig scripts, and developed and tested Pig Latin scripts.
  • Implemented workflows using the Apache Oozie framework to automate tasks.
  • Worked on different file formats like Sequence files, XML files and Map files using MapReduce Programs.
  • Exported data from HDFS to the Cassandra (NoSQL) database using Sqoop and ran various CQL commands on Cassandra to obtain datasets as required.
  • After all transformations were performed, stored the data in MongoDB (NoSQL) using Sqoop.
  • Created and imported various collections and documents into MongoDB and performed various actions like query, project, aggregation, sort, and limit.
  • Involved in unit testing and delivered unit test plans and results documents using JUnit and MRUnit.

Environment: Hadoop, HDFS, MapReduce, Spark, Sqoop, Oozie, Pig, Kerberos, Hive, Flume, Tez, Linux, Java, Eclipse, Cassandra, Python, MongoDB.

Confidential, Memphis, TN

Big Data Engineer


  • Performed ETL on data from different formats like JSON, Parquet.
  • Analyzed the data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin).
  • Used Oozie Scheduler systems to automate the pipeline workflow and orchestrate the Spark jobs.
  • Worked on loading and transforming large sets of structured, semi-structured, and unstructured data.
  • Involved in collecting, aggregating, and moving data from servers to HDFS using Flume.
  • Collected data from various Flume agents installed on various servers using multi-hop flows.
  • Created Hive UDFs and UDAFs using Python scripts and Java code based on the given requirements.
  • Automated all the jobs to pull the data and load it into Hive tables using Oozie workflows.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Knowledge of microservices architecture in Spring Boot, integrating with various RESTful web services.
  • Created and maintained technical documentation for launching Hadoop Clusters and for executing Pig Scripts.
  • Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
  • Developed Sqoop scripts to migrate data from Oracle to the big data environment.
  • Transformed Kafka loaded data using Spark-streaming with Scala and Python.
  • Extensively worked with Avro and Parquet files and converted the data from either format.
  • Parsed semi-structured JSON data and converted it to Parquet using DataFrames in Spark.
  • Used scikit-learn, Pandas, NumPy, and TensorFlow to derive insights from data and trained a credit fraud detection model on batch data.
  • Created the framework for the dashboard using Tableau and optimized the same using open source Google optimization tools.
  • Created Airflow scheduling scripts in Python to automate the process of Sqooping a wide range of data sets.
  • Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
  • Converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.
  • Collated real-time streaming data from credit agencies such as TransUnion and Experian, performed data cleaning, and fed the data into Kafka.
  • Deployed the model using RESTful APIs and used Docker to facilitate multi-environment transition.
  • Stored streaming data in Amazon S3, deployed over an EC2 and EMR cluster framework, in addition to in-house tools.

Environment: Podium Data, Data Lake, HDFS, Hue, AWS S3, Impala, Spark, Scala, Kafka, Looker, AWS EC2 and EMR


Big Data Engineer


  • Designed an end-to-end scalable architecture to solve business problems using various Azure components like HDInsight, Data Factory, Data Lake, Storage, and Machine Learning Studio.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL Activity.
  • Developed Spark applications using Scala and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
  • Wrote multiple Hive UDFs using core Java and OOP concepts, and Spark functions within Python programs.
  • Wrote Spark applications for Data validation, cleansing, transformations, and custom aggregations.
  • Imported data from various sources into Spark RDD for processing.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode high availability, capacity planning, and slots configuration.
  • Streamed cleaned, consistent data through Kafka into Spark and performed manipulations on real-time data with Python and Scala.
  • Built a machine-learning-based coupon purchase recommendation engine by training the model on historical customer purchase data from across the retail sector.
  • Simulated real-time scenarios using the scikit-learn and TensorFlow libraries on batch data to train a model, with the resulting model used in real-time settings.
  • Developed Spark applications for the entire batch processing by using Scala.
  • Automatically scaled up the EMR instances based on data volume.
  • Stored the time-series transformed data from the Spark engine built on top of a Hive platform to Amazon S3 and Redshift.
  • Facilitated deployment of a multi-cluster environment using AWS EC2 and EMR, and deployed Docker containers for cross-functional deployment.
  • Visualized the results using Tableau dashboards and used the Python Seaborn library for data interpretation in deployment.

Environment: Spark, AWS, EC2, EMR, Hive, MS SQL Server, Genie Logs, Kafka, Sqoop, Spark SQL, Spark Streaming, Scala, Python, Tableau


Big Data Engineer


  • Worked on automating the flow of data between software systems using Apache NiFi.
  • Prepared workflows for scheduling the load of data into Hive using IBIS Connections.
  • Worked on a robust automated framework in the Data Lake for metadata management that integrates and consolidates various metadata sources and updates Podium with the latest high-quality metadata, using big data technologies like Hive and Impala.
  • Responsible for penetration testing of corporate networks and simulated virus infections on computers to assess network security & presented a report to the Development Team to assess the intrusions.
  • Used Spark SQL to load JSON data, create a SchemaRDD, and load it into Hive tables, and handled structured data using Spark SQL.
  • Loaded the data into Spark RDDs and performed in-memory data computation to generate the output response.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Worked on the Pub-Sub system using Apache Kafka during the ingestion process.
  • Coordinated with product leads to identify problems with Norton products and acted as a liaison between the Development Team and Quality Team to ascertain the efficiency of the product.
  • Handled network intrusion data and ran Spark jobs on it to identify the most common threats and analyze them.
  • Used Azure Databricks as a fast, easy, and collaborative Spark-based platform on Azure.
  • Used Databricks to integrate easily with the whole Microsoft stack.
  • Wrote Spark SQL and Spark scripts (PySpark) in the Databricks environment to validate the monthly account-level customer data.
  • Created Spark clusters and configured high-concurrency clusters using Azure Databricks (ADB) to speed up the preparation of high-quality data.
  • Spun up HDInsight clusters and used Kafka, Spark, and Databricks for real-time streaming analytics, and Sqoop, Pig, Hive, and Cosmos DB for batch jobs.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.

Environment: Hadoop (Cloudera Stack), Hue, Azure, Databricks, Spark, Kafka, HBase, HDFS, Hive, Pig, Sqoop
