Big Data Engineer Resume

Valley Forge, PA


  • Around 6 years of professional experience in software systems development and business systems, including experience in Big Data ecosystem technologies, with a master's degree in Information Systems.
  • Experience in data management and implementation of Big Data applications using Spark and Hadoop frameworks.
  • Analyzed data using Spark SQL, HiveQL and Pig Latin.
  • Hands-on experience building streaming applications using Spark Streaming and Kafka with minimal or no data loss and duplicates.
  • Excellent technical and analytical skills with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.
  • Strong experience and knowledge of HDFS, MapReduce and Hadoop ecosystem components like Hive, Pig and Sqoop, and of NoSQL databases such as MongoDB and Cassandra.
  • Familiarity with Amazon Web Services along with provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, RDS and others.
  • Extensive work in Text Analytics, developing different Statistical Machine Learning, Data Mining solutions to various business problems and generating data visualizations using R, Python and Tableau.
  • Hands-on experience in implementing LDA and Naïve Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, Neural Networks and Principal Component Analysis, with good knowledge of Recommender Systems.
  • Expertise in transforming business requirements into analytical models, designing algorithms, building models, and developing data mining and reporting solutions that scale across massive volumes of structured and unstructured data.
  • Performed statistical and graphical analytics using NumPy, pandas, Matplotlib and BI tools such as Tableau.
  • Experience in using visualization tools like Tableau, ggplot2 and d3.js for creating dashboards.
  • Performed statistical modelling with ML to derive insights from data under the guidance of the Principal Data Scientist.


Big Data Tools: HDFS, MapReduce, Hive, Pig, Hadoop Streaming

Languages: Python, Scala, R

Tools & Utilities: PyCharm, Databricks, SQL Server Management Studio

NoSQL Databases: Cassandra & MongoDB

Machine learning: Decision trees, Random forest, Linear & Logistic regression, PCA, K-means, XGBoost and predictive analytics based algorithms

Data Streaming: Batch Processing & Real-time streaming using Kafka

AWS platform: Familiarity with cluster deployment using EC2 and EMR apart from using Storage platforms such as S3 along with a basic understanding of AWS Redshift.

O.S: Linux, Windows, Shell Scripting.


Confidential - Valley Forge, PA

Big Data Engineer


  • Simulated credit risk scenarios and used logistic regression along with decision-tree-based ML algorithms to predict the outcomes.
  • Transformed Kafka-loaded data using Spark Streaming with Scala and Python.
  • Used scikit-learn, pandas, NumPy and TensorFlow to derive insights from data and trained a credit fraud detection model for batch data.
  • Created the framework for the dashboard using Tableau and optimized it using open source Google optimization tools.
  • Trained a Fraud Prediction model and created a customizable dashboard for credit monitoring agencies.
  • Developed Sqoop scripts to migrate data from Oracle to the Big Data environment.
  • Worked extensively with Avro and Parquet files and converted data from either format.
  • Parsed semi-structured JSON data and converted it to Parquet using DataFrames in Spark.
  • Developed a Python script to load CSV files into S3 buckets; created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.
  • Created Airflow scheduling scripts in Python to automate Sqoop imports of a wide range of data sets.
  • Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
  • Converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.
  • Collated real-time streaming data from credit agencies such as TransUnion and Experian, performed data cleaning and fed the data into Kafka.
  • Deployed the model using RESTful APIs and used Docker to facilitate multi-environment transition.
  • Streaming data was stored using Amazon S3 deployed over EC2 and EMR cluster framework apart from in-house tools.

Environment: Podium Data, Data Lake, HDFS, Hue, AWS S3, Impala, Spark, Scala, Kafka, Looker, AWS EC2 and EMR

Confidential - NY, NY

Big Data Engineer


  • Streamed cleaned, consistent data through Kafka into Spark and performed manipulations on real-time data with Python and Scala.
  • Built the machine-learning-based coupon purchase recommendation engine by training the model on historical purchase data of customers across the retail sector.
  • Simulated real-time scenarios using the scikit-learn and TensorFlow libraries on batch data to train the model, with the resulting model being used in real-time models.
  • Wrote Spark applications for Data validation, cleansing, transformations and custom aggregations.
  • Imported data from different sources into Spark RDD for processing.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Worked on installing cluster, commissioning & decommissioning of Data node, Name node high availability, capacity planning, and slots configuration.
  • Developed Spark applications for the entire batch processing using Scala.
  • Automatically scaled up the EMR instances based on the data.
  • Stored the time-series transformed data from the Spark engine, built on top of a Hive platform, to Amazon S3 and Redshift.
  • Facilitated deployment of a multi-clustered environment using AWS EC2 and EMR, and deployed Docker containers for cross-functional deployment.
  • Visualized the results using Tableau dashboards and used the Python Seaborn library for data interpretation in deployment.

Environment: Spark, AWS, EC2, EMR, Hive, MS SQL Server, Genie Logs, Kafka, Sqoop, Spark SQL, Spark Streaming, Scala, Python, Tableau


Big Data Engineer


  • Responsible for penetration testing of corporate networks and simulated virus infections on computers to assess network security; presented a report to the Development Team to assess the intrusions.
  • Used Spark SQL to load JSON data, create a schema RDD and load it into Hive tables, and handled structured data using Spark SQL.
  • Loaded the data into Spark RDDs and performed in-memory computation to generate the output response.
  • Developed Spark scripts using Scala shell commands as per the requirements.
  • Worked on the pub-sub system using Apache Kafka during the ingestion process.
  • Worked on automating the flow of data between software systems using Apache NiFi.
  • Prepared workflows for scheduling the load of data into Hive using IBIS Connections.
  • Worked on a robust automated framework in Data Lake for metadata management that integrates various metadata sources, consolidates them and updates Podium with the latest, high-quality metadata using big data technologies like Hive and Impala.
  • Coordinated with product leads to identify problems with Norton products and acted as a liaison between the Development Team and Quality Team to ascertain the efficiency of the product.
  • Handled network intrusion data and ran Spark jobs on it to identify the most common threats and analyze them.

Environment: Hadoop (Cloudera Stack), Hue, Spark, Kafka, HBase, HDFS, Hive, Pig, Sqoop


Data Engineer


  • Worked as a Data Engineer with Confidential Technical Support for over 10,000 US customers.
  • Reported common issues faced with Confidential products and fostered feasible solutions that made the fewest changes to the physical design of the product and maximized throughput by alleviating the defects.
  • Created a data-profiling dashboard by leveraging Podium's internal architecture, which drastically reduced the time to analyze data quality using Looker reporting.
  • Handled an analytical model for automating data certification in Data Lake using Impala.
  • Worked on an input agnostic framework for data stewards to handle their ever-emerging work group datasets and created a business glossary by consolidating them using Hive.
  • Created technical design documentation for the data models, data flow control process and metadata management.
  • Reverse engineered and generated the data models by connecting to their respective databases.
  • Worked on a robust comparison process to compare data modelers' metadata with data stewards' metadata and identify anomalies using Hive and Podium Data.

Environment: Hadoop (Hortonworks stack), HDFS, Oozie, Pig, Hive, MapReduce, Sqoop, Linux
