
Big Data Engineer Resume


  • 9+ years of professional experience in Design, Development, Implementation, Deployment and support of business applications using Java/J2EE Technologies and big data technologies.
  • 4+ years of experience in Big Data, Hadoop, and Hadoop ecosystem components like MapReduce, Sqoop, Flume, Kafka, Pig, Hive, Spark, Storm, HBase, Oozie, and ZooKeeper.
  • Good experience in the Hadoop framework and related technologies like HDFS, MapReduce, Pig, Hive, HBase, Sqoop and Oozie.
  • Hands-on experience with data extraction, transformation and load in Hive, Pig and HBase.
  • Experience in creating DStreams from sources like Flume and Kafka and performing different Spark transformations and actions on them.
  • Experience in integrating Apache Kafka with Apache Storm and creating Storm data pipelines for real-time processing.
  • Worked on improving the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, RDDs, and Spark on YARN.
  • Delivery experience on major Hadoop ecosystem components such as Pig, Hive, Spark, Kafka, Elasticsearch and HBase, and monitoring with Cloudera Manager. Extensive working experience using Sqoop to import data into HDFS from RDBMS and vice versa.
  • Extensive experience in using Apache Storm, Apache Spark, Kafka, Maven and ZooKeeper.
  • Experienced in building data pipelines using Kafka for handling terabytes of data.
  • Hands-on experience with Solr to index files directly from HDFS for both structured and semi-structured data.
  • Strong experience in RDBMS technologies like MySQL, Oracle, Postgres and DB2.
  • Training and Knowledge in Mahout, Spark MLlib for use in data classification, regression analysis, recommendation engines and anomaly detection.
  • Experienced in developing Spark applications using the Spark Core, Spark SQL and Spark Streaming APIs.
  • Involved in configuring and working with Flume to load data from multiple sources directly into HDFS.
  • Hands-on experience with Hortonworks and Cloudera Distributed Hadoop (CDH).
  • Experience on cloud infrastructure like Amazon Web Services (AWS).
  • Experience with predictive intelligence and smooth maintenance in Spark Streaming using Conviva and Spark MLlib.
  • Expertise with the Spark engine, creating batch jobs with incremental loads through HDFS/S3, Kinesis, sockets, AWS, etc.
  • Imported data using Sqoop to load data from MySQL to S3 buckets on a regular basis.
  • Worked on Amazon AWS concepts like EMR and EC2 web services for fast and efficient processing of Big Data.
  • Expert in Java 1.8 lambdas, streams, and type annotations.
  • Experience in deployment of Big Data solutions and the underlying infrastructure of Hadoop Cluster using Cloudera, MapR and Hortonworks distributions.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Performed analytics in Hive using various file formats like JSON, Avro, ORC, and Parquet.
  • Experience in NoSQL databases like HBase, Cassandra, Redis and MongoDB.
  • Experience in using design patterns, Java, JSP, Servlets, JavaScript, HTML, jQuery, AngularJS, jQuery Mobile, JBoss 4.2.3, XML, WebLogic, SQL, PL/SQL, JUnit, Apache Tomcat and Linux.
  • Good knowledge of Amazon AWS computing services like EC2, which provide fast and efficient processing of Big Data.
  • Experience working with EMR for data visualization.


Hadoop technologies: HDFS, MapReduce, Sqoop, Flume, Pig, Hive, Oozie, Impala, Apache NiFi, ZooKeeper.

NoSQL Databases: MongoDB, HBase, Cassandra

Real time/Stream processing: Apache Spark

Distributed message broker: Apache Kafka

Monitoring and Reporting: Tableau, Zeppelin Notebook

Hadoop Distribution: Cloudera, Hortonworks, AWS (EMR)

Build Tools: Maven, SBT

Cloud Technologies: AWS Glacier, S3

Programming & Scripting: Java, C, SQL, Shell Scripting, Python

Java Technologies: Servlets, JavaBeans, JDBC, Spring, Hibernate, SOAP/RESTful services

Databases: Oracle, MySQL, MS SQL Server, Teradata

Web Dev. Technologies: HTML, XML, JSON, CSS, jQuery, JavaScript, AngularJS

Tools & Utilities: Eclipse, NetBeans, SVN, CVS, SoapUI, MQ Explorer, RFHUtil, JMX explorer, SSRS, Aqua Data Studio, XMLSpy, ETL (Talend, Pentaho), IntelliJ (Scala)



Big Data Engineer


  • Responsible for building scalable distributed data solutions using Hadoop components.
  • Solid understanding of Hadoop HDFS, MapReduce and other ecosystem projects.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Knowledge of the architecture and functionality of NoSQL databases like HBase.
  • Used S3 for data storage, responsible for handling huge amounts of data.
  • Used EMR for data pre-analysis by creating EC2 instances.
  • Used Kafka for obtaining near-real-time data.
  • Good experience in writing data ingesters like Sqoop.
  • Experience in designing and developing applications in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Implemented Spark jobs in Scala, using the DataFrame and Spark SQL APIs and pair RDDs for faster data processing, and created RDDs, DataFrames and Datasets.
  • Performed batch processing using Spark jobs implemented in Scala.
  • Performed extensive data validation using Hive and wrote Hive UDFs.
  • Involved in scheduling Oozie workflows to run multiple Hive and Pig jobs.
  • Used Cloudera data platform for deploying Hadoop in some modules.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Did extensive scripting (Python and shell) to provision and spin up virtualized Hadoop clusters.
  • Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
  • Created external tables pointing to HBase to access tables with a huge number of columns.
  • Involved in loading the created HFiles into HBase for faster access to a large customer base without taking a performance hit.
  • Configured the Talend ETL tool for some data filtering.
  • Processed the data in HBase using Apache Crunch pipelines, a MapReduce programming model that is efficient for processing Avro data formats.
  • Involved in collecting, aggregating and moving data from servers to HDFS using Apache Flume.
  • Used Tableau for data visualization and generating reports.
  • Configured Kerberos for the clusters.
  • Used Apache Solr for indexing in HDFS.
  • Integrated with RDBMS using Sqoop and JDBC connectors.
  • Used different file formats like Text files, Sequence Files, Avro, and CSV.
  • Extensive experience in writing UNIX shell scripts and automation of the ETL processes using UNIX shell scripting.

Environment: UNIX, Linux, Java, Apache HDFS, MapReduce, Spark, Pig, Hive, HBase, Kafka, Sqoop, NoSQL, AWS (S3 buckets), EMR cluster, Solr.

Confidential, Hartford, CT

Big Data Engineer


  • Used the Hortonworks distribution for the Hadoop ecosystem.
  • Created Sqoop jobs to import data from relational database systems into HDFS and to export the results back into the databases.
  • Extensively used Pig for data cleansing using Pig scripts and Embedded Pig scripts.
  • Scheduled Oozie workflows to run multiple Hive and Pig jobs.
  • Wrote Python scripts to analyze customer data.
  • Created partitioned tables in Hive, designed a data warehouse using Hive external tables, and wrote Hive queries for analysis.
  • Captured web server logs into HDFS using Flume for analysis.
  • Performed advanced procedures like text analytics and processing using the in-memory computing capabilities of Spark.
  • Implemented Spark RDD transformations to map business analysis and applied actions on top of the transformations.
  • Involved in migrating MapReduce jobs into Spark jobs and used Spark SQL and the DataFrame API to load structured data into Spark clusters.
  • Used DataFrames and RDDs for data transformations.
  • Designed and developed Spark workflows using Scala to pull data from cloud-based systems and apply transformations to it.
  • Used Spark Streaming to consume topics from the distributed messaging source Event Hub and periodically push batches of data to Spark for real-time processing.
  • Tuned Cassandra and MySQL for optimizing the data.
  • Implemented monitoring and established best practices around usage of Elasticsearch.
  • Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
  • Hands-on experience with Hortonworks tools like Tez and Ambari.
  • Worked on Apache NiFi as an ETL tool for batch processing and real-time processing.
  • Fetched and generated monthly reports and visualized them using Tableau.
  • Developed Tableau visualizations and dashboards using Tableau Desktop.
  • Extracted files from Cassandra through Sqoop and placed in HDFS for further processing.
  • Strong working experience on Cassandra for retrieving data from Cassandra clusters to run queries.
  • Experience in Data modelling using Cassandra.
  • Very good understanding of Cassandra cluster mechanisms, including replication strategies, snitches, gossip, consistent hashing and consistency levels.
  • Used the DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting and grouping.
  • Worked with BI (Business Intelligence) teams in generating reports and designing ETL workflows on Tableau. Deployed data from various sources into HDFS and built reports using Tableau.
  • Extensively involved in creating MapReduce jobs to power data for search and aggregation.
  • Managed Hadoop jobs as DAGs using the Oozie workflow scheduler.
  • Involved in developing code to write canonical model JSON records from numerous input sources to Kafka Queues.
  • Involved in loading data from Linux file systems, servers, and Java web services using Kafka producers and consumers.

Environment: Hadoop, Hive, MapReduce, Sqoop, Spark, Eclipse, Maven, Java, agile methodologies, AWS, Tableau, Pig, Elasticsearch


Hadoop Developer


  • Experience in importing and exporting data into HDFS and Hive using Sqoop.
  • Developed Flume Agents for loading and filtering the streaming data into HDFS.
  • Experienced in handling data from different data sets, joining and preprocessing them using Pig join operations.
  • Moved bulk data into HBase using MapReduce integration.
  • Developed MapReduce programs to clean and aggregate the data.
  • Developed HBase data model on top of HDFS data to perform real time analytics using Java API.
  • Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
  • Strong understanding of the Hadoop ecosystem, including HDFS, MapReduce, HBase, ZooKeeper, Pig, Hadoop Streaming, Sqoop, Oozie and Hive.
  • Implemented counters on HBase data to count total records on different tables.
  • Experienced in handling Avro data files by passing schema into HDFS using Avro tools and Map Reduce.
  • Worked on custom Pig Loaders and Storage classes to work with a variety of data formats such as JSON, Compressed CSV, etc.
  • Worked with MongoDB for developing and implementing programs in the Hadoop environment.
  • Extracted and restructured the data into MongoDB using the import and export command-line utilities.
  • Used MongoDB as part of a POC and migrated a few of the stored procedures from SQL to MongoDB.
  • Worked with NoSQL databases (MongoDB) and hybrid implementations.
  • Implemented secondary sorting to sort reducer output globally in MapReduce.
  • Implemented a data pipeline by chaining multiple mappers using ChainMapper.
  • Created Hive dynamic partitions to load time series data.
  • Worked on Impala for creating views for business use-case requirements on top of Hive tables.
  • Experienced in handling different types of joins in Hive, like map joins, bucket map joins, and sorted bucket map joins.
  • Created tables, partitions and buckets and performed analytics using Hive ad-hoc queries.
  • Experienced in importing and exporting data between HDFS/Hive and relational databases and Teradata using Sqoop.
  • Handled continuous streaming data from different sources using Flume, with HDFS set as the destination.
  • Integrated Spring schedulers with the Oozie client as beans to handle cron jobs.
  • Experience with the CDH distribution and Cloudera Manager to manage and monitor Hadoop clusters.
  • Actively participated in software development lifecycle (scope, design, implement, deploy, test), including design and code reviews.
  • Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.

Environment: Hadoop, HDFS, MapReduce, Hive, Impala, Pig, HBase, MongoDB, Sqoop, RDBMS/DB, Flat files, MySQL, CSV, Avro data files.
