Big Data Engineer Resume

Pleasanton, CA


  • 8+ years of experience in the IT industry implementing, developing, and maintaining web-based applications using Java, J2EE technologies, and the Big Data ecosystem.
  • Strong knowledge of Hadoop architecture and daemons such as HDFS, JobTracker, TaskTracker, NameNode, and DataNode, and of MapReduce concepts.
  • Well versed in implementing end-to-end solutions on big data using the Hadoop framework.
  • Hands-on experience writing MapReduce programs in Java to handle different data sets using map and reduce tasks.
  • Hands-on experience with Sequence files, RC files, combiners, counters, dynamic partitions, and bucketing for best practice and performance improvement.
  • Worked with join patterns and implemented map-side joins and reduce-side joins using MapReduce.
  • Developed multiple MapReduce jobs to perform data cleaning and preprocessing.
  • Designed Hive queries and Pig scripts to perform data analysis, data transfer, and table design.
  • Experience developing a data pipeline using Kafka to store data in HDFS.
  • Good knowledge of AWS infrastructure services: Amazon Simple Storage Service (Amazon S3), EMR, and Amazon Elastic Compute Cloud (Amazon EC2).
  • Implemented ad hoc queries using Hive to perform analytics on structured data.
  • Expertise in writing Hive UDFs and generic UDFs to incorporate complex business logic into Hive queries.
  • Experienced in optimizing Hive queries by tuning configuration parameters.
  • Involved in designing the data model in Hive for migrating the ETL process into Hadoop and wrote Pig Scripts to load data into Hadoop environment.
  • Implemented Sqoop for large dataset transfers between Hadoop and RDBMS.
  • Extensively used Apache Flume to collect logs and error messages across the cluster.
  • Experienced in performing real-time analytics on HDFS using HBase.
  • Used Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
  • Experience in composing shell scripts to dump the shared information from MySQL servers to HDFS.
  • Worked on implementing and optimizing Hadoop/MapReduce algorithms for Big Data analytics.
  • Experience working with Amazon EMR, Cloudera (CDH3 & CDH4), and Hortonworks Hadoop distributions.
  • Experience in meeting expectations with Hadoop clusters using Cloudera (CDH3 & CDH4) and Hortonworks.
  • Worked with Oozie and ZooKeeper to manage the flow of jobs and coordination in the cluster.
  • Experience in performance tuning and monitoring the Hadoop cluster by gathering and analyzing the existing infrastructure using Cloudera Manager.
  • Good knowledge of writing Spark applications using Python and Scala.
  • Experience processing Avro data files using Avro tools and MapReduce programs.
  • Used predefined Spark operators such as map, flatMap, filter, reduceByKey, groupByKey, aggregateByKey, and combineByKey.
  • Used sbt to build Scala-based Spark projects and ran them using spark-submit.
  • Worked on different file formats (ORCFILE, TEXTFILE) and different compression codecs (GZIP, SNAPPY, LZO).
  • Experienced in writing and implementing unit test cases using testing frameworks such as JUnit, EasyMock, and Mockito.
  • Worked on Talend Open Studio and Talend Integration Suite.
  • Adequate knowledge of and working experience with Agile and Waterfall methodologies.
  • Good understanding of all aspects of testing, such as Unit, Regression, Agile, White-box, and Black-box.
  • Expert in developing applications using Servlets, JPA, JMS, Hibernate, and Spring frameworks.
  • Extensive experience implementing and consuming REST-based web services.
  • Good knowledge of Web/Application Servers like Apache Tomcat, IBM WebSphere and Oracle WebLogic.
  • Ability to work with onsite and offshore team members.
  • Able to work on own initiative; highly proactive, self-motivated, committed to the work, and resourceful.


Big Data Ecosystems: Hadoop MapReduce, HDFS, ZooKeeper, Hive, Pig, Sqoop, Oozie, Flume, YARN, Spark

Database Languages: SQL, PL/SQL, Oracle

Programming Languages: Java, Scala

Frameworks: Spring, Hibernate, JMS

Scripting Languages: JSP, Servlets, JavaScript, XML, HTML, Python

Web Services: RESTful web services

Databases: RDBMS, HBase, Cassandra

IDE: Eclipse, IntelliJ

Platforms: Windows, Linux, Unix

Application Servers: Apache Tomcat, WebSphere, WebLogic, JBoss

Methodologies: Agile, Waterfall

ETL Tools: Informatica, Talend


Confidential, Pleasanton, CA

Big Data Engineer


  • Developed Spark applications using Scala and Java, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
  • Experience implementing Spark RDDs in Scala.
  • Experience building bi-directional data pipelines between Oracle 11g and HDFS using Sqoop.
  • Experience developing MapReduce programs in Java and optimizing MapReduce algorithms using Custom Partitioner, Custom Shuffle, and Custom Sort.
  • Designed and built the reporting application, which uses Spark SQL to fetch data from HBase tables and generate reports on it.
  • Designed and built a Spark Streaming application that evaluates streaming data against business rules through a rules engine and then sends alerts to business users so they can address customer preferences and run product promotions.
  • Parsed JSON and XML files with Pig Loader functions and extracted insightful information from Pig Relations by providing a regex using the built-in functions in Pig.
  • Experience writing Pig Latin scripts for Data Cleansing, ETL operations and query optimization of existing scripts.
  • Experience creating Hive tables, loading tables with data and aggregating data by writing Hive queries.
  • Performed Schema design for Hive and optimized the Hive configuration.
  • Experience writing reusable custom Hive and Pig UDFs in Java and using existing UDFs from Piggybank and other sources.
  • Worked in AWS environment for development and deployment of custom Hadoop applications.
  • Strong experience working with Amazon Elastic MapReduce (EMR) and setting up environments on AWS EC2 instances.
  • Ability to spin up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
  • Experience designing and executing time-driven and data-driven Oozie workflows.
  • Experience developing algorithms for full-text search capability using Solr.
  • Experience processing Avro data files using Avro tools and MapReduce programs.
  • Experience developing programs to deal with multiple compression formats such as LZO, GZIP, Snappy, and LZ4.
  • Experience loading and transforming large amounts of structured and unstructured data into HBase, and exposure to handling automatic failover in HBase.

Environment: Hadoop, Spark, Spark Streaming, Spark SQL, AWS EMR, HDFS, Hive, Pig, Apache Kafka, Sqoop, Flume, Cloudera (CDH 4), Java, Eclipse, Teradata, MongoDB, Ubuntu, UNIX, and Maven.

Confidential, Minneapolis, MN

Spark/Scala Developer


  • Analyzed and defined the researchers' strategy and determined the system architecture and requirements needed to achieve its goals.
  • Developed multiple Kafka producers and consumers per the software requirement specifications.
  • Used Kafka for log aggregation, collecting physical log files off servers and placing them in a central location such as HDFS for processing.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
  • Used various Spark transformations and actions to cleanse the input data.
  • Developed shell scripts to generate Hive CREATE statements from the data and load the data into the tables.
  • Wrote MapReduce jobs using the Java API and Pig Latin.
  • Optimized HiveQL and Pig scripts by using execution engines such as Tez and Spark.
  • Involved in writing custom MapReduce programs using the Java API for data processing.
  • Integrated Maven build and designed workflows to automate the build and deploy process.
  • Involved in developing a linear regression model, built using Spark with the Scala API, to predict a continuous measurement and improve observations on wind turbine data.
  • Created Hive tables per requirements as internal or external tables, defined with appropriate static and dynamic partitions and bucketing for efficiency.
  • Loaded and transformed large sets of structured and semi-structured data using Hive.
  • Extracted a real-time feed using Kafka and Spark Streaming, converted it to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
  • Used Spark and Spark SQL to read the Parquet data and create the tables in Hive using the Scala API.
  • Developed Hive queries for the analysts.
  • Implemented Cassandra using the DataStax Java API.
  • Very good understanding of Cassandra cluster mechanisms, including replication strategies, snitches, gossip, consistent hashing, and consistency levels.
  • Experienced in using the Spark application master to monitor Spark jobs and capture their logs.
  • Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster testing and processing of data.
  • Involved in making code changes for a module in turbine simulation for processing across the cluster using spark-submit.
  • Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS.
  • Imported data from different sources such as AWS S3 and the local file system (LFS) into Spark RDDs.
  • Involved in performing analytics and visualization on log data to estimate the error rate and study the probability of future errors using regression models.
  • Used the WebHDFS REST API to make HTTP GET, PUT, POST, and DELETE requests from the web server to perform analytics on the data lake.
  • Worked on a POC to perform sentiment analysis of Twitter data using the OpenNLP API.
  • Worked on high performance computing (HPC) to simulate tools required for the genomics pipeline.
  • Used Kafka to rebuild a customer activity tracking pipeline as a set of real-time publish-subscribe feeds.
  • Provided cluster coordination services through ZooKeeper.

Environment: Hadoop, Hive, HDFS, HPC, WebHDFS, WebHCat, Spark, Spark SQL, Kafka, Java, Scala, web servers, Maven, and sbt.

Confidential, St. Louis, MO

Jr. Big Data Developer


  • Exported data from HDFS to MySQL using Sqoop and an NFS mount approach.
  • Moved data from HDFS to Cassandra using Map Reduce and BulkOutputFormat class.
  • Developed Map Reduce programs for applying business rules on the data.
  • Developed and executed Hive queries for denormalizing the data.
  • Moved bulk data into HBase using MapReduce integration.
  • Worked with ETL workflows, analyzed big data, and loaded it into the Hadoop cluster.
  • Installed and configured Hadoop Cluster for development and testing environment.
  • Implemented the Fair Scheduler on the JobTracker to share cluster resources among the MapReduce jobs submitted by users.
  • Strong understanding of the Hadoop ecosystem, including HDFS, MapReduce, HBase, ZooKeeper, Pig, Hadoop Streaming, Sqoop, Oozie, and Hive.
  • Involved in creating Shell scripts to simplify the execution of all other scripts (Pig, Hive, Sqoop, Impala and MapReduce) and move the data inside and outside of HDFS.
  • Experienced in handling Avro data files by passing schema into HDFS using Avro tools and Map Reduce.
  • Automated the workflow using shell scripts.
  • Imported log files into HDFS using Flume and loaded them into Hive tables to query the data.
  • Performance-tuned Hive queries written by other developers.
  • Mastered the major Hadoop distributions (HDP and CDH) and numerous open source projects.
  • Prototyped various applications that utilize modern Big Data tools.

Environment: Linux, Java, Map Reduce, HDFS, DB2, Cassandra, Hive, HBase, Flume, Pig, Sqoop, FTP.

Confidential, St Louis, MO

Hadoop Developer


  • Worked on installing Hadoop ecosystem components.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Used Sqoop to import customer data from a MySQL database.
  • Responsible for developing MapReduce programs (a parallel-processing programming model) to preprocess data in HDFS.
  • Developed Pig scripts for analyzing large data sets in HDFS.
  • Implemented the workflows using Oozie for running MapReduce and Pig jobs.
  • Hands-on experience working with structured, semi-structured, and unstructured data.
  • Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
  • Developed user-defined functions to provide custom Pig capabilities.

Environment: Apache Hadoop, Java (JDK 1.6), Pig, MySQL, Sqoop, and Oozie.


Java/Big Data Developer


  • Coordinated with business customers to gather business requirements.
  • Involved in the design and development of technical specifications using Hadoop technology.
  • Involved in implementing the data model in the Cassandra database.
  • Performed different kinds of transactions on Cassandra data using the Thrift API.
  • Analyzed web server log data using Apache Flume.
  • Used Spring Framework, Spring-AOP, Spring-ORM, Spring JDBC modules.
  • Created Pig Latin scripts to sort, group, join, and filter enterprise-wide data.
  • Analyzed large data sets by running Hive queries and Pig scripts.
  • Worked on custom MapReduce programs to analyze data and used Pig Latin to clean unwanted data.
  • Developed Scripts and Batch Jobs to schedule various Hadoop Programs.

Environment: CDH3, Linux, Java, MapReduce, HDFS, DB2, Cassandra, Hive, Oozie, Flume, Pig, Sqoop, Maven, Oracle 11g


Java/J2EE Developer


  • Played an active role in the team by interacting with welfare business analysts and program specialists and converting business requirements into system requirements.
  • Developed and deployed the UI-layer logic of sites using JSP.
  • Used Struts (MVC) to implement the business model logic.
  • Worked with Struts MVC objects such as ActionServlet, controllers, validators, web application context, handler mapping, and message resource bundles, and with JNDI for look-up of J2EE components.
  • Developed dynamic JSP pages with Struts.
  • Developed the XML data object to generate the PDF documents and other reports.
  • Used Hibernate, DAO, and JDBC for data retrieval and modifications in the database.
  • Handled web service messaging and interaction using SOAP and REST.
  • Developed JUnit test cases for unit testing, as well as system and user test scenarios.
  • Worked with RESTful web services to enable interoperability.

Environment: Core Java, J2EE, JDBC, Java 1.4, Servlets, JSP, Struts, Hibernate, Web services, RESTful services, SOAP, WSDL, Design Patterns, MVC, HTML, JavaScript 1.2, WebLogic 8.0, XML, JUnit, Oracle 10g, MyEclipse.


Jr. Java Developer


  • Analyzed requirements and prepared the requirements analysis document.
  • Deployed the application to the JBoss application server.
  • Implemented a SOAP web service using Apache Axis.
  • Gathered requirements from the various parties involved in the project.
  • Studied the OAuth, JWT, TOTP, and SAML protocols for an SSO solution.
  • Used J2EE and EJB to handle the business flow and functionality.
  • Involved in the complete SDLC of the Development with full system dependency.
  • Actively coordinated with deployment manager for application production launch.
  • Provided support and updates during the warranty period.
  • Monitored test cases to verify actual results against expected results.
  • Carried out regression testing to track problems.

Environment: Java, J2EE, EJB, UNIX, XML, Work Flow, JMS, JIRA, Oracle, JBoss, SOAP
