
Bigdata Engineer Resume

New York City, NY


  • 4+ years of experience in analysis, design, development, implementation, maintenance, and support, including developing strategic methods for deploying Big Data technologies to efficiently meet Big Data processing requirements.
  • Experience in developing data pipelines using Sqoop and Flume to extract data from weblogs and store it in HDFS.
  • Experience in managing and reviewing Hadoop log files using Flume and Kafka; developed Pig and Hive UDFs to pre-process data for analysis. Worked with Impala for massively parallel processing of Hive queries.
  • Extended Hive and Pig core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregating Functions (UDAFs).
  • Proficient with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing strategies, and writing and optimizing HiveQL queries.
  • Experience in ingestion, storage, querying, processing, and analysis of Big Data, with hands-on experience in Apache Spark, Spark SQL, and Spark Streaming.
  • Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • Worked with the Spark engine to process large-scale data; experienced in creating Spark RDDs, developing Spark Streaming jobs using RDDs, and using the Spark shell.
  • Experience with RDD architecture, implementing Spark operations on RDDs, and optimizing transformations and actions in Spark.
  • Hands-on experience developing Apache Spark jobs in Scala in a test environment for faster data processing, using Spark SQL for querying.
  • Experienced with the Spark Streaming API to ingest data from Kafka into the Spark engine.
  • Worked on real-time data integration using a Kafka-Storm data pipeline, Spark Streaming, and HBase.
  • Experienced in implementing unified data platforms using Kafka producers/consumers and implementing pre-processing using Storm topologies.
  • Exposure to data lake implementation using Apache Spark; developed data pipelines and applied business logic using Spark.
  • Good experience in using data modelling techniques to find results based on SQL and PL/SQL queries.
  • Good working knowledge of the Spring Framework.
  • Strong experience in writing SQL queries.
  • Experience in Object-Oriented Analysis and Design (OOAD) and software development using UML methodology; good knowledge of J2EE and Core Java design patterns.
  • Expertise in design and development of web applications involving J2EE technologies with Java, Spring, EJB, AJAX, Servlets, JSP, Struts, XML, JMS, UNIX shell scripts, MS SQL Server, and SOAP and RESTful web services.
  • Extensive development experience in IDEs such as Eclipse and NetBeans.
  • Experience in Core Java and JDBC; proficient in using Java APIs for application development.
  • Experience in deploying web applications using the application servers WebLogic, Apache Tomcat, WebSphere, and JBoss.


Big Data Technologies: Hadoop, Spark, Flume, Kafka, Hive, HBase, Pig, HDFS, YARN, Scala, Hortonworks, Cloudera, MapReduce, Python, Sqoop, ZooKeeper, Oozie, Storm, Tez, Impala, Ambari

AWS Components: EC2, EMR, S3, RDS, CloudWatch

Languages/Technologies: Core Java, Scala, JDBC, JUnit, C, C++, XML, SQL, HTML, HQL, Shell Script

Operating Systems: Linux, Windows, CentOS, Ubuntu, RHEL

Databases: MySQL, Oracle 11g/10g/9i, MS SQL Server, HBase, Cassandra, MongoDB

Tools: WinSCP, Wireshark, JIRA, IBM Tivoli, MS Office Suite, DbVisualizer, PuTTY, VM Player, VMware, Eclipse, NetBeans.


Confidential, New York City, NY

Bigdata Engineer


  • Designed and developed applications on the data lake to transform data according to business users' requirements for analytics.
  • Developed shell scripts to perform data quality validations such as record counts, file name consistency, and duplicate file checks, and to create tables and views.
  • Created views that mask PHI columns in the tables so that PHI data cannot be seen by unauthorized teams.
  • Worked with the Parquet file format for better storage and performance of publish tables.
  • Worked with NoSQL databases such as HBase, creating HBase tables to store the audit data of the RAWZ and APPZ tables.
  • Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and unstructured data.
  • Developed shell scripts to perform transformation logic and load data from the raw zone to the app zone.
  • Responsible for developing Python wrapper scripts to perform transformations on the data.
  • Responsible for creating mapping documents from source fields to destination fields.
  • Created different data pipelines using StreamSets to land data from the source in the raw zone.
  • Worked with different file types such as CSV, TXT, and fixed-width to load data from the source into RAWZ tables.
  • Experienced in using Kafka as a data pipeline for JSON data between source and destination.
  • Responsible for triggering jobs using Control-M.
  • Worked in the Agile Scrum model and was involved in sprint activities.
  • Worked with Bitbucket and Jira to deploy projects into production environments.

Environment: Apache Hive, HBase, PySpark, Python, Agile, StreamSets, Bitbucket, Cloudera, Shell Scripting.

Confidential, Johnston, RI

Bigdata Engineer


  • Responsible for loading data pipelines from web servers and Teradata using Sqoop with Kafka and the Spark Streaming API.
  • Developed Kafka producers and consumers, Cassandra clients, and Spark applications, along with components on HDFS and Hive.
  • Populated HDFS and HBase with large volumes of data using Apache Kafka.
  • Used Kafka to ingest data into the Spark engine.
  • Configured, deployed, and maintained multi-node Dev and Test Kafka clusters.
  • Managed and scheduled Spark jobs on a Hadoop cluster using Oozie.
  • Developed Spark applications using Scala for easy Hadoop transitions.
  • Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API.
  • Performed advanced procedures such as text analytics and processing, using the in-memory computing capabilities of Spark with Scala.
  • Implemented Spark applications using Scala and Spark SQL for faster testing and processing of data.
  • Experience with RDD architecture, implementing Spark operations on RDDs, and optimizing transformations and actions in Spark.
  • Performed job functions using Spark APIs in Scala for real-time analysis and fast querying.
  • Implemented Spark RDD transformations and actions for business analysis; worked with Spark accumulators and broadcast variables.
  • Worked with Impala for data retrieval.
  • Exported data from Impala to the Tableau reporting tool and created dashboards on a live connection.
  • Designed multiple Python packages used within a large ETL process that loaded 2 TB of data from an existing Oracle database into a new PostgreSQL cluster.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
  • Loaded data from the Linux file system to HDFS and vice versa.
  • Developed UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation queries, writing results back to OLTP systems through Sqoop.
  • Developed Spark code using Scala and Spark SQL for faster processing and testing.
  • Implemented Spark sample programs in Python using PySpark.
  • Analyzed the SQL scripts and designed the solution to implement using PySpark.
  • Developed PySpark code to mimic the transformations performed in the on-premises environment.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Developed analytical components using Scala, Spark, and Spark Streaming.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Assisted in unit testing of MapReduce jobs using MRUnit.
  • Assisted in exporting data into Cassandra and writing column families to provide fast listing outputs.
  • Used Oozie scheduler systems to automate the pipeline workflow and orchestrate the MapReduce jobs that extract
  • Used ZooKeeper to provide coordination services to the cluster.
  • Worked with the Hue GUI for job scheduling, file browsing, job browsing, and metastore management.

Environment: Apache Hadoop, HDFS, Hive, Java, Sqoop, Spark, Cloudera CDH4, Oracle, MySQL, Tableau, Talend, Elasticsearch, Kibana, SFTP.

Confidential, Waukegan, IL

Sr. Data Engineer


  • Developed real-time dashboards showing live data on users logged in to the website, their behavior flow based on clicks, and other traffic details, using Spark Streaming with Apache Flume and Kafka.
  • Analyzed compression and fine-tuning methods in Vertica and migrated client databases from SQL Server to Vertica.
  • Optimized and wrote several complex queries and stored procedures in SQL Server to generate metrics for custom reports.
  • Implemented ETL in Apache Spark by converting the old SQL logic in SQL Server into transformations and actions.
  • Conceptualized, designed, and developed solutions using Hive, PolyBase, Sqoop, and SQL Server to improve the efficiency of the end-to-end data warehousing process.
  • Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDDs in Spark 1.6 for data aggregation and queries, writing data back to files in Hadoop.
  • Used the Scala collections framework to store data and developed enrichments to process complex consumer information.
  • Used Scala functional programming concepts to develop business logic.
  • Developed Spark code using Scala and Spark SQL for faster processing and testing.
  • Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
  • Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API.
  • Ingested structured incremental data from IBM DB2 sources into the data lake using Sqoop jobs scheduled through Oozie workflows.
  • Involved in converting business transformations into Spark DataFrames and RDDs using Scala.
  • Involved in integrating Hive queries into the Spark environment using Spark SQL and Scala.
  • Processed data using Spark, including aggregations and statistical calculations, using various transformations and actions.
  • Worked with Azure Data Factory (ADF) to compose and orchestrate Azure data services.
  • Ingested structured data from TIBCO Composite Data Virtualization tool into ADLS using Sqoop.
  • Leveraged ADF for bulk copies and incremental loads.
  • Computed complex logic and controlled the data flow using the in-memory processing engine Apache Spark.
  • Developed Spark code using Scala and Spark SQL for faster testing and processing of data.
  • Designed and developed an automated process using shell scripting for data movement and purging.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation and queries, writing data back into the OLTP system through Sqoop.
  • Handled large datasets using partitions, Spark in-memory capabilities, broadcast variables, efficient joins, and transformations during the ingestion process itself.
  • Designed, developed, and maintained data integration programs in a Hadoop and RDBMS environment, with both traditional and non-traditional source systems as well as RDBMS and NoSQL data stores for data access and analysis.

Environment: Hadoop, MapReduce, HDFS, Scala, Spark, Cloudera Manager, Pig, Sqoop, ZooKeeper, Teradata, PL/SQL, MySQL, Windows, HBase.

Confidential, Hartford, CT

Bigdata Developer


  • Wrote MapReduce code to process all log files with rules defined in HDFS (log files generated by different devices have different XML rules).
  • Designed and developed an application to process data using Spark.
  • Developed MapReduce jobs and Hive and Pig scripts for a data warehouse migration project.
  • Designed and developed a system to collect data from multiple portals using Kafka and then process it using Spark.
  • Developed MapReduce jobs and Hive and Pig scripts for the Risk & Fraud Analytics platform.
  • Developed a data ingestion platform using Sqoop and Flume to ingest Twitter and Facebook data for the Marketing & Offers platform.
  • Designed and developed an automated process using shell scripting for data movement and purging.
  • Installed, configured, and managed a small multi-node Hadoop cluster.
  • Installed and configured other open-source software such as Pig, Hive, Flume, and Sqoop.
  • Developed programs in Java and Scala/Spark to reformat data after extraction from HDFS for analysis.
  • Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
  • Imported and exported data into Impala, HDFS, and Hive using Sqoop.
  • Responsible for managing data coming from different sources.
  • Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
  • Developed Hive tables to transform and analyze data in HDFS.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Used various Spark transformations and actions to cleanse the input data.
  • Developed simple to complex MapReduce jobs using Hive and Pig.
  • Involved in running Hadoop jobs for processing millions of records of text data.
  • Developed the application using the Struts framework.
  • Created connections through JDBC and used JDBC statements to call stored procedures.
  • Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
  • Developed Pig UDFs to pre-process data for analysis.
  • Implemented multiple MapReduce jobs in Java for data cleansing and pre-processing.
  • Moved all RDBMS data into flat files generated from various channels to HDFS for further processing.
  • Developed job workflows in Oozie to automate the tasks of loading data into HDFS.
  • Handled importing data from various sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from Teradata into HDFS using Sqoop.
  • Wrote script files for processing data and loading it into HDFS.

Environment: Hadoop, MapReduce, HDFS, Pig, Hive, Java (JDK 1.7), Flat files, Oracle 11g/10g, PL/SQL, SQL*Plus, Windows NT, Sqoop.


Hadoop Developer


  • Developed Sqoop scripts to automate data loads between Oracle/MySQL and Hadoop.
  • Developed Apache Spark-based programs to implement complex business transformations.
  • Developed custom Java record readers, partitioners, and serialization techniques.
  • Used different data formats (Text, Avro, Parquet, JSON, ORC) while loading data into HDFS.
  • Created managed and external tables in Hive and loaded data from HDFS.
  • Ran complex HiveQL queries on Hive tables for data profiling and reporting.
  • Optimized Hive tables using techniques such as partitioning and bucketing to improve HiveQL query performance.
  • Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
  • Created partitioned tables and loaded data using both static and dynamic partitioning.
  • Created custom user-defined functions in Hive to implement special date functions.
  • Performed Sqoop imports from Oracle to load data into HDFS and directly into Hive tables.
  • Created and scheduled Sqoop jobs for automated batch data loads.
  • Imported data from Amazon S3 and social media (Twitter) into Spark RDDs and performed transformations and actions on the RDDs to provide different teams with the right data for business analytics.
  • Used JSON and XML SerDe properties to load JSON and XML data into Hive tables.
  • Used Spark SQL and Spark DataFrames extensively to cleanse and integrate imported data for more meaningful insights.
  • Worked with several source systems (RDBMS, HDFS, S3) and file formats (JSON, ORC, Parquet) to ingest, transform, and persist data in Hive for downstream consumption.
  • Built Spark applications using IntelliJ and Maven.
  • Worked extensively with the Scala programming language for data engineering using Spark.
  • Scheduled Spark jobs in the production environment using the Oozie scheduler.
  • Maintained Hadoop jobs (Sqoop, Hive, and Spark) in the production environment.

Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, HBase, Shell Scripting, Oozie, Oracle 11g.


Java Developer


  • Involved in reading and generating PDF documents using iText, and merging PDFs dynamically.
  • Involved in the software development life cycle: coding, testing, and implementation.
  • Worked in the health-care domain.
  • Used Java Message Service (JMS) for loosely coupled, reliable, and asynchronous exchange of patient treatment information among J2EE components and legacy systems.
  • Developed MDBs using JMS to exchange messages between different applications using MQ Series.
  • Worked with J2EE design patterns (Singleton, Factory, DAO, and Business Delegate) and Model-View-Controller architecture with JSF and Spring DI.
  • Involved in content management using XML.
  • Developed a standalone module to transform XML 837 data into the database using a SAX parser.
  • Installed, configured, and administered WebSphere ESB v6.x.
  • Worked on performance tuning of WebSphere ESB in different environments on different platforms.
  • Configured and implemented web services specifications in collaboration with an offshore team.
  • Involved in creating dashboard charts (business charts) using FusionCharts.
  • Involved in creating reports for most of the business criteria.
  • Involved in configuring WebLogic servers, data sources, JMS queues, and deployments.
  • Involved in creating queues, MDBs, and workers to support messaging for tracking workflows.
  • Created Hibernate mapping files, sessions, transactions, and Query and Criteria objects to fetch data from the database.
  • Enhanced the design of an application by utilizing SOA.
  • Generated unit test cases with the help of internal tools.
  • Used JNDI for connection pooling.
  • Developed Ant scripts to build and deploy projects onto the application server.

Environment: Java/J2EE, HTML, JS, AJAX, Servlets, JSP, XML, XSLT, XPath, XQuery, WSDL, SOAP, REST, JAX-RS, Jersey, JAX-WS, WebLogic Server 10.3.3, JMS, iText, Eclipse, JUnit, StarTeam, JNDI, Spring Framework (DI, AOP, Batch), Hibernate.
