
Data Engineer Resume

Bloomfield, CT


  • 7+ years of IT experience in project development, implementation, deployment and maintenance using Hadoop ecosystem technologies.
  • 5+ years of experience using Hadoop and its ecosystem components like HDFS, MapReduce, YARN, Spark, Hive, Pig, HBase, ZooKeeper, Oozie, Flume, Storm and Sqoop.
  • In-depth understanding of Hadoop architecture and its components such as JobTracker, TaskTracker, NameNode, DataNode, ResourceManager and MapReduce concepts.
  • Strong experience creating real-time data streaming solutions using Apache Spark Core, Spark SQL and DataFrames.
  • Used Informatica PowerCenter Designer to analyze the source data and to extract and transform it from various source systems (Oracle 10g, DB2, SQL Server and flat files), incorporating business rules using the objects and functions that the tool supports.
  • Good experience in designing and implementing data warehousing and business intelligence solutions using ETL tools like Informatica PowerCenter, Informatica Developer (IDQ), Informatica PowerExchange and Informatica Intelligent Cloud Services (IICS).
  • Hands-on experience with Spark Streaming to receive real-time data using Kafka.
  • Developed simple to complex MapReduce streaming jobs using Java.
  • Worked extensively with Hive DDLs and HiveQL.
  • Experience with Azure Data Factory (ADF), Integration Runtime (IR), file system data ingestion and relational data ingestion.
  • Created Azure SQL databases and performed monitoring and restoring of Azure SQL databases. Performed migration of Microsoft SQL Server to Azure SQL Database.
  • Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark and Spark SQL for data mining, data cleansing, data munging and machine learning.
  • Involved in building the ETL architecture and source-to-target mapping to load data into the data warehouse.
  • Experience with Requests, ReportLab, NumPy, SciPy, PyTables, cv2, imageio, python-twitter, Matplotlib, httplib2, urllib2, Beautiful Soup, DataFrame and Pandas Python libraries during the development lifecycle.
  • Hands-on experience in handling database issues and connections with SQL and NoSQL databases like MongoDB, Cassandra, Redis, CouchDB and DynamoDB by installing and configuring various packages in Python.
  • Experience working with IICS concepts relating to data integration, Monitor, Administrator, deployments, permissions and schedules.
  • Developed UDF, UDAF and UDTF functions and used them in Hive queries.
  • Extensive experience with ETL and big data query tools like Pig Latin and HiveQL.
  • Used Pig as an ETL tool to do transformations, event joins, filtering and some pre-aggregations.
  • Experience in analyzing data using HiveQL, Pig Latin and custom MapReduce programs in Java.
  • Developed Hive and Pig scripts for handling business transformations and analyzing data.
  • Developed Sqoop scripts for large dataset transfer between Hadoop and RDBMS.
  • Experience in job workflow scheduling and monitoring tools like Oozie and ZooKeeper.
  • Worked on NoSQL databases including HBase, Cassandra and MongoDB.
  • Comprehensive experience in building web-based applications using J2EE frameworks like Spring, Hibernate, Struts and JMS.
  • Experience in HBase cluster setup and implementation.
  • Good working experience in design and application development using IDEs like Eclipse, NetBeans and IntelliJ.
  • Experience in setting up automated monitoring and escalation infrastructure for Hadoop clusters using Ganglia and Nagios.
  • Experience with big data ML toolkits such as Mahout and Spark ML.
  • Expert database engineer with NoSQL and relational data modeling experience.
  • Assisted in Cluster maintenance, Cluster Monitoring, Managing and Reviewing data backups and log files.
  • Analyzed data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications.
  • Expertise in optimizing traffic across the network using combiners, joining multiple schema datasets using joins and organizing data using partitions and buckets.
  • Experience in installation, configuration, support and monitoring of Hadoop clusters using Apache and Cloudera distributions and AWS.
  • Experience working with various Cloudera distributions (CDH4/CDH5), Hortonworks and Amazon EMR Hadoop distributions.
  • Good knowledge of AWS infrastructure services: Amazon Simple Storage Service (Amazon S3), EMR and Amazon Elastic Compute Cloud (Amazon EC2).
  • Extensive knowledge of cloud-based technologies using Amazon Web Services (AWS): VPC, EC2, Route 53, S3, DynamoDB, ElastiCache, Glacier, RRS, CloudWatch, CloudFront, Kinesis, Redshift, SQS, SNS, RDS.
  • Set up standards and processes for Hadoop based application design and implementation.
  • Experience in integration of various data sources like Java, RDBMS, Shell Scripting, Spreadsheets, Text files, XML and Avro.
  • Hands-on experience with sequence files, RC files, combiners, counters, dynamic partitions and bucketing for best practice and performance improvement.
  • Generated ETL reports using Tableau and created statistics dashboards for Analytics.
  • Extensive experience in ETL tools like Teradata Utilities, Informatica, Oracle.
  • Proficient in using data visualization tools like Tableau and MS Excel.
  • Experience in component design using UML: use case, class, sequence, deployment and component diagrams for the requirements.
  • Familiar with the Java virtual machine (JVM) and multi-threaded processing.
  • Familiarity with common computing environments (e.g. Linux, shell scripting).
  • Detailed understanding of the Software Development Life Cycle (SDLC) and sound knowledge of project implementation methodologies including Waterfall and Agile.
  • Follows solution-oriented approaches to deliver the right solutions at the right time.
  • Good team player with the ability to solve problems, organize and prioritize multiple tasks.
  • Excellent communication and interpersonal skills with technical competency and the ability to quickly learn new technologies as required.
  • Ability to blend technical expertise with strong conceptual, business and analytical skills to provide quality solutions, with result-oriented problem-solving techniques and leadership skills.


Hadoop/Big Data: Hadoop (YARN), HDFS, MapReduce, Spark, Hive, Pig, Sqoop, Flume, Kafka, Storm, ZooKeeper, Oozie, Tez, Impala, Mahout, Ganglia, Nagios, Airflow.

Java/J2EE Technologies: Java Beans, JDBC, Servlets, RMI & Web services

Cloud: AWS, Cloudera, Azure

Development Tools: Eclipse, IBM DB2 Command Editor, QTOAD, SQL Developer, Microsoft Suite (Word, Excel, PowerPoint, Access), VMware

Web/Application Servers: Apache Tomcat, WebLogic, WebSphere Application Server.

Frameworks: Hibernate, EJB, Struts, Spring

Programming/Scripting Languages: Java, SQL, Unix Shell Scripting, Python, AngularJS.

Databases: Oracle 11g/10g/9i, MySQL, SQL Server 2005/2008, Teradata 14/12

NoSQL Databases: HBase, Cassandra, MongoDB

ETL Tools: Informatica, IICS

Visualization: Tableau and MS Excel.

Modeling languages: UML Design, Use case, Class, Sequence, Deployment and Component diagrams.

Version Control Tools: Subversion (SVN), Concurrent Versions System (CVS) and IBM Rational ClearCase.

Methodologies: Agile/Scrum, Rational Unified Process and Waterfall.

Operating Systems: Windows 98/2000/XP/Vista/7/8/10, Unix, Linux and Solaris.


Confidential - Bloomfield, CT

Data Engineer


  • Worked in an AWS environment for development and deployment of custom Hadoop applications.
  • Developed a data pipeline using MapReduce, Flume, Sqoop and Pig to ingest customer behavioral data into HDFS for analysis.
  • Managed and supported enterprise data warehouse operations and big data advanced predictive application development using Cloudera and Hortonworks HDP.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources like S3 and ORC/Parquet/text files into AWS Redshift.
  • Worked with teams in setting up AWS EC2 instances using different AWS services like S3, EBS, Elastic Load Balancer, Auto Scaling groups, VPC subnets and CloudWatch.
  • Implemented algorithms for real-time analysis in Spark.
  • Assisted in upgrading, configuring and maintaining various Hadoop infrastructure components like Ambari, Spark and Hive.
  • Implemented an AWS-based S3 data lake using a metadata-driven data pipeline leveraging AWS Lambda, AWS Data Pipeline, EC2 and Snowflake, written in Python and served to business partners through Tableau.
  • Worked on various data formats like AVRO, Sequence File, JSON, Map File, Parquet and ORC.
  • Involved in the end-to-end process of Hadoop jobs that used technologies such as Sqoop, Pig, Hive, MapReduce, Spark and shell scripts (for scheduling a few jobs); extracted and loaded data with Sqoop into the data lake environment (Amazon S3), which was accessed by business users and data scientists.
  • Created several types of data visualizations using Python and Tableau.
  • Used relational modeling and dimensional data modeling with star and snowflake schemas, denormalization, normalization and aggregations.
  • Wrote a data pipeline that fetches Adobe Omniture data, which is routed to S3 using SQS every hour.
  • Conducted statistical analysis on healthcare data using Python and various tools.
  • Built trust in manufacturing reports by solving critical bugs reported by the business.
  • Conferred with clients regarding the nature of the data/information processing or reporting/analytical needs.
  • Strong experience working with Elastic MapReduce (EMR) and setting up environments on Amazon AWS EC2 instances.
  • Experience working with key range partitioning in IICS, handling file loads with the file list option, creating fixed-width file formats, using file listeners and more.
  • Experience integrating data using IICS for reporting needs.
  • Handled real-time data using Kafka.
  • Used Agile Scrum methodology and Scrum Alliance for development.

Environment: Hadoop, Java, MapReduce, AWS, HDFS, Redshift, Scala, Python, DynamoDB, Spark, Hive, Pig, Linux, XML, Eclipse, Cloudera, CDH4/5 Distribution, Teradata, EC2, Flume, ZooKeeper, Cassandra, Spark MLlib, Informatica, Hortonworks, Elasticsearch, DB2, YARN, SQL Server, Oracle 12c, SQL, MySQL, R.

Confidential - McLean, VA

Data Engineer


  • Developed simple to complex MapReduce streaming jobs in Java for processing and validating the data.
  • Developed a data pipeline using MapReduce, Flume, Sqoop and Pig to ingest customer behavioral data into HDFS for analysis.
  • Worked with Spark to improve the performance of and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
  • Developed MapReduce and Spark jobs to discover trends in data usage by users.
  • Used Informatica PowerCenter to create mappings and mapplets to transform the data according to the business rules.
  • Effectively used the IICS Data Integration console to create mapping templates to bring data into the staging layer from different source systems like SQL Server, Oracle, Teradata, Salesforce, flat files, Excel files and PWX.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Experience working with IICS transformations like expression, joiner, union, lookup, sorter, filter and normalizer, and with concepts like macro fields to templatize column logic, smart match fields, bulk field renaming and more.
  • Implemented slowly changing dimensions (SCD) for some of the tables as per user requirements.
  • Implemented Spark using Python and Spark SQL for faster processing of data.
  • Implemented algorithms for real-time analysis in Spark.
  • Responsible for designing and implementing the data pipeline using big data tools including Hive, Oozie, Airflow, Spark, Drill, Kylin, Sqoop, Kylo, NiFi, EC2, ELB, S3 and EMR.
  • Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL databases for huge volumes of data.
  • Created and configured workflows and sessions to transport the data to target warehouse Oracle tables using Informatica Workflow Manager.
  • Experience in building real-time data pipelines with Kafka Connect and Spark Streaming.
  • Used the Spark-Cassandra Connector to load data to and from Cassandra.
  • Experienced with Teradata utilities: FastLoad, MultiLoad, BTEQ scripting, FastExport and SQL Assistant.
  • Imported data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and then loaded the data into HDFS.
  • Exported the analyzed data to relational databases using Sqoop, to further visualize and generate reports for the BI team.
  • Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
  • Used Informatica as an ETL tool to pull data from source systems and files, then cleanse, transform and load the data into Teradata using Teradata utilities.
  • Involved in understanding requirements and in modeling the attributes identified from different source systems in Oracle, Teradata and CSV files. Data was staged, integrated, validated and finally loaded into the Teradata warehouse using Informatica and Teradata utilities.
  • Analyzed the data by running Hive queries (HiveQL) and Pig scripts (Pig Latin) to study customer behavior.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Worked in an AWS environment for development and deployment of custom Hadoop applications.
  • Developed Pig Latin scripts to perform MapReduce jobs.
  • Wrote a data pipeline that fetches Adobe Omniture data, which is routed to S3 using SQS every hour.
  • Developed product profiles using Pig and commodity UDFs.
  • Developed Hive scripts in HiveQL to denormalize and aggregate the data.
  • Created HBase tables and column families to store the user event data.
  • Wrote automated HBase test cases for data quality checks using HBase command-line tools.
  • Created UDFs to store specialized data structures in HBase and Cassandra.
  • Scheduled and executed workflows in Oozie to run Hive and Pig jobs.
  • Used Impala to read, write and query the Hadoop data in HDFS from HBase or Cassandra.
  • Used Tez framework for building high performance jobs in Pig and Hive.
  • Configured Kafka to read and write messages from external programs.
  • Configured Kafka to handle real-time data.
  • Used ZooKeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
  • Used Sqoop to import data into Cassandra tables from relational databases like Oracle and MySQL, designed column families in Cassandra, performed data transformations and then exported the transformed data to Cassandra as per the business requirement.
  • Developed end-to-end data processing pipelines, from receiving data through the Kafka distributed messaging system to persisting it in HBase.
  • Wrote a Storm topology to emit data into Cassandra.
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Wrote a Storm topology to accept data from a Kafka producer and process the data.
  • Installed Solr on web servers to index the search data and performed real-time updates.
  • Developed core search component using Solr.
  • Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
  • Used the JUnit framework to perform unit testing of the application.
  • Developed interactive shell scripts for scheduling various data cleansing and data loading processes.
  • Performed data validation on the data ingested using MapReduce by building a custom model to filter out all the invalid data and cleanse the data.
  • Experience with data wrangling and creating workable datasets.

Environment: Hadoop, MapReduce, Spark, AWS, Azure, Azure Data Lake, Pig, Hive, Sqoop, Oozie, HBase, ZooKeeper, Kafka, Flume, Solr, Storm, Tez, Impala, Mahout, Cassandra, Teradata, Informatica, Cloudera Manager, MySQL, Jaspersoft, multi-node cluster with Linux-Ubuntu, Windows, Unix.

Confidential - Brooklyn, NY

Hadoop Developer


  • Worked in the BI team on Hadoop cluster implementation and data integration in developing large-scale system software.
  • Involved in requirement gathering for the project.
  • Developed simple to complex MapReduce jobs in Java to perform data extraction, aggregation, transformation and rule checks on multiple file formats like XML, JSON, CSV and compressed file formats.
  • Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
  • Involved in running Hadoop streaming jobs to process terabytes of text data.
  • Imported and exported the data between RDBMS and HDFS using Sqoop.
  • Developed Hive scripts to create data metastores and tables/partitions and load data into the tables/partitions.
  • Extensively used Pig for data cleansing and processing and performing transformations.
  • Developed Pig Latin scripts to extract the data from the web server output files to transform and load into HDFS.
  • Implemented business logic by writing UDFs in Hive and Pig, and used HiveQL to process the data for analysis.
  • Exported the result sets from Hive to RDBMS using shell scripts.
  • Involved in tuning Hive and Pig scripts to improve performance.
  • Involved in database migrations and transfer of data from various databases to HDFS, storing it in different file formats (Text, Avro), and in virtualization for applications.
  • Used Mahout to understand the machine learning algorithms for efficient data processing.
  • Developed and configured the Oozie workflow engine for scheduling and managing the Pig, Hive and Sqoop jobs.
  • Used ZooKeeper for various types of centralized configurations.
  • Used Apache Tez for batch and interactive data processing on Pig and Hive jobs.
  • Wrote a Storm bolt to emit data into HBase, HDFS and RabbitMQ Web STOMP.
  • Wrote JUnit test cases for the Storm topology.
  • Deployed an Apache Solr search engine server to speed up the search process.
  • Created a wrapper library to help the rest of the team use the Solr database.
  • Customized Apache Solr to handle fallback searching and provide custom functions.
  • Involved in managing and reviewing Hadoop log files for any warnings or failures.
  • Supported the production rollout, which included monitoring the solution post go-live and resolving any issues discovered by the client and client services teams.
  • Documented operational problems by following standards and procedures using JIRA.

Environment: Apache Hadoop, MapReduce, HDFS, Pig, Hive, HBase, Sqoop, Oozie, Solr, Mahout, Impala, Tez, Kafka, Storm, ZooKeeper, IDE, Java, DataStax, Flat files, JIRA, Oracle 11g/10g, MySQL, Toad, SVN, Windows, UNIX.


Java Developer


  • Involved in gathering requirements and analysis through interaction with the end users.
  • Worked directly with clients in automating release management tasks, reducing defect counts in the testing phases to ensure smooth implementation of projects.
  • Involved in the design and creation of class diagrams, sequence diagrams and activity diagrams using UML models.
  • Designed and developed the application using various design patterns such as Front Controller, Session Facade and Service Locator.
  • Developed the Search Widget using JSP, Struts, Tiles, JavaScript and AJAX.
  • Created the scripting code to validate the data.
  • Involved in developing JSP pages using Struts custom tags, jQuery and the Tiles framework.
  • Used JavaScript to perform client-side validation and the Struts Validator framework for server-side validation.
  • Implemented singleton classes for property loading and static data from the DB.
  • Debugged and developed applications using Rational Application Developer (RAD).
  • Developed a web service to communicate with the database using SOAP.
  • Developed DAOs (data access objects) using Spring Framework 3.
  • Deployed the components into WebSphere Application Server 7.
  • Generated build files using Maven.
  • Implemented Hibernate in the data access object layer to access and update information in the Oracle database.
  • Developed a test environment for testing all the web services exposed as part of the core module and their integration with partner services in integration testing.
  • Involved in writing queries, stored procedures and functions using SQL and PL/SQL, and in backend tuning of SQL queries and DB scripts.
  • Responsible for performing end-to-end system testing of the application and writing JUnit test cases.
  • As part of the development team, contributed to application support in the soft launch and UAT phases, and to production support using IBM ClearQuest for fixing bugs.

Environment: Java EE, IBM WebSphere Application Server, Apache Struts, EJB, Spring, JSP, Web Services, jQuery, Servlet, Struts Validator, Struts Tiles, Tag Libraries, Maven, JDBC, Oracle 10g/SQL, JUnit, CVS, AJAX, Rational ClearCase, Eclipse, JSTL, DHTML, Windows, UNIX.
