Data Engineer Resume

Bloomfield, CT


  • 7+ years of IT experience in project development, implementation, deployment and maintenance using Hadoop ecosystem technologies.
  • 5+ years of experience using Hadoop and its ecosystem components such as HDFS, MapReduce, YARN, Spark, Hive, Pig, HBase, ZooKeeper, Oozie, Flume, Storm and Sqoop.
  • In-depth understanding of Hadoop architecture and its components such as JobTracker, TaskTracker, NameNode, DataNode and ResourceManager, and of MapReduce concepts.
  • Strong experience creating real-time data streaming solutions using Apache Spark Core, Spark SQL and DataFrames.
  • Analyzed source data using Informatica PowerCenter Designer to extract and transform data from various source systems (Oracle 10g, DB2, SQL Server and flat files), incorporating business rules using the objects and functions the tool supports.
  • Good experience in designing and implementing data warehousing and business intelligence solutions using ETL tools such as Informatica PowerCenter, Informatica Developer (IDQ), Informatica PowerExchange and Informatica Intelligent Cloud Services (IICS).
  • Hands-on experience with Spark Streaming to receive real-time data from Kafka.
  • Developed simple to complex MapReduce streaming jobs in Java.
  • Worked extensively with Hive DDL and HiveQL.
  • Experience with Azure Data Factory (ADF), Integration Runtime (IR), file system data ingestion and relational data ingestion.
  • Created Azure SQL databases and performed monitoring and restoring of Azure SQL databases. Migrated Microsoft SQL Server to Azure SQL Database.
  • Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark and Spark SQL for data mining, data cleansing, data munging and machine learning.
  • Involved in building the ETL architecture and Source to Target mapping to load data into Data warehouse.
  • Experience with the Requests, ReportLab, NumPy, SciPy, PyTables, cv2, imageio, python-twitter, Matplotlib, httplib2, urllib2, Beautiful Soup and pandas (DataFrame) Python libraries during the development lifecycle.
  • Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, Cassandra, Redis, CouchDB and DynamoDB by installing and configuring various Python packages.
  • Experience working with IICS concepts relating to data integration, Monitor, Administrator, deployments, permissions, schedules.
  • Developed UDFs, UDAFs and UDTFs and used them in Hive queries.
  • Extensive experience with ETL and big data query tools such as Pig Latin and HiveQL.
  • Used Pig as an ETL tool for transformations, event joins, filtering and some pre-aggregations.
  • Experience in analyzing data using HiveQL, Pig Latin and custom MapReduce programs in Java.
  • Developed Hive and Pig scripts for handling business transformations and analyzing data.
  • Developed Sqoop scripts for large dataset transfers between Hadoop and RDBMSs.
  • Experience with job workflow scheduling and monitoring tools such as Oozie and ZooKeeper.
  • Worked on NoSQL databases including HBase, Cassandra and MongoDB.
  • Comprehensive experience in building web-based applications using J2EE frameworks such as Spring, Hibernate, Struts and JMS.
  • Experience in HBase Cluster Setup and Implementation.
  • Good working experience in design and application development using IDEs such as Eclipse, NetBeans and IntelliJ.
  • Experience in setting up automated monitoring and escalation infrastructure for Hadoop Cluster using Ganglia and Nagios.
  • Experience with Big Data ML toolkits, such as Mahout and Spark ML.
  • Expert database engineer with experience in NoSQL and relational data modeling.
  • Assisted in Cluster maintenance, Cluster Monitoring, Managing and Reviewing data backups and log files.
  • Analyzed data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications.
  • Expertise in optimizing traffic across network using Combiners, joining multiple schema datasets using Joins and organizing data using Partitions and Buckets.
  • Experience in installation, configuration, support and monitoring of Hadoop clusters using Apache, Cloudera distributions and AWS.
  • Experience in working with various Cloudera distributions (CDH4/CDH5), Hortonworks and Amazon EMR Hadoop Distributions.
  • Good knowledge of AWS infrastructure services including Amazon Simple Storage Service (Amazon S3), EMR and Amazon Elastic Compute Cloud (Amazon EC2).
  • Extensive knowledge of cloud-based technologies on Amazon Web Services (AWS): VPC, EC2, Route 53, S3, DynamoDB, ElastiCache, Glacier, RRS, CloudWatch, CloudFront, Kinesis, Redshift, SQS, SNS and RDS.
  • Set up standards and processes for Hadoop based application design and implementation.
  • Experience in integration of various data sources like Java, RDBMS, Shell Scripting, Spreadsheets, Text files, XML and Avro.
  • Hands-on experience with SequenceFiles, RCFiles, combiners, counters, dynamic partitions and bucketing for best practices and performance improvement.
  • Generated ETL reports using Tableau and created statistics dashboards for Analytics.
  • Extensive experience in ETL tools like Teradata Utilities, Informatica, Oracle.
  • Proficient in using data visualization tools like Tableau and MS Excel.
  • Experience in component design using UML, including use case, class, sequence, deployment and component diagrams, for the requirements.
  • Familiar with Java virtual machine (JVM) and multi-threaded processing.
  • Familiarity with common computing environments (e.g. Linux, shell scripting).
  • Detailed understanding of Software Development Life Cycle (SDLC) and sound knowledge of project implementation methodologies including Waterfall and Agile.
  • Follows solution-oriented approaches to deliver the right solutions at the right time.
  • Good team player with ability to solve problems, organize and prioritize multiple tasks.
  • Excellent communication and inter-personal skills with technical competency and ability to quickly learn new technologies as required.
  • Ability to blend technical expertise with strong conceptual, business and analytical skills to provide quality solutions, with result-oriented problem-solving and leadership skills.


Hadoop/Big Data: Hadoop (YARN), HDFS, MapReduce, Spark, Hive, Pig, Sqoop, Flume, Kafka, Storm, ZooKeeper, Oozie, Tez, Impala, Mahout, Ganglia, Nagios, Airflow.

Java/J2EE Technologies: Java Beans, JDBC, Servlets, RMI & Web services

Cloud: AWS, Cloudera, Azure

Development Tools: Eclipse, IBM DB2 Command Editor, QTOAD, SQL Developer, Microsoft Suite (Word, Excel, PowerPoint, Access), VMware

Web/Application Servers: Apache Tomcat, WebLogic, WebSphere Application Server.

Frameworks: Hibernate, EJB, Struts, Spring

Programming/Scripting Languages: Java, SQL, Unix Shell Scripting, Python, AngularJS.

Databases: Oracle 11g/10g/9i, MySQL, SQL Server 2005/2008, Teradata 14/12

NoSQL Databases: HBase, Cassandra, MongoDB

ETL Tools: Informatica, IICS

Visualization: Tableau and MS Excel.

Modeling languages: UML Design, Use case, Class, Sequence, Deployment and Component diagrams.

Version Control Tools: Subversion (SVN), Concurrent Versions System (CVS) and IBM Rational ClearCase.

Methodologies: Agile/Scrum, Rational Unified Process and Waterfall.

Operating Systems: Windows 98/2000/XP/Vista/7/8/10, Unix, Linux and Solaris.


Confidential - Bloomfield, CT

Data Engineer


  • Worked in AWS environment for development and deployment of Custom Hadoop Applications.
  • Developed a data pipeline using MapReduce, Flume, Sqoop and Pig to ingest customer behavioral data into HDFS for analysis.
  • Managed and supported enterprise data warehouse operations and advanced big data predictive application development using Cloudera and Hortonworks HDP.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 and ORC/Parquet/text files into AWS Redshift.
  • Worked with teams to set up AWS EC2 instances using AWS services such as S3, EBS, Elastic Load Balancer, Auto Scaling groups, VPC subnets and CloudWatch.
  • Implemented algorithms for real-time analysis in Spark.
  • Assisted in upgrading, configuring and maintaining Hadoop infrastructure components such as Ambari, Spark and Hive.
  • Implemented an AWS-based S3 data lake using a metadata-driven data pipeline written in Python, leveraging AWS Lambda, AWS Data Pipeline, EC2 and Snowflake, and served to business partners through Tableau.
  • Worked with various data formats such as Avro, SequenceFile, JSON, MapFile, Parquet and ORC.
  • Involved in the end-to-end process of Hadoop jobs using Sqoop, Pig, Hive, MapReduce, Spark and shell scripts (for scheduling a few jobs); extracted and loaded data with Sqoop into the data lake environment (Amazon S3), where it was accessed by business users and data scientists.
  • Created several types of data visualizations using Python and Tableau.
  • Used relational modeling and dimensional data modeling with star and snowflake schemas, denormalization, normalization and aggregations.
  • Wrote a data pipeline that fetches Adobe Omniture data every hour and routes it to S3 using SQS.
  • Conducted statistical analysis on healthcare data using Python and various tools.
  • Built trust in manufacturing reports by solving critical bugs reported by the business.
  • Conferred with clients regarding the nature of their data/information processing and reporting/analytical needs.
  • Strong experience working with Elastic MapReduce (EMR) and setting up environments on Amazon AWS EC2 instances.
  • Experience working with key range partitioning in IICS, handling file loads with the file list option, creating fixed-width file formats, file listeners and more.
  • Experience integrating data using IICS for reporting needs.
  • Handled real-time data using Kafka.
  • Used Agile Scrum methodology and Scrum Alliance for development.

Environment: Hadoop, Java, MapReduce, AWS, HDFS, Redshift, Scala, Python, DynamoDB, Spark, Hive, Pig, Linux, XML, Eclipse, Cloudera, CDH4/5 Distribution, Teradata, EC2, Flume, ZooKeeper, Cassandra, Spark MLlib, Informatica, Hortonworks, Elasticsearch, DB2, YARN, SQL Server, Oracle 12c, SQL, MySQL, R.

Confidential - McLean, VA

Data Engineer


  • Developed simple to complex MapReduce streaming jobs in Java for processing and validating data.
  • Developed a data pipeline using MapReduce, Flume, Sqoop and Pig to ingest customer behavioral data into HDFS for analysis.
  • Worked with Spark to improve the performance of and optimize existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames and pair RDDs.
  • Developed MapReduce and Spark jobs to discover trends in data usage by users.
  • Created mappings and mapplets in Informatica PowerCenter to transform data according to business rules.
  • Used the IICS Data Integration console to create mapping templates that bring data into the staging layer from source systems such as SQL Server, Oracle, Teradata, Salesforce, flat files, Excel files and PWX.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Experience working with IICS transformations such as Expression, Joiner, Union, Lookup, Sorter, Filter and Normalizer, and with concepts such as macro fields to templatize column logic, smart match fields, bulk field renaming and more.
  • Implemented slowly changing dimensions (SCD) for some tables as per user requirements.
  • Implemented Spark using Python and Spark SQL for faster processing of data.
  • Implemented algorithms for real-time analysis in Spark.
  • Responsible for designing and implementing the data pipeline using big data tools including Hive, Oozie, Airflow, Spark, Drill, Kylin, Sqoop, Kylo, NiFi, EC2, ELB, S3 and EMR.
  • Used Spark for interactive queries, processing of streaming data and integration with a popular NoSQL database for huge volumes of data.
  • Created and Configured Workflows and Sessions to transport the data to target warehouse Oracle tables using Informatica Workflow Manager.
  • Experience in building Real-time Data Pipelines with Kafka Connect and Spark Streaming.
  • Used the Spark Cassandra Connector to load data to and from Cassandra.
  • Experienced with the Teradata utilities FastLoad, MultiLoad, BTEQ scripting, FastExport and SQL Assistant.
  • Imported data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and loaded the results into HDFS.
  • Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
  • Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
  • Used Informatica as an ETL tool to pull data from source systems and files, then cleanse, transform and load it into Teradata using Teradata utilities.
  • Involved in understanding requirements and in modeling the attributes identified from different source systems (Oracle, Teradata and CSV files). Data was staged, integrated, validated and finally loaded into the Teradata warehouse using Informatica and Teradata utilities.
  • Analyzed the data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin) to study customer behavior.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Worked in an AWS environment for development and deployment of custom Hadoop applications.
  • Developed Pig Latin scripts to perform MapReduce jobs.
  • Wrote a data pipeline that fetches Adobe Omniture data every hour and routes it to S3 using SQS.
  • Developed product profiles using Pig and commodity UDFs.
  • Developed Hive scripts in HiveQL to denormalize and aggregate the data.
  • Created HBase tables and column families to store the user event data.
  • Wrote automated HBase test cases for data quality checks using HBase command-line tools.
  • Created UDFs to store specialized data structures in HBase and Cassandra.
  • Scheduled and executed workflows in Oozie to run Hive and Pig jobs.
  • Used Impala to read, write and query the Hadoop data in HDFS from HBase or Cassandra.
  • Used Tez framework for building high performance jobs in Pig and Hive.
  • Configured Kafka to read and write messages from external programs.
  • Configured Kafka to handle real-time data.
  • Used ZooKeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
  • Used Sqoop to import data from relational databases such as Oracle and MySQL into Cassandra tables, designed column families in Cassandra, performed data transformations, and exported the transformed data to Cassandra as per business requirements.
  • Developed end-to-end data processing pipelines that begin with receiving data through the distributed messaging system Kafka and end with persisting the data into HBase.
  • Wrote a Storm topology to emit data into Cassandra.
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Wrote a Storm topology to accept data from a Kafka producer and process the data.
  • Installed Solr on web servers to index the search data and performed real-time updates.
  • Developed core search component using Solr.
  • Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
  • Used the JUnit framework to perform unit testing of the application.
  • Developed interactive shell scripts for scheduling various data cleansing and data loading processes.
  • Performed data validation on the data ingested using MapReduce by building a custom model to filter all the invalid data and cleanse the data.
  • Experience with data wrangling and creating workable datasets.

Environment: Hadoop, MapReduce, Spark, AWS, Azure, Azure Data Lake, Pig, Hive, Sqoop, Oozie, HBase, ZooKeeper, Kafka, Flume, Solr, Storm, Tez, Impala, Mahout, Cassandra, Teradata, Informatica, Cloudera Manager, MySQL, Jaspersoft, Multi-node cluster with Linux-Ubuntu, Windows, Unix.

Confidential - Brooklyn, NY

Hadoop Developer


  • Worked in the BI team for Hadoop clusters implementation and data integration in developing large-scale system software.
  • Involved in requirement gathering for the project.
  • Developed simple to complex MapReduce jobs in Java to perform data extraction, aggregation, transformation and rule checks on multiple file formats such as XML, JSON and CSV, as well as compressed file formats.
  • Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
  • Involved in running Hadoop streaming jobs to process terabytes of text data.
  • Imported and exported data between RDBMS and HDFS using Sqoop.
  • Developed Hive scripts to create metastores and tables/partitions and to load data into the tables/partitions.
  • Extensively used Pig for data cleansing and processing and performing transformations.
  • Developed Pig Latin scripts to extract the data from the web server output files to transform and load into HDFS.
  • Implemented business logic by writing Hive and Pig UDFs and HiveQL to process the data for analysis.
  • Exported the result sets from Hive to RDBMS using Shell scripts.
  • Involved in tuning Hive and Pig scripts to improve performance.
  • Involved in database migrations and transfers of data from various databases to HDFS, storing the data in different file formats (Text, Avro), and in virtualization for applications.
  • Used Mahout to understand the machine learning algorithms for efficient data processing.
  • Developed and configured Oozie workflow engine for scheduling and managing the Pig, Hive and Sqoop jobs.
  • Used Zookeeper for various types of centralized configurations.
  • Used Apache Tez for batch and interactive data processing on Pig and Hive jobs.
  • Wrote a Storm bolt to emit data into HBase, HDFS and RabbitMQ Web STOMP.
  • Wrote JUnit test cases for the Storm topology.
  • Deployed an Apache Solr search engine server to speed up the search process.
  • Created a wrapper library to help the rest of the team use the Solr database.
  • Customized Apache Solr to handle fallback searching and provide custom functions.
  • Involved in managing and reviewing Hadoop log files for any warnings or failures.
  • Supported the production rollout, including monitoring the solution post go-live and resolving any issues discovered by the client and client services teams.
  • Documented operational problems using JIRA, following standards and procedures.

Environment: Apache Hadoop, MapReduce, HDFS, Pig, Hive, HBase, Sqoop, Oozie, Solr, Mahout, Impala, Tez, Kafka, Storm, ZooKeeper, IDE, Java, DataStax, Flat files, JIRA, Oracle 11g/10g, MySQL, Toad, SVN, Windows, UNIX.


Java Developer


  • Involved in gathering requirements and analysis through interaction with the end users.
  • Worked directly with clients in automating release management tasks, reducing defect counts in the testing phases to ensure smooth implementation of projects.
  • Involved in the design and creation of class diagrams, sequence diagrams and activity diagrams using UML models.
  • Designed and developed the application using various Design Patterns such as Front controller, Session Facade and Service Locator.
  • Developed the Search Widget using JSP, Struts, Tiles, JavaScript and AJAX.
  • Created the scripting code to validate the data.
  • Involved in developing JSP pages using Struts custom tags, jQuery and Tiles Framework.
  • Used JavaScript to perform client-side validation and the Struts Validator framework for server-side validation.
  • Implemented singleton classes for loading properties and static data from the database.
  • Debugged and developed applications using Rational Application Developer (RAD).
  • Developed a Web service to communicate with the database using SOAP.
  • Developed DAO (data access objects) using Spring Framework 3.
  • Deployed the components to WebSphere Application Server 7.
  • Generated build files using Maven.
  • Implemented Hibernate in the data access object layer to access and update information in the Oracle Database.
  • Developed a test environment for testing all the web services exposed as part of the core module and their integration with partner services during integration testing.
  • Involved in writing queries, stored procedures and functions using SQL and PL/SQL, and in backend tuning of SQL queries and DB scripts.
  • Responsible for performing end-to-end system testing of the application and writing JUnit test cases.
  • As part of the development team, contributed to application support in the soft launch and UAT phases and to production support, using IBM ClearQuest for fixing bugs.

Environment: Java EE, IBM WebSphere Application Server, Apache Struts, EJB, Spring, JSP, Web Services, jQuery, Servlet, Struts Validator, Struts Tiles, Tag Libraries, Maven, JDBC, Oracle 10g/SQL, JUnit, CVS, AJAX, Rational ClearCase, Eclipse, JSTL, DHTML, Windows, UNIX.