
Data Engineer Resume


  • Over 9 years of diversified IT experience in end-to-end data analytics platforms (ETL, BI, Java), including Big Data, Hadoop, Java/J2EE development, Informatica, data modeling and system analysis, in the banking, finance, insurance and telecom domains.
  • Worked for 4 years with the AWS Big Data/Hadoop ecosystem, implementing a data lake.
  • Hands-on experience with the Hadoop framework and its ecosystem, including the distributed file system (HDFS), MapReduce, Pig, Hive, Sqoop, Flume and Spark.
  • Experience across the layers of the Hadoop framework: storage (HDFS), analysis (Pig and Hive) and engineering (jobs and workflows), and in extending functionality by writing custom UDFs.
  • Extensive experience in developing data warehouse applications using Hadoop, Informatica, Oracle, Teradata and MS SQL Server on UNIX and Windows platforms; experienced in creating complex mappings using various transformations and in developing Extraction, Transformation and Loading (ETL) strategies using Informatica 9.x/8.x.
  • Proficient in Hive Query Language and experienced in Hive performance optimization using static partitioning, dynamic partitioning, bucketing and parallel execution.
  • As a Data Architect, designed and maintained high-performance ELT/ETL processes.
  • Experience in analyzing data using HiveQL, Pig Latin, custom MapReduce programs in Java and custom UDFs.
  • Good Understanding of Hadoop Architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node and MapReduce concepts.
  • Knowledge of AWS (Amazon Web Services) cloud computing infrastructure.
  • Created modules for streaming data into the data lake using Storm and Spark.
  • Experience in dimensional data modeling (star schema, snowflake schema, fact and dimension tables), concepts such as Lambda Architecture and batch processing, and Oozie.
  • Extensively used Informatica client tools Source Analyzer, Warehouse designer, Mapping designer, Mapplet Designer, ETL Transformations, Informatica Repository Manager and Informatica Server Manager, Workflow Manager & Workflow Monitor.
  • Expertise in core Java, J2EE, multithreading, JDBC and shell scripting; proficient in using Java APIs such as Collections, Servlets and JSP for application development.
  • Worked closely with Dev and QA teams to review pre- and post-processed data to ensure data accuracy and integrity.
  • Experience in Java, J2EE, JDBC, Collections, Servlets, JSP, Struts, Spring, Hibernate, JSON, XML, REST, SOAP web services, Groovy, MVC, Eclipse, WebLogic, WebSphere and Apache Tomcat servers.
  • Working experience with functional programming in Scala and Java.
  • Extensive knowledge of Data Modeling, Data Conversions, Data integration and Data Migration with specialization in Informatica Power Center.
  • Expertise in extracting, transforming and loading data from heterogeneous systems such as flat files, Excel, Oracle, Teradata and MS SQL Server.
  • Good work experience with UNIX/Linux commands, scripting and deploying the applications on the servers.
  • Strong skills in algorithms, data structures, Object oriented design, Design patterns, documentation and QA/testing.
  • Experienced in working as part of fast-paced Agile teams, with exposure to testing in Scrum teams and test-driven development.
  • Excellent domain knowledge in Insurance, Telecom and Banking/Finance.


BigData Technologies: AWS EMR, S3, EC2-Fleet, Spark-2.0, Hortonworks HDP, Hadoop, MapReduce, Pig, Hive, Apache Spark, SparkSQL, Informatica Power Center 9.6.1/8.x, Kafka, NoSQL, Elastic MapReduce (EMR), Hue, YARN, NiFi, Impala, Sqoop, Solr, Oozie.

Databases: Hortonworks HDP, Oracle 10g/11g, Teradata, DB2, Microsoft SQL Server, MySQL, NoSQL, SQL databases.

Platforms (O/S): Red-Hat LINUX, Ubuntu, Windows NT/2000/XP.

Programming languages: Java, Scala, SQL, UNIX shell script, JDBC, Python, Perl.

Security Management: Hortonworks Ambari, Cloudera Manager, Apache Knox, XA Secure, Kerberos.

Web-technologies: DHTML, HTML, XHTML, XML, XSL (XSLT, XPATH), XSD, CSS, JavaScript, SOAP, RESTful, Agile, Design Patterns

Data warehousing: Informatica Powercenter/Powermart/Dataquality/Bigdata, Pentaho, ETL Development, Amazon Redshift, IDQ.

Database Tools: JDBC, Hadoop, Hive, NoSQL, SQL Navigator, SQL Developer, TOAD, SQL Plus, SAP Business Objects

Data Modeling: Rational Rose, Erwin 7.3/7.1/4.1/4.0

Code Editors: Eclipse, IntelliJ



Data Engineer


  • Responsible for building scalable distributed data solutions using EMR cluster environment with Amazon EMR 5.6.1.
  • Worked with the Kafka REST API to collect data and load it into the Hadoop file system, and used Sqoop to load data from relational databases.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames and saved it in Parquet format in HDFS.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka and persisted the results to HDFS.
  • Developed Spark scripts by writing custom RDDs in Scala for data transformations and performed actions on RDDs.
  • Worked on creating Spring-Boot services for Oozie orchestration.
  • Deployed Spring-Boot entity services for Audit Framework of the loaded data.
  • Worked with Avro, Parquet and ORC file formats and compression techniques such as LZO.
  • Used Hive to form an abstraction on top of structured data residing in HDFS and implemented partitions, dynamic partitions and buckets on Hive tables.
  • Used Spark API over Hadoop YARN as execution engine for data analytics using Hive.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
  • Worked on migrating MapReduce programs into Spark transformations using Scala.
  • Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
  • Used the Apache Oozie job scheduler to execute workflows.
  • Used Ambari to monitor node health and the status of jobs in Hadoop clusters.
  • Designed and implemented data warehouses and data marts using components of the Kimball methodology, such as Data Warehouse Bus, Conformed Facts & Dimensions, Slowly Changing Dimensions, Surrogate Keys, Star Schema and Snowflake Schema.
  • Worked on Tableau to build customized interactive reports, worksheets and dashboards.
  • Implemented Kerberos for strong authentication to provide data security.
  • Implemented LDAP and Active Directory for Hadoop clusters.
  • Worked with Apache Solr for indexing and load-balanced querying to search for specific data in larger datasets.
  • Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.

Environment: AWS S3, EMR, Lambda, CloudWatch, Amazon Redshift, Spark-Java, Spark-Scala, Athena, Hive, HDFS, Spark, Scala, Oozie, Bitbucket, GitHub.

Confidential, Tampa, FL

Data Engineer


  • Prepared ETL design document which consists of the database structure, change data capture, Error handling, restart and refresh strategies.
  • Worked with data feeds in different formats such as JSON, CSV, XML and DAT, and implemented the data lake concept.
  • Developed Informatica design mappings using various transformations.
  • Most of the infrastructure was on AWS.
  • Maintained end-to-end ownership of data analysis, framework development, implementation and communication for a range of customer analytics projects.
  • Good exposure to the IRI end-to-end analytics service engine and the new big data platform (Hadoop loader framework, big data Spark framework, etc.).
  • Used a Kafka producer to ingest raw data into Kafka topics and ran the Spark Streaming app to process clickstream events.
  • Performed data analysis and predictive data modeling.
  • Explored clickstream event data with Spark SQL.
  • Architected and implemented, hands-on in production, the big data MapR Hadoop solution for Digital Media Marketing using telecom data, shipment data, point of sale (POS) data, and exposure and advertising data related to Consumer Product Goods.
  • Used Spark SQL, as part of the Apache Spark big data framework, to process structured shipment, POS, consumer, household, individual digital impression and household TV impression data.
  • Created DataFrames from different data sources like Existing RDDs, Structured data files, JSON Datasets, Hive tables, External databases.
  • Loaded terabytes of raw data at different levels into Spark RDDs for computation to generate the output response.
  • Loaded data from HDFS into Spark RDDs to run predictive analytics on it.
  • Used HiveContext, which provides a superset of the functionality provided by SQLContext, and preferred to write queries using the HiveQL parser to read data from Hive tables (fact, syndicate).
  • Modeled Hive partitions extensively for data separation and faster data processing and followed Hive best practices for tuning.
  • Cached RDDs for better performance and performed actions on each RDD.
  • Created Hive fact tables on top of raw data from different retailers, partitioned by time dimension key, retailer name and data supplier name; these were then pulled and processed further by the analytics service engine.
  • Developed highly complex Python and Scala code, which is maintainable, easy to use, and satisfies application requirements, data processing and analytics using inbuilt libraries.
  • Successfully loaded files to Hive and HDFS from Oracle and SQL Server using Sqoop.
  • Led a major new initiative focused on media analytics and forecasting, intended to deliver the sales lift associated with customer marketing campaign initiatives.
  • Responsibilities included platform specification and redesign of load processes, as well as projections of future platform growth.
  • Coordinated deployments to the QA and PROD environments.
  • Used Python to automate Hive and to read configuration files.
  • Used Spark for fast data processing, with both the Spark shell and a Spark standalone cluster.
  • Used Hive to analyze the partitioned data and compute various metrics for reporting.

Environment: MapReduce, HDFS, Hive, Python, Scala, Kafka, Spark, Spark SQL, Oracle, Informatica 9.6, SQL, MapR, Sqoop, ZooKeeper, AWS EMR, AWS S3, Data Pipeline, Jenkins, Git, JIRA, Unix/Linux, Agile Methodology, Scrum.


Hadoop Consultant


  • Analyzed the requirements and prepared the architecture document for the Big Data project.
  • Worked with the Hortonworks distribution.
  • Supported MapReduce Java programs running on the cluster.
  • Optimized Amazon Redshift clusters, Apache Hadoop clusters, data distribution and data processing.
  • Developed MapReduce programs to process Avro files and compute results from the data, and also performed map-side joins.
  • Imported Bulk Data into HBase Using MapReduce programs.
  • Programmed ETL functions between Oracle and Amazon Redshift.
  • Used the REST API to access HBase data for analytics.
  • Designed and implemented Incremental Imports into Hive tables.
  • Involved in creating Hive tables, loading them with data and writing Hive queries that run internally as MapReduce jobs.
  • Involved in collecting, aggregating and moving data from servers to HDFS using Flume.
  • Imported and exported data between relational data sources such as DB2, SQL Server and Teradata and HDFS using Sqoop.
  • Migrated complex MapReduce programs to in-memory Spark processing using transformations and actions.
  • Collected real-time data from Kafka using Spark Streaming, performed transformations and aggregations on the fly to build the common learner data model, and persisted the data to HBase.
  • Worked on a POC for IoT device data with Spark.
  • Used Scala to store streaming data to HDFS and to implement Spark for faster data processing.
  • Created RDDs and DataFrames for the required input data and performed data transformations using Spark with Python.
  • Developed Spark SQL queries and DataFrames to import data from data sources, perform transformations and read/write operations, and save the results to an output directory in HDFS.
  • Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
  • Developed Pig scripts for the analysis of semi-structured data.
  • Developed Pig UDFs to manipulate data according to business requirements and also developed custom Pig loaders.
  • Worked on Oozie workflow engine for job scheduling.
  • Developed Oozie workflow for scheduling and orchestrating the ETL process.
  • Experienced in managing and reviewing the Hadoop log files using Shell scripts.
  • Migrated ETL jobs to Pig scripts to perform transformations, joins and some pre-aggregations before storing the data in HDFS.
  • Worked on different file formats like Sequence files, XML files and Map files using MapReduce Programs.
  • Used the Avro data serialization system to work with JSON data formats.
  • Used AWS S3 to store large amounts of data in identical/similar repositories.
  • Built applications using Maven and integrated them with continuous integration servers such as Jenkins to build jobs.
  • Used the Enterprise Data Warehouse database to store information and make it accessible across the organization.
  • Responsible for preparing technical specifications, analyzing functional Specs, development and maintenance of code.
  • Worked with the Data Science team to gather requirements for various data mining projects.
  • Wrote shell scripts to automate rolling day-to-day processes.


Senior Application Consultant

Environment: Hadoop 2.0, Sqoop, Java, Apache HBase, Informatica Power Center, IDQ analyst, DB Visualizer, Windows.


  • Created a consolidated loss information file covering various levels of business such as claim, policy and transaction, plus miscellaneous data.
  • The source systems were tables from Guidewire ClaimCenter and the policy/insurance management system.
  • Executed POCs with Amazon Redshift to test whether the DWH fit our requirements.
  • Migrated claims data from Oracle to the analytical database created in Hadoop using Sqoop; worked extensively with Sqoop to import metadata from Oracle.
  • The loss information file was supplied to the mainframe to complete further business batch processes.
  • Worked hands-on with ETL/ELT script processes.
  • Involved in creating extended dimensional modeling tables, such as snowflake schemas, for OBIEE.
  • Played a key role in designing and implementing MapReduce-based applications for data validation. The data involved records and logs received from the client's various production devices.
  • Analyzed Session Log files in case the session fails in order to resolve errors in mapping or session configurations.
  • Worked with the Data Governance team to implement the rules and build the physical data model on Hive in the data lake.
  • Mentored and delivered trainings to other team members on Hadoop ecosystem targeting MapReduce and Hive for cross-skill training.
  • Wrote multiple MapReduce programs for data extraction, transformation and aggregation from multiple file formats, including XML, JSON, CSV, Avro and other compressed file formats.
  • Performed unit testing, data reconciliation and knowledge transfer, and mentored other team members.


Sr. ETL Consultant

Environment: Hadoop, Informatica 9.01, Teradata-13, Ab Initio, UNIX, DB2.


  • Gathered requirements for the RDM project, which involved implementing EDW data quality fixes and the Retail data mart.
  • Prepared the functional and technical specification design document for building the Member Data Mart according to the ICDW Banking Model.
  • Responsible for data gathering from multiple sources like Teradata, Oracle.
  • Created Hive tables to store the processed results in a tabular format.
  • Wrote MapReduce jobs in Java to process the log data.
  • Implemented external and managed tables using Hive.
  • Worked with the Teradata analysis team, using Big Data technologies, to gather the business requirements.
  • Fixed erroneous data as part of the data reconciliation process.
  • Used partitioning and bucketing for performance optimization in Hive.
  • Responsible for delivering the Informatica artifacts for the Mart Specific Semantic Layer for subject areas such as Reference, Third Party, Involved Party, Event and Customer.
  • Reviewed deliverables and ensured code quality before delivery to the client through code review and testing.
  • Involved in implementing Kimball's methodology, OLAP, SCDs (Type 1, Type 2 and Type 3), star schema and snowflake schema.
  • Involved in understanding the existing EDW process of the Retail Business and implementing the components in the ICDW.
  • Prepared and successfully implemented automated UNIX scripts to execute the end-to-end history load process.
  • Prepared the design for the Tivoli job execution tool to run the Membership Reporting data mart in the production environment.
  • Managed versioning of the mappings, scripts and documents in the SCM version control tool.


Software Consultant

Environment: Java, Informatica 8.6, Oracle 10g, JavaScript, Spring, SQL, Perl, PL/SQL, Python, Shell scripting, Windows, UNIX, MSSQL.


  • Interacted with clients and business users. Involved in requirement gathering and impact analysis.
  • The Monthly Billing Process (MBP) application runs every month and includes processes for creating customer invoices and usage reports.
  • Fiberlink invoices customers for Connectivity Charges (Dial, Wi-Fi and Broadband), Software License Fee (E360 and Third Party Apps), Monthly Recurring Charges (Custom Reports, Managed Services).
  • It consists of a set of automated and manual steps managed by the billing team and Finance team.
  • Member of a development team responsible for design, development and testing of server-based software that provides secure mobile workforce solutions.
  • Created technical design draft documentation.
  • Enrollment is the process of registering a handheld device with the portal; management of a device includes collecting information about it.
  • The enrollment part of the application can notify users by SMS and email.
  • End users use these requests to register their devices with the portal so that the administrator can manage them. There are also a few usability features, such as QR code integration in the emails sent to users, so that they can register a device with little effort.
  • Created conceptual, logical and physical data models.
  • Written Procedures, Functions & Triggers for different operations.
  • Performed performance tuning of queries and scripts.
  • Used BCP to import data to stage tables.
  • Provided post-production user support.
  • Coordinated with the Quality Assurance/Testing team members to perform both SIT and UAT testing.
