Datawarehouse Engineer Resume

Chicago, IL


  • Excellent understanding/knowledge of Hadoop architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node and MapReduce programming paradigm.
  • 8 years of experience in the IT industry, including 4 years of experience in Big Data technologies.
  • Hands on experience on major components in Hadoop Ecosystem like Hadoop MapReduce, HDFS, HIVE, PIG, HBase, Sqoop, Oozie and Flume.
  • Excellent knowledge of HIPAA standards, EDI (Electronic Data Interchange), transaction syntax such as ANSI X12, implementation and knowledge of HIPAA code sets, ICD-9 and ICD-10 coding, and HL7.
  • Experience with CDH distribution and Cloudera Manager to manage and monitor Hadoop clusters.
  • Expertise in working with different kinds of data files such as XML, JSON, Parquet and Avro, and with databases.
  • Experience in shell and Python scripting languages.
  • Extensive experience in developing Pig Latin Scripts for transformations and using Hive Query Language for data analytics.
  • Good experience working on both Hadoop distributions: Cloudera and Hortonworks.
  • Involved in Spark-Cassandra data modeling.
  • Worked on Apache FLUME distributed service.
  • Experience in Importing and exporting data from different databases like MySQL, Oracle, Teradata into HDFS and vice-versa using Sqoop.
  • Hands-on experience working with Spark using both Scala and Python.
  • Performed various actions and transformations on Spark RDDs and DataFrames.
  • Experience with Oozie Workflow Engine in running workflow jobs with actions that run Hadoop Map/Reduce and Pig jobs.
  • Implemented Oozie workflows using Sqoop, pig, hive, shell actions and Oozie coordinator to automate tasks.
  • Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster processing of data.
  • Hands-on experience in writing MapReduce jobs in Java, Pig and Python, including MapReduce programs to analyze data and discover trends in data usage by users.
  • Experienced with batch processing of data sources using Apache Spark.
  • Hands on Experience in Writing Python Scripts for Data Extract and Data Transfer from various data sources.
  • Experienced in performance tuning of Spark applications, including setting the right batch interval, the correct level of parallelism, and memory tuning.
  • Experienced in implementing Spark RDD transformations and actions for business analysis.
  • Used Flume to collect, aggregate and store the web log data onto HDFS.
  • Used Zookeeper for various types of centralized configurations.
  • Extensive knowledge and experience on real time data streaming technologies like Kafka, Storm and Spark Streaming.
  • Designed and implemented complex SSIS package to migrate data from multiple data sources.
  • Developed queries for drill down, cross tab, sub reports and ad-hoc reports using SSRS.
  • Processed data from cubes and SSIS to generate reports by report server in SSRS.
  • Strong ability to understand new concepts and applications.
  • Excellent verbal and written communication skills, proven to be highly effective in interfacing across business and technical groups.


Hadoop Ecosystem Development: HDFS, MapReduce, Hive, Impala, Pig, Oozie, HBase, Sqoop, Flume, YARN, Scala, Spark, Kafka, and ZooKeeper.

Hadoop Distribution System: Cloudera, Hortonworks.

Languages: PL/SQL, Transact-SQL, SQL, C/C++, Java, Scala, Python.

Scripting: Pig Latin.

Database: Teradata, MySQL, MS SQL, Hive.

NoSQL Database: Apache HBase, MongoDB, Cassandra.

ETL Tools: Apache Pig, Pentaho Kettle and Tableau.

Web Design Tools: HTML, DHTML, REST, AJAX, JavaScript, JQuery and CSS, AngularJS, ExtJS and JSON.

Frameworks: MVC, Struts, Hibernate and Spring.

Operating Systems: Linux (CentOS, Ubuntu), Unix, Windows 7/Vista/XP/2000/NT, Server 2012/2008R2, Mac.


Confidential, Chicago, IL

Datawarehouse Engineer


  • Analyzed all the tables in the database and listed the classified columns. Created hashing algorithms in Python to hash those columns.
  • Created user views on Teradata to enforce data abstraction on Hive tables.
  • Used Teradata Parallel Transporter (TPT) to load data from Hive/HDFS to Teradata and vice versa.
  • Worked on loading data from several flat-file sources to staging using MLOAD and FLOAD.
  • Used Flume to collect, aggregate, and store the web log data from different sources like web servers and pushed to HDFS.
  • Successfully loaded files to Hive and HDFS from SQL Server using Sqoop.
  • Performed data cleansing using Python and loaded the results into the target tables.
  • Managed and scheduled jobs through the Opswise scheduler.
  • Worked on the ETL process: imported data from various data sources and performed transformations.
  • Changed the existing ETL pipeline to become GDPR compliant.
  • Optimized Teradata query and ETL jobs to reduce the pipeline time by 30%.
  • Developed YAML scripts to automate data pipelines.
  • Developed Hive queries for analyzing data in the Hive warehouse using HiveQL.
  • Worked with NoSQL databases like HBase for creating HBase tables to load large sets of semi structured data coming from various sources.
  • Developed Python scripts for automatic purging of data on Hadoop clusters.
  • Collected JSON data from an HTTP source and developed Spark APIs that perform inserts and updates in Hive tables.
  • Developed a Scala script to read all the Parquet data in HDFS and write it out as ORC files, and another script to load it into structured Hive tables.
  • Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.
  • Designed a Rights Track as part of GDPR and developed a data pipeline using Teradata and Hive for the Rights Track.
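
The column hashing mentioned in the first bullet above can be pictured roughly as follows. This is only a minimal sketch: the resume does not say which algorithm or columns were involved, so the salted SHA-256 digest, the `CLASSIFIED_COLUMNS` set, and the function names are all assumptions made for illustration.

```python
import hashlib

# Hypothetical set of classified columns; the resume does not name them.
CLASSIFIED_COLUMNS = {"email", "ssn"}


def hash_value(value, salt):
    """Return a salted SHA-256 hex digest of a single value (assumed scheme)."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


def hash_classified(row, salt, columns=CLASSIFIED_COLUMNS):
    """Return a copy of the row with classified columns replaced by digests.

    NULL (None) values are left untouched so that missing data stays missing.
    """
    return {
        key: hash_value(val, salt) if key in columns and val is not None else val
        for key, val in row.items()
    }


# Example row with made-up values.
row = {"id": 7, "email": "user@example.com", "ssn": None}
masked = hash_classified(row, salt="demo-salt")
```

Because the digest is deterministic for a given salt, equal inputs still hash to equal outputs, so hashed columns can continue to be used as join keys.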

Environment: Hadoop, HDP, MapReduce, Hive QL, MySQL, Teradata, TPT, SQL, HBase, HDFS, HIVE, Impala, PIG, Sqoop, Oozie, Apache Spark, Python, Scala, Zookeeper, Hue, Opswise, YAML, UNIX.

Confidential, Rochester, MN

Spark Developer


  • Created and enforced policies to achieve HIPAA compliance.
  • Monitored system health and logs and responded accordingly to any warning or failure conditions.
  • Helped the team to increase Cluster from 25 Nodes to 40 Nodes.
  • Involved in maintaining various Unix Shell scripts.
  • Migrated 160 tables from Oracle to HDFS and HDFS to Cassandra using Apache Spark.
  • ETL off-loading from Teradata to HDFS.
  • Implemented Python code for retrieving the Social Media data.
  • Loaded and transformed large sets of structured, semi-structured and unstructured data.
  • Developed a data pipeline using Kafka and Storm to store data into HDFS.
  • Involved in importing the real-time data to Hadoop using Kafka and implemented the Oozie job for daily imports.
  • Automated all the jobs starting from pulling the Data from different Data Sources like MySQL to pushing the result set Data to Hadoop Distributed File System using Sqoop.
  • Imported data from different sources like HDFS/HBase into Spark RDDs.
  • Involved in managing and reviewing Hadoop log files.
  • Involved in running Hadoop streaming jobs to process terabytes of text data.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
  • Used Spark Streaming to divide streaming data into batches as an input to Spark engine for batch processing.
  • Developed HIVE queries for the analysts.
  • Developed counters on HBase data to count total records on different tables.
  • ETL Data Cleansing, Integration & Transformation using Pig scripts for managing data from disparate sources.
  • Used HBase to store majority of data which needs to be divided based on region.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Developed Spark Code using Python for faster processing of data.
  • Used Coalesce and repartition on data frames while optimizing the Spark jobs.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Explored Spark for improving the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and YARN.
  • Performed performance tuning for Spark Streaming, e.g. setting the right batch interval, the correct level of parallelism, selection of the correct serialization, and memory tuning.
  • Analyzed business requirements and data sources from Excel, Oracle, and SQL Server for design, development, testing, and production rollover of reporting and analysis projects within Tableau.
  • Used Zookeeper for various types of centralized configurations.
  • Implemented Fair schedulers on the Job tracker to share the resources of the Cluster for the MapReduce jobs given by the users.
  • Worked on creating Kafka topics and partitions and writing custom partitioner classes.
  • Worked on creating Kafka adaptors for decoupling the application dependency.
  • Worked on creating custom ETL scripts using Python for business related data.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports.
  • Integrated BI tool with Impala for visualization.
  • Worked in Agile development environment in sprint cycles of two weeks by dividing and organizing tasks.

Environment: Hadoop, CDH 5, MapReduce, Hive QL, MySQL, HBase, HDFS, HIVE, Impala, PIG, Sqoop, Oozie, Flume, Apache Spark, Python, Scala, Cloudera, Zookeeper, Hue Editor, Oracle 11g, PL/SQL, UNIX, Tableau.

Confidential, Golden Valley, MN

Data Analyst


  • Created SSIS Packages to import the data from Oracle databases, XML, text files, Excel files.
  • Developed queries for drill down, cross tab, sub reports and ad-hoc reports using SSRS.
  • Wrote optimized SQL queries in SQL Query Analyzer for efficient handling of huge loads of data.
  • Created views to restrict access to data in a table for security purposes.
  • Created SSIS package to extract, validate and load data into Data warehouse.
  • Developed backup and restore scripts for SQL Server as needed.
  • Designed and implemented a database maintenance plan.
  • Configured job scheduling, batch processing, alerts and e-mail notifications.
  • Involved in start to end process of Hadoop cluster installation, configuration and monitoring.
  • Data migration from existing data stores to Hadoop.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into Hadoop Distributed File System (HDFS), and moved data from MySQL into HDFS and vice versa using Sqoop.
  • Involved in loading data from UNIX file system to HDFS.
  • Involved in managing and reviewing Hadoop log files.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries and views using HiveQL.
  • Designed a data warehouse using Hive.
  • Developed Pig UDFs to pre-process the data for analysis.
  • Supported Data Analysts in running MapReduce Programs.
  • Worked on Performance tuning on MapReduce jobs.
  • Worked on Cloudera distribution system for running Hadoop jobs on it.
  • Worked on analyzing Data with HIVE and PIG.
  • Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
  • Used Tableau to create reports representing analysis in graphical format.
  • Created customized reports and processes in Tableau Desktop.
  • Worked on data analysis and sent the reports to clients on a daily basis.
  • Worked with team and collaborated to meet project timelines.

Environment: CDH 3, PL/SQL, MySQL, SQL Server 2008 (SSRS & SSIS), Hadoop, MapReduce, HDFS, Pig, Hive, Sqoop, Java, UNIX, Tableau.


SQL Developer


  • Create database objects including tables, triggers, views, stored procedures, indexes, defaults and rules.
  • Tuning and optimizing queries and indexes.
  • Monitor SQL Server log files.
  • Converted and loaded data from different databases and files.
  • Perform optimization of SQL queries in SQL Server and Sybase system.
  • Create jobs and monitor job history for maximum availability of data and to ensure consistency of the databases.
  • Create and maintain databases after logical and physical database design.
  • Maintain Disaster recovery strategies for the database and fail-over methods.
  • Monitored the performance of Database Server.
  • Maintenance of clustered and non-clustered indexes.
  • Monitor server space usage and generate reports.
  • Worked with SharePoint Server 2007 while deploying SSRS reports.
  • Create and maintain database for Incident, Problem Tracking, and Metrics.
  • Created packages in SSIS with error handling and mapping using different tasks in the designer.
  • Designed and implemented complex SSIS package to migrate data from multiple data sources.
  • Used the transformations such as Merge, Data Conversion, Conditional Split and Multicast to distribute and manipulate data to the destination in SSIS.
  • Processed data from cubes and SSIS to generate reports by report server in SSRS.

Environment: PL/SQL, MySQL, SQL Server 2008 (SSRS & SSIS), Visual Studio, MS Excel.


Programmer Analyst/ SQL Developer


  • Developed SQL scripts to perform different joins, subqueries, nested queries and inserts/updates, and created and modified stored procedures, triggers, views and indexes.
  • Responsible for maintaining databases.
  • Performed intermediate queries using SQL, including Inner/Outer/Left Joins and Union/Intersect.
  • Responsible for implementing and monitoring database systems.
  • Designed and modified physical databases with development teams.
  • Worked with Business Analysts and Users to understand the requirement.
  • Responsible for designing advanced SQL queries, procedures, cursors and triggers.
  • Built data connections to the database using MS SQL Server.
  • Worked on a project to extract data from XML files into SQL tables and generate data file reporting using SQL Server 2008.
  • Created Drill-through, Drill-down, Cross Tab Reports, Cached reports and Snapshot Report to give all the details of various transactions like closed transactions, pending approvals and summary of transactions and scheduled this report to run on monthly basis.
  • Created reports and designed graphical representations of analyzed data using reporting tools.
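
The join and set operations named in the bullets above (inner and left joins, union, intersect) can be illustrated with a small, self-contained sketch. It uses Python's built-in `sqlite3` module rather than SQL Server, and the `customers` and `orders` tables and their rows are hypothetical, made up only for this example.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 'closed'), (11, 1, 'pending'), (12, 2, 'closed');
""")

# INNER JOIN: only customers that have at least one order.
inner = conn.execute("""
    SELECT c.name, o.id
    FROM customers c
    INNER JOIN orders o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# LEFT JOIN: every customer, with an order count of 0 when there is no match.
left = conn.execute("""
    SELECT c.name, COUNT(o.id)
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

# INTERSECT: customers with both a closed and a pending order.
both = conn.execute("""
    SELECT customer_id FROM orders WHERE status = 'closed'
    INTERSECT
    SELECT customer_id FROM orders WHERE status = 'pending'
""").fetchall()

# UNION: customers with a closed or a pending order, without duplicates.
either = conn.execute("""
    SELECT customer_id FROM orders WHERE status = 'closed'
    UNION
    SELECT customer_id FROM orders WHERE status = 'pending'
    ORDER BY customer_id
""").fetchall()

conn.close()
```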

Environment: MS SQL Server 2008/2005, SQL Server Integration Services 2008, SQL Server Analysis Services 2008, MS Visual Windows 2003/2000.
