
Sr. Hadoop Developer Resume


Pittsburgh, PA

SUMMARY:

  • Over 11 years of experience in full life cycle Data Warehousing projects, including 3+ years of experience designing and implementing complete end-to-end Big Data solutions using Hadoop ecosystem tools including HDFS, MapReduce, Hive, Pig, HBase, ZooKeeper, Sqoop, Flume, Oozie, Spark and Kafka
  • Performed Infrastructure capacity planning considering various workload patterns and actively participated in hardware design for Edge, Master and DataNodes
  • Experience in installing and configuring Hadoop clusters using different distributions of Apache Hadoop such as the Cloudera distribution (CDH) and Amazon EMR
  • Good knowledge of Hadoop Architecture and various Hadoop 1.0 and Hadoop 2.0 components such as HDFS, MapReduce, NameNode, Secondary NameNode, Standby NameNode, DataNode, Job Tracker, Task Tracker, Resource Manager, Application Master, Node Manager and YARN
  • In depth exposure to HDFS concepts, blocks, master nodes, slave nodes, HDFS high-availability including both manual and automatic failover, command line interface, file reading and writing, copying a file from one cluster to another
  • Experience in analyzing data using Hive QL, Pig Latin and custom MapReduce programs in Java
  • Designed and modeled Hive databases using partitioned and bucketed tables, storing data in various file formats such as Parquet, Avro, RC, ORC and Text File
  • Experience in importing and exporting batch data using Sqoop and real-time data using Flume from Relational Database Systems to HDFS and vice-versa
  • Extensively worked with the job/workflow scheduling tool Oozie and designed both time-driven and data-driven automated workflows
  • Created custom UDFs for Hive and Pig
  • Experience in monitoring infrastructure for Hadoop cluster using Nagios and Ganglia
  • Good working experience writing Spark applications in Scala using Spark Context, Spark SQL and Spark Streaming (a short sketch follows this list)
  • Experienced in using Integrated Development Environments like Eclipse and IntelliJ IDEA, Version Control tools like SVN and Git, the TeamCity CI server, and Build tools like Maven and SBT
  • Worked on migrating RDBMS databases into different NoSQL databases
  • Experience in cluster maintenance processes, including HDFS tasks such as commissioning and decommissioning DataNodes, file system checks and balancing the cluster, as well as MapReduce tasks including commissioning and decommissioning a Task Tracker, killing Task Trackers and Job Trackers, and dealing with blacklisted Task Trackers
  • Analyzed clients' existing Hadoop infrastructure to understand performance bottlenecks and provided performance tuning accordingly
  • Experience in deploying Hadoop cluster on Public and Private Cloud Environment like Amazon AWS
  • Knowledge of Apache GridGain.
  • Experience in PL/SQL programming, including SQL queries, stored procedures and triggers in Oracle and SQL Server using TOAD and Query Manager
  • Experience in integration of various data sources like Oracle 9i/10g, DB2, Teradata, SQL Server, MS Access, XML and Flat files
  • Extensively worked on Data Warehouse design methodologies (Top-Down, Bottom-Up)
  • Solid experience in design and development of Extraction, Transformation and Load (ETL) and Business Intelligence processes
  • Excellent technical, communication, analytical and problem-solving skills, and the ability to work well with people from cross-cultural backgrounds.
  • Extensively worked in leading ETL design, development, testing and implementation phases. Created logical and physical mapping documents and low-level/high-level ETL design documents using InfoSphere Information Server business suites.
  • Experience working with Dimensional Modeling using Star Schema and Snowflake Schema. Extensively worked in creating Conceptual, Logical and Physical Data Models and deployed them on target databases using Erwin.
  • Experience working wif Cognos Framework Manager, Reporting Studio, Analysis Studio, Query Studio and Cognos Connections
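
As referenced in the Spark bullet above, the following is a minimal, illustrative Scala sketch of writing data out as a partitioned, Parquet-backed Hive table, one of the Hive storage patterns listed above. It uses the current SparkSession API rather than the Spark 1.x SparkContext/HiveContext named above, and the table, column and path names (warehouse.claims, claim_year, hdfs:///data/raw/claims) are assumptions for illustration only, not taken from any actual project.

    // Minimal sketch: land raw CSV data as a partitioned, Parquet-backed Hive table.
    // All names and paths below are hypothetical placeholders.
    import org.apache.spark.sql.{SaveMode, SparkSession}

    object PartitionedParquetWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("partitioned-parquet-sketch")
          .enableHiveSupport()               // register the result in the Hive metastore
          .getOrCreate()

        val claims = spark.read
          .option("header", "true")
          .csv("hdfs:///data/raw/claims")    // hypothetical input path

        claims.write
          .mode(SaveMode.Overwrite)
          .partitionBy("claim_year")         // one HDFS directory per partition value
          .format("parquet")
          .saveAsTable("warehouse.claims")   // Parquet-backed table visible to Hive

        spark.stop()
      }
    }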

TECHNICAL SKILLS:

Cloudera Enterprise: CDH 5.4, Hadoop 2.6, Hive 1.1, HBase 1.0, Sqoop 2.0, Spark 1.3, Linux RHEL 6.x, Shell Programming, Aginity Workbench, Tableau, Git, TeamCity, Impala, SAS, Oracle SQL Developer, Scala

ETL Tools: IBM Information Server 9.1/8.7/8.5/8.1/7.5.2 (EE/Server)/6.0, QualityStage 8.7/8.5/8.1/7.5x, SSIS 2008/2005, SQL*Loader & OWB

Reporting Tools: Cognos 10.2, SSRS, OBIEE

Dimensional Data Modeling: Star Schema Modeling, Snow-Flake Modeling, FACT and Dimensions tables, physical and logical data modeling, Erwin 4.1.2/3.x and Oracle Designer

Programming Languages: C, C++, Java, Scala

Databases: MS SQL Server 2000/2005, Oracle 11g/10g/9i/8, PL/SQL, DB2 UDB, Teradata 12/13

UNIX Tools: Shell Scripts, C Shell, K-Shell Scripts, AWK and VI-editor, Perl Scripting.

GUI: Visual Basic 6.0/5.0/4.0, FrontPage 98/2000, Excel, PowerPoint, Visual InterDev 6.5/6.0 and Visual Age 3.5

Windows: Windows Server 2008, Windows batch scripting

File Type: txt, Avro, Parquet, RC, ORC

PROFESSIONAL EXPERIENCE:

Confidential, Pittsburgh, PA

Sr. Hadoop Developer

Responsibilities:

  • Responsible for developing efficient MapReduce programs on the AWS cloud for more than 20 years' worth of claim data to detect and separate fraudulent claims.
  • Worked with the advanced analytics team to design fraud detection algorithms and then developed MapReduce programs to efficiently run the algorithms on huge datasets.
  • Ran data formatting scripts in Python and created terabyte-scale CSV files to be consumed by Hadoop MapReduce jobs.
  • Performed data analysis, feature selection and feature extraction using Apache Spark's machine learning and streaming libraries in Python.
  • Involved in administration, installing, upgrading and managing CDH3, Pig, Hive & HBase.
  • Played a key role in setting up a 50-node Hadoop cluster utilizing Apache Spark by working closely with the Hadoop Administration team.
  • Created Hive tables to store data in HDFS, loading data and writing Hive queries that run internally as MapReduce jobs.
  • Uploaded and processed terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop and Flume.
  • Involved in Cluster coordination services through Zookeeper.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Played a key role in installation and configuration of the various Hadoop ecosystem tools such as Solr, Kafka, Pig, HBase and Cassandra.
  • Implemented various Hive optimization techniques such as dynamic partitions, bucketing, map joins and parallel execution.
  • Scheduled multiple Hive and Pig jobs with the Airflow workflow engine using Python.
  • Used Flume to collect log data with error messages across the cluster.
  • Extracted meaningful data from dealer CSV files, text files and mainframe files and generated Python pandas reports for data analysis.
  • Utilized Python to run scripts, generate tables, and reports.
  • Designed and Maintained Oozie workflows to manage the flow of jobs in the cluster.
  • Parsed JSON files through Spark core to extract the schema for production data using Spark SQL and Scala (a short sketch follows this list).
  • Provided upper management with daily updates on project progress, including the classification levels achieved on the data.
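
As referenced in the JSON-parsing bullet above, the following is a minimal Scala/Spark SQL sketch of that step: Spark infers a schema from the raw JSON files and the data is then queried through a temporary view. The input path and field names (events, event_type) are illustrative assumptions, not details of the actual production feed.

    // Minimal sketch: infer a schema from raw JSON files and query them with Spark SQL.
    // Paths and field names are illustrative placeholders.
    import org.apache.spark.sql.SparkSession

    object JsonSchemaExtractSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("json-schema-sketch")
          .getOrCreate()

        // Spark samples the JSON files and infers the schema automatically.
        val events = spark.read.json("hdfs:///data/raw/events/*.json")
        events.printSchema()                 // the extracted schema

        // Expose the data to Spark SQL through a temporary view.
        events.createOrReplaceTempView("events")
        spark.sql("SELECT event_type, count(*) AS cnt FROM events GROUP BY event_type").show()

        spark.stop()
      }
    }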

Environment: Hadoop, HDFS, Pig, Hive, MapReduce, Sqoop, Kafka, CDH3, Cassandra, Python, Oozie, Java Collections, Scala, AWS cloud, SQL, NoSQL, Bitbucket, Jenkins, HBase, Flume, Spark, Solr, Zookeeper, ETL, CentOS, Eclipse.

Confidential, Trenton, NJ

DataStage/ETL Lead

Responsibilities:

  • Provided a use case for a centralized staging area on the Hadoop cluster
  • Provided a proof of concept on a virtual single-node Hadoop instance
  • Performed capacity planning for Hadoop cluster for all environments (Dev, Test, Prod)
  • Planned the Hadoop cluster, selecting hardware and operating system, configuring disks and designing the network
  • Participated in installing and configuring the Cloudera (CDH4) SuSE package on a 40-node cluster
  • Configured Rack awareness and Namenode high availability and federation
  • Performed maintenance of the Hadoop cluster - starting and stopping services, adding and decommissioning data nodes, dealing with replication storms and balancing HDFS block data
  • Designed and implemented a Backup and Recovery strategy for data and NameNode metadata
  • Extensively worked on Hadoop ecosystem products - HDFS, MapReduce, Hive, Impala, HBase and Oozie
  • Extensively worked with HiveQL to create and modify databases, create external and managed partitioned tables, load data into tables and export data (see the sketch after this list)
  • Created Hive views to reduce query complexity and restrict access to data based on conditions
  • Created Hive indexes on columns to speed some operations
  • Used Impala to run some ad-hoc queries
  • Wrote MapReduce programs in Java to process semi-structured XML and unstructured text files
  • Designed and Developed Shell Scripts and Sqoop Scripts to migrate data in and out of HDFS
  • Designed and Developed Oozie workflows to execute MapReduce jobs, Hive scripts, shell scripts and sending email notifications
  • Worked closely with the ETL Lead, Technical Lead, Data Modeler and Business Analysts to understand business requirements, providing expert knowledge and solutions on Data Warehousing and ensuring delivery of business needs in a timely, cost-effective manner
  • Understood the Functional Requirements and Transformations on the source data and prepared Technical Specification
  • Prepared project estimation based on number of jobs in each interface and job complexities
  • Based on the Technical Specification Document, designed and created DataStage jobs to extract, transform and load data from sources into the Data Warehouse and then into Datamarts
  • Loaded the Datamart using PL/SQL only.
  • Worked on pipeline and partitioning parallelism techniques and ensured load balancing of data
  • Deployed different partitioning methods like Hash by field, Round Robin, Entire, Modulus, and Range for bulk data loading
  • Implemented Slowly Changing Dimension Type-2 in DataStage
  • Extensively used DataStage Designer stages such as Aggregator, Transformer, Join, Dataset, Lookup, Funnel, Peek, Pivot
  • Implemented various shared containers for Re-using the business functionality
  • Followed Datastage coding standards and best practices
  • Performed Unit Testing in Data Stage and tracked results in unit Test cases (UTC)
  • Performed extensive performance tuning in Datastage, to handle big volume of data
  • Led the other team members and guided the testing team members in executing the jobs, which helped them develop their knowledge of the tool
  • Performed code fixes in DataStage for defects raised from system testing and by the Business System Analysts (BSAs)
  • Addressed defects on Datastage code added in Quality Center
  • Responsible for data analysis, requirements gathering, source-to-target mapping, process flow diagrams and documentation
  • Created UNIX shell scripts for File Transfer and File validation during parallel job execution
  • Tuned SQL queries for better performance for processing business logic in the database
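
As referenced in the HiveQL bullet above, the following is an illustrative sketch of the external partitioned-table pattern. The statements are issued from Scala through Spark's Hive support purely to keep the examples in one language; the project itself ran equivalent HiveQL directly through the Hive CLI and Impala. Database, table, column and path names are assumptions.

    // Illustrative sketch: an external, partitioned Hive table over files already landed in HDFS.
    // All identifiers and locations are hypothetical.
    import org.apache.spark.sql.SparkSession

    object ExternalPartitionedTableSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("external-hive-table-sketch")
          .enableHiveSupport()
          .getOrCreate()

        spark.sql("CREATE DATABASE IF NOT EXISTS staging")

        // External table: Hive tracks the schema, but the files stay where the loads put them.
        spark.sql("""
          CREATE EXTERNAL TABLE IF NOT EXISTS staging.claims (
            claim_id STRING,
            amount   DECIMAL(12,2)
          )
          PARTITIONED BY (load_date STRING)
          STORED AS PARQUET
          LOCATION 'hdfs:///staging/claims'
        """)

        // Each load run registers its directory as a new partition.
        spark.sql("""
          ALTER TABLE staging.claims
          ADD IF NOT EXISTS PARTITION (load_date = '2015-06-01')
          LOCATION 'hdfs:///staging/claims/load_date=2015-06-01'
        """)

        spark.stop()
      }
    }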

Environment: Hadoop 2.6, Hive 1.1, HBase 1.0, Sqoop 2.0, Spark 1.3, Linux RHEL 6.x, Oracle 11g, Shell Programming, Aginity Workbench, Tableau, TeamCity, SAS, IBM/Ascential DataStage 8.7/8.5, SQL Server 2012, MS Word, MS Access, UNIX, Windows 2007, Erwin 4.1.

Confidential, Pleasanton, CA

DataStage/ETL Lead

Responsibilities:

  • Prepared technical specifications and project estimates.
  • Designed and created DataStage jobs to Extract, transform and load data from source into data warehouse and data marts.
  • Worked on pipeline and partitioning parallelism techniques and ensured load balancing of data.
  • Deployed different partitioning methods like Hash by field, Round Robin, Entire, Modulus, and Range for bulk data loading.
  • Implemented Slowly Changing Dimension Type-2 in DataStage (the sketch after this list illustrates the row-versioning logic).
  • Extensively used DataStage Designer stages such as Aggregator, Transformer, Join, Dataset, Lookup, Funnel, Peek, and Pivot.
  • Implemented various shared containers for re-using the business functionality.
  • Involved in import and export using Data Stage Manager.
  • Responsible for data analysis, requirements gathering, source-to-target mapping, process flow diagrams, and documentation.
  • Created UNIX shell scripts for File Transfer and File validation during parallel job execution.
  • Analyzed the Data Acquisition, Data Integration, Data Transformation, and Data Delivery processes for Performance Assessment.
  • Modified the existing ETL and BI architecture to achieve maximum parallel processing and reduce overhead on database servers.
  • Tuned Database queries using various Oracle tuning techniques.
  • Partitioned and sub-partitioned tables choosing appropriate partition keys in Oracle.
  • Analyzed Job Monitor, Score Dump, Resource Estimation and Performance Analysis to identify bottlenecks in DataStage jobs.
  • Led the implementation of performance improvement recommendations.
  • Used QualityStage for data cleansing.
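
As referenced in the Slowly Changing Dimension bullet above, the following is a purely conceptual Scala sketch of the Type-2 versioning logic: when a tracked attribute changes, the current row is expired and a new versioned row is inserted. The customer fields are made up, and the real implementation was built with DataStage stages rather than hand-written code.

    // Conceptual sketch of Slowly Changing Dimension Type-2 versioning logic.
    // Plain Scala collections stand in for the dimension table; all fields are illustrative.
    import java.time.LocalDate

    case class CustomerDim(custKey: Int, custId: String, address: String,
                           effFrom: LocalDate, effTo: Option[LocalDate], current: Boolean)

    object Scd2Sketch {
      def applyChange(dim: List[CustomerDim], custId: String, newAddress: String,
                      today: LocalDate, nextKey: Int): List[CustomerDim] =
        dim.find(r => r.custId == custId && r.current) match {
          case Some(cur) if cur.address != newAddress =>
            // Expire the current version and append a new one effective today.
            val expired = cur.copy(effTo = Some(today), current = false)
            val fresh   = CustomerDim(nextKey, custId, newAddress, today, None, current = true)
            dim.map(r => if (r eq cur) expired else r) :+ fresh
          case Some(_) => dim // no attribute change: the dimension stays as-is
          case None => // brand-new customer: insert the first version
            dim :+ CustomerDim(nextKey, custId, newAddress, today, None, current = true)
        }

      def main(args: Array[String]): Unit = {
        val dim = List(CustomerDim(1, "C100", "12 Oak St", LocalDate.of(2014, 1, 1), None, current = true))
        applyChange(dim, "C100", "98 Elm Ave", LocalDate.of(2015, 3, 15), nextKey = 2).foreach(println)
      }
    }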

Confidential, Glendale, WI

DataStage/ETL Lead

Responsibilities:

  • Gathered user and business requirements through interviews, surveys, prototyping and observation.
  • Created conceptual model, business process diagrams, process and data flow diagrams and converted business requirements into logical models.
  • Converted logical models into physical models and created database script for implementing physical model into target database.
  • Used ERWIN 4.1/r7/r8 to develop conceptual, logical and physical data models and implemented them into target databases using complete compare and forward engineering.
  • Used Kimball Methodology (BUS Architecture) to design the data marts in dimensional modeling.
  • Created metadata repositories to maintain business, application, Technical and process metadata.
  • Set up four layered ETL architecture including Functional Layer, Operational and Management Layer, Audit Balancing and Control (ABC) layer and Common component layer.
  • Analyzed existing informational sources and methods to identify problem areas and make recommendations for improvement.
  • Created mapping documents for architecture and ETL purpose.
  • Designed and developed DataStage jobs to extract data from heterogeneous sources, applied transformation logic to the extracted data and loaded it into the Data Warehouse database.
  • Installed and configured DataStage on Grid Environment to achieve High Availability project requirement.
  • Migrated projects to new DataStage environment installed on Grid.
  • Modified DataStage jobs to add Grid configuration related parameters and modified DataStage sequences to generate one configuration file at the sequence level.
  • Wrote a routine to perform maintenance of grid to clean configuration files auto-generated by Resource Manager.
  • Designed and Documented actions to be performed in case of Head node failure and Compute node failure on Grid.
  • Designed and Developed Change Data Capture (CDC) process to extract delta data from ODS.
  • Implemented Federated architecture to lookup data from MDM.
  • Used DataStage stages, namely Sequential File, Transformer, Aggregator, Sort, Dataset, Join, Lookup, Change Capture, Funnel, Peek, Row Generator, Slowly Changing Dimension Type 2 and various database stages to develop parallel jobs.
  • Used QualityStage to ensure consistency, removing data anomalies and spelling errors from the source information before it was delivered for further processing.

Confidential, Boston MA

DataStage/ETL Developer

Responsibilities:

  • Developed a relational model for the OLTP system, a denormalized model for the Data Warehouse and dimensional models (Star and Snowflake) for different data marts using Erwin.
  • Identified and documented data sources and transformation rules required to populate and maintain data warehouse content and created source to target Logical Mapping Document.
  • Converted Logical Mapping Document into Source to target Physical Mapping Document for ETL.
  • Imported Metadata from various Application Sources (Database tables, flat files, XML files) into DataStage Repository.
  • Designed and developed DataStage jobs to extract data from heterogeneous sources. Applied business logic to extracted data in transformation and loaded into Data Warehouse and Data mart.
  • Worked on developing processes for cleansing, transforming, extracting, integrating and loading data with DataStage.
  • Created DataStage jobs using different stages like Transformer, Aggregator, Sort, Join, Merge, Lookup, Data Set, Funnel, Remove Duplicates, Copy, Modify, Filter, Change Data Capture, Change Apply, Sample and Surrogate Key.
  • Extensively used DataStage stages like Row Generator, Column Generator, Head, and Peek for development and de-bugging purposes.
  • Extensively used the CDC (Change Data Capture) stage to implement the Slowly Changing Dimension and Fact tables.
  • Extensively worked with Job Sequences using Job Activity, Email Notification, Sequencer and Wait For File activities to control and execute the DataStage parallel jobs.
  • Created multiple configuration files and defined logical nodes, scratch disk, Resource scratch disk and pools.
  • Extensively worked with Join, Lookup (Normal and Sparse) and Merge stages.
  • Extensively worked with Sequential File, Dataset, File Set and Lookup File Set stages.
  • Creation of re-usable components using shared containers for local use or shared use.
  • Used the DataStage Director and its run-time engine to schedule running the solution, testing and debugging its components, and monitoring the resulting executable versions.
  • Parameterized DataStage jobs and also created multi-instance jobs.
  • Used QualityStage Data Rules, Investigate, Match Frequency, Survive and Unduplicate Match to ensure consistency and correct spelling errors in the source information.
  • Developed complex stored procedures using input/output parameters, cursors, views and triggers, and complex queries using temp tables and joins.

Confidential, Richmond, VA

DataStage ETL Developer, Data Modeling

Responsibilities:

  • Collected the information about different Entities and attributes by studying the existing ODS and reverse engineering into Erwin.
  • Defined the Primary keys and foreign keys for the Entities.
  • Defined the query view, index options and relationships.
  • Created the logical schema using ERWIN 4.0 and created dimensional models for building cubes.
  • Designed staging and Error handling tables keeping in view the overall ETL strategy.
  • Assisted in creating the physical database by forward engineering.
  • Extracted data from source systems transformed and loaded into Oracle database according to the required provision.
  • Primary on-site technical lead for data quality projects using Integrity (now known as QualityStage).
  • Created objects like tables, views, materialized views, procedures and packages using Oracle tools like PL/SQL, SQL*Plus and SQL*Loader, and handled exceptions.
  • Involved in database development by creating Oracle PL/SQL Functions, Procedures, Triggers, Packages, Records and Collections.
  • Created views for hiding actual tables and to eliminate the complexity of the large queries.
  • Created various indexes on tables to improve the performance by eliminating the full table scans.
  • Used the DataStage Designer to develop processes for extracting, cleansing, transforming, integrating and loading data into Data Marts.
  • Created source table definitions in the DataStage Repository.
  • Identified source systems, their connectivity, related tables and fields and ensure data suitability for mapping.
  • Generated surrogate IDs for the dimensions in the fact table for indexed and faster access of data.
  • Created hash tables with referential integrity for faster table look-up and for transforming the data representing valid information.
  • Used built-in as well as complex transformations.
  • Used Data Stage Manager to manage the Metadata repository and for import/export of jobs.
  • Implemented parallel extender jobs for better performance using stages like Join, Merge, Sort and Lookup.
  • Created stored procedures to conform to the business rules.
  • Used Aggregator stages to sum the key performance indicators in decision support systems and for granularity required in data warehouse.
  • Created complex reports using the Cognos reporting tool.

Confidential, Malvern, PA

Data Stage Developer

Responsibilities:

  • Identified the Facts and Dimensions using Erwin to represent the Star Schema Data Marts.
  • Responsible for documenting user requirements and translated requirements into system solutions.
  • Developed processes for cleansing, transforming, extracting, integrating and loading data with DataStage.
  • Wrote complex queries to facilitate the supply of data to other teams.
  • Responsible for using different types of Stages such as FTP, Hashed File, Sequential File, Sort Aggregator, Transformer and ODBC to develop different jobs.
  • Used the DataStage Designer to design and develop jobs for extracting, cleansing, transforming, integrating, and loading data into different Data Marts.
  • Defined the data definitions, and created the target tables in the database.
  • Used the DataStage Director to schedule running the solution, testing and debugging its components, and monitoring the resulting executable versions (on an ad hoc or scheduled basis).
  • Used DataStage to transform the data to multiple stages, and prepared documentation.

Confidential, West Milford, NJ

Datastage Developer

Responsibilities:

  • Extensive use of UNIX commands for the sequence jobs; developed DataStage designs and performed execution, testing and deployment on the client server.
  • Designed and developed sequence jobs with stages such as Nested Condition, Execute Command, Job Activity, Sequencer and User Variables Activity to dynamically loop through multiple jobs.
  • Used Datastage Designer for importing the source and target database schemas, importing and exporting jobs/projects, creating new job categories and table definitions.
  • Modified existing jobs as required.
  • Involved in performance tuning of long running jobs.
  • Used DataStage Designer to develop various jobs to extract, cleanse, transform, integrate and load data into the Data Warehouse.
  • Took part in volume testing and quality testing; frequent use of ClearCase version control.
  • Ran and monitored jobs using DataStage Director and checked logs; performed unit testing for the jobs developed.
  • Monitoring all data loads and fixing the errors.

Environment: IBM WebSphere DataStage 8.1, Oracle 9i, DB2, SQL Server 2005, Flat Files, UNIX scripting, Windows 2003, Autosys, QualityStage, ProfileStage.
