Big Data/Hadoop Lead Resume

New York, NY


  • Over 9 years of professional IT experience, including more than 5 years of Hadoop experience, processing large sets of structured, semi-structured and unstructured data and supporting systems application architecture
  • Extensively involved in all phases of Software Development Life Cycle with experience in Data Warehousing and Big Data Analytics projects
  • Experience in designing, developing and implementing complete Big Data solutions using Hadoop ecosystem projects such as HDFS, MapReduce, Spark, Kafka, Hive, HBase, Pig, ZooKeeper, Sqoop, Flume, Oozie and Impala, as well as AWS
  • Extensively worked with Hadoop distributions like Cloudera, EMR and BigInsights
  • Actively participated with Architects and Admins in infrastructure capacity planning, considering various workload patterns and designing hardware for Edge, Master and Data nodes
  • Experience in installing and configuring Hadoop distributions on clusters of more than 40 nodes
  • Good knowledge of Hadoop Architectures, MapReduce v1.0 and v2.0, and YARN
  • Extensively worked with Spark using Scala on a cluster for computational analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle
  • In-depth knowledge of distributed computing on the Hadoop platform and of Hadoop features such as fault tolerance, high availability, disaster recovery, manual and automatic failover, and rack awareness
  • Designed, Developed and Deployed application on AWS Cloud using AWS Services including EC2, IAM, S3, S3 Glacier and EMR
  • Extensive experience in developing PIG Latin Scripts and using Hive Query Language for data analytics.
  • Designed and developed partitioned and bucketed Hive tables in different file formats such as Text File, Parquet, Avro, RC and ORC
  • Extensively worked on debugging performance bottlenecks using EXPLAIN plans and tuned Hive queries using various techniques
  • Experience in importing and exporting batch data between RDBMS and HDFS using Sqoop
  • Performance-tuned Sqoop jobs by increasing parallelism and using Sqoop2; also wrote custom queries for import
  • Experience in working with Flume to transfer real-time data. Worked with various Sources, Channels and Sinks
  • Extensively worked with Flume Interceptors, Multiplexing and Replication
  • Designed and developed Oozie workflows to automate the jobs developed in Hadoop ecosystem projects
  • Hands-on experience in writing Spark applications using Python; extensively used RESTful APIs and Spark APIs
  • Extensively used Spark Core, Spark SQL and Spark Streaming to process the high volume data on Hadoop platform
  • Strong knowledge of Spark Architecture, Memory Management and Spark Executions
  • Hands on experience in Analyzing the Spark DAG using Spark WebUI to find the performance bottleneck and Tuning the job using various techniques - Serialization, Memory Configuration, Parallelism and Broadcasting
  • Extensively used integrated development environments such as Eclipse and IntelliJ IDEA to write applications, Git and SVN for version control, and Maven/SBT for builds
  • Extensively worked with Cloudera Manager to administer the Hadoop cluster, monitor and manage services, and create new users and define their privileges
  • Experience in deploying Hadoop clusters on public and private cloud environments such as SoftLayer, Amazon AWS and Google Cloud
  • Strong experience in writing Shell Scripts to develop Sqoop and Hive jobs
  • Deep understanding of Data Warehouse design methodologies recommended by Kimball and Inmon
  • Extensively worked on Designing and Developing Extraction, Transformation and Loading (ETL) processes
  • Expertise in architecting various layers of ETL like Functional Layer, Audit, Balancing and Control Layer and Operational/Management Layer and Common Component Layer
  • Extensively worked with Relational Database Management Systems like Oracle, Teradata, DB2, SQL Server
  • Hands on experience with Writing Stored Procedure, Functions and Complex SQL queries
  • Expertise in interacting with business users and understanding the requirement and providing solutions to match their requirement
  • Excellent communication and inter-personal skills, flexible and adaptive to new environments, self-motivated, team player, positive thinker and enjoy working in multicultural environment.


Confidential, New York NY

Big Data/Hadoop Lead


  • Performed Infrastructure capacity planning for the Hadoop environments
  • Actively worked with Cloud vendor (Amazon) to setup Hadoop cluster as IaaS
  • Architected and Designed Hadoop cluster consisting of Edge nodes, Management nodes and DataNodes
  • Implemented AWS solutions using Virtual Private Cloud (VPC), EC2, S3, EMR and Amazon WorkSpaces
  • Troubleshot AWS EC2 status-check alerts (System Status checks and Instance Status checks) and rectified issues where necessary
  • Responsible for data extraction and data ingestion from different data sources into the Hadoop Data Lake by creating ETL pipelines using Pig and Hive
  • Worked with Big Data technologies such as Hadoop architecture, MapReduce, HDFS, YARN, Tez, Hive, Pig, HBase, Cassandra, CouchDB, MongoDB, Falcon, Atlas, ZooKeeper, Sqoop, Oozie, Flume, Scala, NFS, Kafka, Storm, Spark, Solr, HAWQ, Big SQL, Mahout, etc.
  • Used Amazon Web Services (AWS) such as EC2, S3, CloudWatch and Elastic Beanstalk for code deployment.
  • Provided integration between the on-premises IT environment and the AWS storage infrastructure using storage gateways; integrated data spread across multiple AWS services and analyzed it from a single location using AWS Data Pipeline.
  • Expert in writing SQL queries.
  • Exposure to Data Lake implementation using Apache Spark; developed data pipelines and applied business logic using Spark, and used Scala and Python to convert Hive/SQL queries into RDD transformations in Apache Spark
  • Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark and Scala/Python
  • Involved in migrating Hadoop Java MapReduce programs to Spark Scala APIs
  • Extensively used Cloudera Manager and Cloudera Director to manage cluster nodes, services, administering cluster and assigning users, groups and roles for authorization
  • Hands on Experience of working in Big Data technologies - Hadoop, Hive, Spark (Python and Scala), Kafka on Hortonworks, EMR and HDInsight. Can build from scratch without any vendor
  • Set up and configured SFTP servers to import data from third party vendors
  • Designed and Developed Data ingestion and source data process from various sources using Apache Sqoop, Flume and Hadoop shell commands
  • Designed and Developed Data Pipelines (ETL processes) in Apache Spark using Spark Core, Spark SQL and Spark Streaming
  • Expertise in architecting Big data solutions using Data ingestion, Data Storage
  • Worked with AWS technologies such as EC2, ECS, VPC, Auto Scaling, Security Groups, AWS CLI and SNS, and with containerization (Docker) of applications
  • Created Data Pipelines to move data from Google Cloud to Amazon Redshift database.
  • Extensively used IntelliJ IDEA as the integrated development environment, Git for version control and Maven for the project build
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark)
  • Hands-on experience in writing Spark applications for ETL using various Spark APIs
  • Extensively worked on RDD Lineage, Caching and Distributed persistence
  • Developed a Spark program to parse raw data and store it in a pre-aggregated format.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark SQL in Python
  • Performed transformations, cleaning and filtering on imported data using Hive and MapReduce, and loaded the final data into HDFS. Configured the Hadoop environment in the cloud through Amazon Web Services (AWS) to provide a scalable distributed data solution
  • Expertise in analyzing Spark DAG and Job execution step in Spark Web UI
  • Extensively worked on tuning spark applications using various optimization techniques like Serialization, Memory Management, Broadcasting and Repartitioning
  • Wrote an automated script to create an EMR cluster on AWS, process data and auto-terminate the cluster.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
  • Imported data from different sources such as HDFS and Redshift into Spark RDDs.
  • Expertise in writing queries and transformations in Spark SQL and Spark Core using Python.
  • Experience in using the Oozie workflow scheduler to manage Hadoop jobs
  • Documented all processes, issues and their solutions, resolved errors and user activity in a timely manner
  • Certified seamless integration of third-party tools such as Aginity, Tableau and SAS with Hive and Impala
  • Hands-on experience in working with Agile methodology. Reviewed and estimated Scrum user stories and created tasks in JIRA.
  • Managed the expectations across all the key stakeholders with respect to the overall schedule, due dates, interim milestones, deliverables and dependencies

Environment: Cloudera Enterprise CDH 5.12, Redis, Presto, JSON, Spark 2.2, Scala 11.x, Hadoop, Hive, Tez, HBase, Sqoop, Linux RHEL, Oracle 11g, Shell Programming, Aginity Workbench, Kafka, Tableau, SAS, Oracle SQL Developer

Confidential, Memphis, TN

Big Data Developer


  • Analyzed multiple sources of structured and unstructured data to propose and design data architecture solutions for scalability, high availability, fault tolerance, and elasticity
  • Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark and Scala/Python
  • Estimated the hardware requirements for NameNode and DataNodes and planned the cluster
  • Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters
  • Data Lake Change Data Capture - Designed & Developed Change Data capture capability during data ingestion into Data Lake
  • Worked with AWS Redshift, applying strong SQL and database design/modelling skills in AWS (Redshift/RDS)
  • Implemented Cloudera Distribution for Hadoop on 40 node cluster
  • Designed the Data Lake on Hadoop platform with Raw Zone, Processed Zone, Refined zone
  • Designed and developed a data ingestion framework to ingest data into HDFS using Sqoop, Flume, Kafka, HDFS commands and Spark applications
  • Worked with the Spark ecosystem using Scala, Python and Hive queries on different data formats such as text files and Parquet
  • Modeled the Database in Hive to migrate existing databases from RDBMS into Hive
  • Extensively used Hive Optimization techniques like Table Partitioning, Bucketing and optimized file formats
  • Developed scripts to migrate data from RDBMS to Hive and validated all the data after successful migration
  • Developed and optimized Sqoop jobs for one-time data migration
  • Designed and developed data pipelines to transform raw data and load it into the Hive database
  • Developed Spark jobs to read data from file, applied transformation rules on RDD and Dataframes and Loaded the Refined data into Hive Database
  • Extensively used Single RDD Transformations, Multi-RDD Transformations, Pair RDD Transformations and RDD actions
  • Improved the performance of spark applications using various techniques like Broadcast Variable, Accumulators, KryoSerialization, Repartition/Coalesce
  • Developed Unix shell scripts to load a large number of files into HDFS from the Linux file system
  • Designed and implemented Hive queries and functions for evaluation, filtering, loading and storing of data.
  • Created Oozie workflows to schedule jobs

Environment: Cloudera Enterprise CDH 5, Oracle Exadata, Presto, Kafka, Tez, Linux SUSE 11, Oracle 11g, Teradata, IBM Information Server suite, Business Objects, Redis, JSON, Erwin 8.x, Visio 2010

Confidential, Glendale, WI

Big Data/Sr. ETL Developer


  • Involved in understanding of business processes and coordinated with business analysts to get specific user requirements
  • Hands-on experience in designing and implementing solutions using Hadoop, MapReduce, HBase, Hive, Oozie
  • Prepared source to target mapping document according to business requirement
  • Prepared low level design document for ETL Process
  • Created ETL best practices and standard document
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Implemented design patterns in Scala for the application.
  • Documented ETL test plans, test cases, test scripts, and validations based on design specifications for unit testing, system testing, functional testing, prepared test data for testing, error handling and analysis
  • Used appropriate partitioning method in DataStage jobs
  • Implemented type 2 slowly changing dimension
  • Extensively used DataStage stages like Row Generator, Column Generator, Head, and Peek for development and de-bugging purposes
  • Extensively worked with sequential file, dataset, file set and look up file set stages
  • Extensively worked with Join, Look up (Normal and Sparse) and Merge stages
  • Created a parallel job to implement business rules and transformations
  • Implemented de-duplication and duplicate-check logic to remove duplicates
  • Defined Stage variables for data validations and data filtering process
  • Parameterized DataStage jobs and also created multi-instance jobs to achieve reusability
  • Extensively worked with DataStage Job Sequences to control and execute DataStage jobs and job sequences using various Activities and Triggers
  • Extensively worked with Job sequences using Job Activity, Email Notification, Sequencer, Wait for File activities to control and execute the DataStage Parallel jobs
  • Extensively wrote Routines and Transformer functions
  • Worked on performance tuning to address very critical and challenging issues
  • Used job monitor, score dump, Peek stage, performance analysis and resource estimation to tune DataStage parallel jobs
  • Implemented an audit, balance and control table for the ETL process
  • Used DataStage Director and the runtime engine to schedule and run server jobs, monitor scheduling and validate job components
  • Created multiple configuration files and defined logical nodes, scratch disk, Resource scratch disk and pools.
  • Wrote complex queries to extract and validate data
  • Extensive experience in writing Transact-SQL (DDL/DML) queries
  • Experience in creating database objects such as tables, indexes, views, triggers, stored procedures and user-defined functions
  • Extensively wrote complex SQL using joins, subqueries and functions
  • Experience in creating indexes to improve query performance
  • Experience in designing, creating and processing cubes using SSAS
  • Experience in setting up database connectivity from DataStage to source/target database servers
  • Wrote UNIX shell scripts for parsing datasets and transferring files over FTP
  • Defined UNIX shell scripts for the file watcher and file archiving processes
  • Extensive experience in SVN
  • Developed a UNIX script to integrate IIS and SVN to commit ETL code to the SVN repository
  • Developed Autosys JIL to schedule DataStage jobs
  • Developed Spark jobs using Scala and Python on top of YARN/MRv2 for interactive and batch analysis
  • Migrated jobs from development to QA to Production environments
  • Involved in component integration testing, system integration testing and UAT
  • Responsible for the creation and the maintenance of Analysis Services objects such as cubes, dimensions and Measures
  • Scheduled Cube Processing from Staging Database Tables using SQL Server Agent
  • Optimized cubes for better query performance
  • Identified the Measures and Dimensions from the Excel Sheet
  • Involved in release activity
  • Created detailed tasks for release activities
  • Provided 24/7 production support in time-pressured business environments

Environment: WebSphere DataStage 9.1, 8.5, DB2 UDB, Teradata, SQL Server 2008/2010, SSAS, MS Access, UNIX/AIX, MS Office Suite, Erwin 4.1, Kafka, MS Visio 2010, Toad, Autosys, SVN

Confidential, Columbus, OH

Sr. DataStage ETL Developer


  • Interacted with business analyst to understand the business requirements and in identifying data sources
  • Involved in understanding the scope of application, present schema, data model and defining relationship within and between the groups of data
  • Involved in creating specifications for ETL processes, finalized requirements and prepared specification document
  • Prepared source-to-target mapping documents with transformation rules
  • Converted Logical Mapping Document into source to target Physical Mapping Document for ETL
  • Designed and Developed DataStage Jobs to Extract data from heterogeneous sources, Applied Transform Logics to extracted data and Loaded into Data Warehouse and Datamart
  • Extensively worked to load data into Teradata using Teradata utilities (BTEQ, FASTLOAD, FASTEXPORT, MULTILOAD, and TPUMP)
  • Implemented star schema and snowflake schema dimensional models
  • Developed DataStage jobs to implement slowly changing dimension
  • Imported Metadata from various Application Sources (Database tables, SalesForce.com, flat files, XML files) into DataStage
  • Defined stage variables for data validations and data filtering process
  • Extensively used parallel stages like Row Generator, Column Generator, Head, and Peek for development and de-bugging purposes
  • Extensively worked with Surrogate key generator to generate surrogate key
  • Created multiple configuration files with multiple nodes
  • Developed re-usable components using shared containers for local use or shared use
  • Parameterized DataStage jobs and also created multi-instance jobs to achieve reusability
  • Extensively wrote Routines and Transformer functions
  • Extensively worked with DataStage Job Sequences to control and execute DataStage jobs and job sequences using various Activities and Triggers
  • Extensively used sequence job stages like job activity, Email Notification, Sequencer, Wait for File activities, start loop - end loop, execute command activity, user variable activity, routine activity etc.
  • Worked on performance tuning to address very critical and challenging issues
  • Created and modified database tables and indexes, and granted INSERT, UPDATE, DELETE and SELECT permissions
  • Developed UNIX shell scripts to run DS jobs, FTP files and rename files
  • Developed complex stored procedures and queries using temp tables and joins
  • Designed and Developed data validation, load processes, test cases, and error control routines using PL/SQL
  • Performed Unit testing, Integration testing, System testing of DataStage Jobs and sequences. Also created test cases
  • Migrated jobs from development to QA to Production environments
  • Used Autosys job scheduler for automating delta run of DW cycle in both production and UAT environments
  • Coordinated with the SIT and UAT teams to fix Test Problem Reports
  • Excellent communication, interpersonal, analytical skills and strong ability to perform as part of a team

Environment: WebSphere DataStage 8.1, 8.0 (Parallel Extender), SalesForce.com, Teradata, DB2 UDB 9.0, Oracle 10g, MS Office Suite, UNIX/AIX, Erwin 4.1, MS Visio, Toad, MS Access

Confidential, Malvern, PA

Sr. DataStage Developer


  • Used IBM Datastage Designer to develop jobs for extracting, cleaning, transforming and loading data into data marts/data warehouse.
  • Developed several jobs to improve performance by reducing runtime using different partitioning techniques.
  • Used different stages of Datastage Designer like Lookup, Join, Merge, Funnel, Filter, Copy, Aggregator, and Sort etc.
  • Read complex flat files from the mainframe using the Complex Flat File stage.
  • Sequential File, Aggregator, ODBC, Transformer, Hashed-File, Oracle OCI, XML, Folder, FTP Plug-in Stages were extensively used to develop the server jobs.
  • Used the EXPLAIN PLAN statement to determine the execution plans chosen by Oracle Database.
  • Worked on complex data coming from mainframes (EBCDIC files), with knowledge of Job Control Language (JCL).
  • Used COBOL copybooks to import metadata from mainframes.
  • Designed Datastage jobs using Quality Stage stages in 7.5 for the data cleansing and data standardization process. Implemented the Survive stage and Match stage for data patterns and data definitions.
  • Staged the data coming from various environments in a staging area before loading it into data marts.
  • Involved in writing Test Plans, Test Scenarios, Test Cases and Test Scripts and performed the Unit, Integration, system testing and User Acceptance Testing.
  • Used stage variables for source validations, to capture rejects and used Job Parameters for Automation of jobs.
  • Strong knowledge in creating procedures, functions, sequences, triggers.
  • Expertise in PLSQL/SQL.
  • Performed debugging and unit testing and System Integrated testing of the jobs.
  • Wrote UNIX shell script according to the business requirements.
  • Wrote customized server/parallel routines according to complexity of the business requirements.
  • Designed strategies for archiving of legacy data.
  • Created shell scripts to perform validations and run jobs on different instances (DEV, TEST and PROD).
  • Created & Deployed SSIS (SQL Server Integration Services) Projects, Schemas and Configured Report Server to generate reports through SSRS SQL Server 2005.
  • Created ad hoc reports using MS SQL Server Reporting Services for business users.
  • Used SQL Profiler to monitor the server performance, debug T-SQL and slow running queries.
  • Expertise in developing and debugging indexes, stored procedures, functions, triggers, cursors using T-SQL.
  • Wrote mapping documents for all the ETL Jobs (interfaces, Data Warehouse and Data Conversion activities).

Environment: IBM WebSphere DataStage and QualityStage 7.5, Ascential DataStage 7.5/EE (Parallel Extender), SQL Server 2005/2008, Linux, Teradata 12, Oracle 10g, Sybase, PL/SQL, Toad, UNIX (HP-UX), Cognos 8 BI


DataStage Developer


  • Designed and developed mappings between sources and operational staging targets, using Star and Snow Flake Schemas.
  • Provided data models and data maps (extract, transform and load analysis) of the data marts for systems in the aggregation effort.
  • Involved in Extracting, cleansing, transforming, integrating and loading data into data warehouse using Datastage Designer.
  • Developed various transformations based on customer last name, zip code for internal business analytical purposes, loaded warehouse based on customer credit card number with dynamic data re-partitioning.
  • Developed user defined Routines and Transformations by using Universe Basic.
  • Used Datastage Manager for importing metadata from repository, job categories and creating data elements.
  • Used the Datastage Director and the runtime engine to schedule running the solution, testing and debugging its components and monitoring the resulting executable versions (on adhoc or scheduled basis).
  • Developed, maintained programs for scheduling data loading and transformations using Datastage and Oracle
  • Developed Shell scripts to automate file manipulation and data loading procedures.

Environment: Datastage 5.2/6.0, Oracle 8i, SQL, TOAD, UNIX, Windows NT 4.0.
