Big Data/Hadoop Lead Resume New York NY - Hire IT People

SUMMARY:

9+ years of professional IT experience, including 5+ years of Hadoop experience, in processing large sets of structured, semi-structured and unstructured data and supporting systems application architecture
Extensively involved in all phases of Software Development Life Cycle with experience in Data Warehousing and Big Data Analytics projects
Experience in designing, developing and implementing complete Big Data solutions using Hadoop Ecosystem projects like HDFS, MapReduce, Spark, Kafka, Hive, HBase, Pig, ZooKeeper, Sqoop, Flume, Oozie, Impala, AWS
Extensively worked with Hadoop distributions like Cloudera, EMR and BigInsights
Actively participated with Architects and Admins in Infrastructure capacity planning, considering various workload patterns and designing hardware for Edge nodes, Master nodes and Data nodes
Experience in installing and configuring Hadoop distributions on a cluster of more than 40 nodes
Good knowledge of Hadoop Architectures, MapReduce v1.0 and v2.0, and YARN
Extensively worked on Spark using Scala on a cluster for computational analytics; installed it on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle
In-depth knowledge of Distributed Computing on the Hadoop platform and Hadoop features like Fault Tolerance, High Availability, Disaster Recovery, Manual and Automatic Failover, and Rack Awareness
Designed, Developed and Deployed applications on AWS Cloud using AWS Services including EC2, IAM, S3, S3 Glacier and EMR
Extensive experience in developing Pig Latin scripts and using Hive Query Language for data analytics.
Designed and Developed partitioned and bucketed Hive tables in different file formats like Text File, Parquet, Avro, RC and ORC
Extensively worked on debugging performance bottlenecks using Explain plans and tuned Hive queries using various techniques
Experience in importing and exporting batch data from RDBMS to HDFS and vice versa using Sqoop
Performance tuned Sqoop jobs by increasing parallelism and using Sqoop2. Also wrote custom queries for import
Experience in working with Flume to transfer real-time data. Worked with various Sources, Channels and Sinks
Extensively worked with Flume Interceptors, Multiplexing and Replication
Designed and Developed the Oozie workflows to automate the jobs developed in Hadoop ecosystem projects
Hands-on experience in writing Spark applications using Python; extensively used RESTful APIs and Spark APIs
Extensively used Spark Core, Spark SQL and Spark Streaming to process high-volume data on the Hadoop platform
Strong knowledge of Spark Architecture, Memory Management and Spark Executions
Hands on experience in Analyzing the Spark DAG using Spark WebUI to find the performance bottleneck and Tuning the job using various techniques - Serialization, Memory Configuration, Parallelism and Broadcasting
Extensively used Integrated Development Environments like Eclipse and IntelliJ IDEA to write applications, Git and SVN for version control and Maven/SBT for builds
Extensively worked with Cloudera Manager to administer the Hadoop cluster, monitor and manage the services, create new users and define their privileges
Experience in deploying Hadoop clusters on Public and Private Cloud Environments like SoftLayer, Amazon AWS and Google Cloud
Strong experience in writing Shell Scripts to develop Sqoop and Hive jobs
Deep understanding of Data Warehouse design methodologies recommended by Kimball and Inmon
Extensively worked on Designing and Developing Extraction, Transformation and Loading (ETL) processes
Expertise in architecting various layers of ETL like Functional Layer, Audit, Balancing and Control Layer and Operational/Management Layer and Common Component Layer
Extensively worked with Relational Database Management Systems like Oracle, Teradata, DB2, SQL Server
Hands on experience with Writing Stored Procedure, Functions and Complex SQL queries
Expertise in interacting with business users, understanding their requirements and providing solutions to match them
Excellent communication and inter-personal skills, flexible and adaptive to new environments, self-motivated, team player, positive thinker and enjoy working in multicultural environment.

PROFESSIONAL EXPERIENCE:

Confidential, New York NY

Big Data/Hadoop Lead

Responsibilities:

Performed Infrastructure capacity planning for the Hadoop environments
Actively worked with Cloud vendor (Amazon) to set up Hadoop cluster as IaaS
Architected and Designed Hadoop cluster consisting of Edge nodes, Management nodes and DataNodes
Implemented AWS solutions using EC2, S3, EMR and AWS WorkSpaces within a Virtual Private Cloud (VPC)
Troubleshot AWS EC2 status check alerts (System Status checks and Instance Status checks) and rectified issues where necessary
Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Pig and Hive
Worked with 'Big Data' technologies such as Hadoop architecture, MapReduce, HDFS, YARN, Tez, Hive, Pig, HBase, Cassandra, CouchDB, MongoDB, Falcon, Atlas, ZooKeeper, Sqoop, Oozie, Flume, Scala, NFS, Kafka, Storm, Spark, Solr, HAWQ, Big SQL, Mahout, etc.
Used Amazon Web Services (AWS) like EC2, S3, CloudWatch and Elastic Beanstalk for code deployment.
Provided integration between the on-premises IT environment and the AWS storage infrastructure using storage gateways; integrated data spread across multiple AWS services and analyzed it from a single location using AWS Data Pipeline.
Expert in writing SQL queries.
Exposure to Data Lake implementation using Apache Spark; developed data pipelines and applied business logic using Spark, and used Scala and Python to convert Hive/SQL queries into RDD transformations in Apache Spark
Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark and Scala/Python
Involved in the migration process from Hadoop Java MapReduce programs to Spark Scala APIs
Extensively used Cloudera Manager and Cloudera Director to manage cluster nodes, services, administering cluster and assigning users, groups and roles for authorization
Hands on Experience of working in Big Data technologies - Hadoop, Hive, Spark (Python and Scala), Kafka on Hortonworks, EMR and HDInsight. Can build from scratch without any vendor
Set up and configured SFTP servers to import data from third party vendors
Designed and Developed Data ingestion and source data process from various sources using Apache Sqoop, Flume and Hadoop shell commands
Designed and Developed Data Pipelines (ETL processes) in Apache Spark using Spark Core, Spark SQL and Spark Streaming
Expertise in architecting Big data solutions using Data ingestion, Data Storage
Worked with AWS technologies such as EC2, ECS, VPC, Auto Scaling, Security Groups, AWS CLI and SNS, and with containerization (Docker) of applications
Created Data Pipelines to move data from Google Cloud to Amazon Redshift database.
Extensively used Intellij IDEA as integrated development environment, Git for version control and Maven for the project build
Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark)
Hands on experience in writing spark application for ETL using various Spark APIs
Extensively worked on RDD Lineage, Caching and Distributed persistence
Developed Spark programs to parse raw data and store it in pre-aggregated format.
Involved in converting Hive/SQL queries into Spark transformations using Spark SQL in Python
Performed transformations, cleaning and filtering on imported data using Hive and MapReduce, and loaded final data into HDFS.
Configured the Hadoop environment in the cloud through Amazon Web Services (AWS) to provide a scalable distributed data solution
Expertise in analyzing Spark DAG and Job execution step in Spark Web UI
Extensively worked on tuning spark applications using various optimization techniques like Serialization, Memory Management, Broadcasting and Repartitioning
Wrote an automated script to create an EMR cluster on AWS, process data and auto-terminate the cluster.
Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
Imported data from different sources like HDFS/Redshift into Spark RDDs.
Expertise in writing Queries and transformations in SparkSQL and Spark Core using Python.
Experience using the Oozie workflow scheduler to manage Hadoop jobs
Documented all processes, issues and their solutions, resolved errors and user activity in a timely manner
Certified seamless integration of third-party tools like Aginity, Tableau and SAS with Hive and Impala
Hands-on experience in working with Agile methodology. Reviewed and estimated Scrum user stories and created tasks in JIRA.
Managed the expectations across all the key stakeholders with respect to the overall schedule, due dates, interim milestones, deliverables and dependencies

Environment: Cloudera Enterprise CDH 5.12, Redis, Presto, JSON, Spark 2.2, Scala 2.11.x, Hadoop, Hive, Tez, HBase, Sqoop, Linux RHEL, Oracle 11g, Shell Programming, Aginity Workbench, Kafka, Tableau, SAS, Oracle SQL Developer

Confidential, Memphis, TN

Big Data Developer

Responsibilities:

Analyzed multiple sources of structured and unstructured data to propose and design data architecture solutions for scalability, high availability, fault tolerance, and elasticity
Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark and Scala/Python
Estimated the hardware requirements for NameNode and DataNodes and planned the cluster
Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters
Data Lake Change Data Capture - Designed & Developed Change Data capture capability during data ingestion into Data Lake
Worked with AWS Redshift, applying strong SQL and database design/modelling skills in AWS (Redshift/RDS)
Implemented Cloudera Distribution for Hadoop on a 40-node cluster
Designed the Data Lake on Hadoop platform with Raw Zone, Processed Zone, Refined zone
Designed and Developed a data ingestion framework to ingest data into HDFS using Sqoop, Flume, Kafka, HDFS commands and Spark applications
Worked with the Spark ecosystem using Scala, Python and Hive queries on different data formats like Text file and Parquet
Modeled the Database in Hive to migrate existing databases from RDBMS into Hive
Extensively used Hive Optimization techniques like Table Partitioning, Bucketing and optimized file formats
Developed scripts to migrate data from RDBMS to Hive and validated all the data after successful migration
Developed and Optimized Sqoop jobs for one-time data migration
Designed and developed Data Pipelines to transform raw data and load it into the Hive database
Developed Spark jobs to read data from files, apply transformation rules on RDDs and DataFrames, and load the refined data into the Hive database
Extensively used Single RDD Transformations, Multi-RDD Transformations, Pair RDD Transformations and RDD actions
Improved the performance of Spark applications using various techniques like Broadcast Variables, Accumulators, Kryo serialization, Repartition/Coalesce
Developed Unix shell scripts to load large numbers of files into HDFS from the Linux file system
Designed and implemented Hive queries and functions for evaluation, filtering, loading and storing of data.
Created Oozie workflows to schedule jobs

Environment: Cloudera Enterprise CDH 5, Oracle Exadata, Presto, Kafka, Tez, Linux SUSE 11, Oracle 11g, Teradata, IBM Information Server suite, Business Objects, Redis, JSON, Erwin 8.x, Visio 2010

Confidential, Glendale, WI

Big Data/Sr. ETL Developer

Responsibilities:

Involved in understanding of business processes and coordinated with business analysts to get specific user requirements
Hands-on experience in designing and implementing solutions using Hadoop, MapReduce, HBase, Hive, Oozie
Prepared source to target mapping document according to business requirement
Prepared low level design document for ETL Process
Created ETL best practices and standard document
Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
Implemented design patterns in Scala for the application.
Documented ETL test plans, test cases, test scripts, and validations based on design specifications for unit testing, system testing, functional testing, prepared test data for testing, error handling and analysis
Used appropriate partitioning method in DataStage jobs
Implemented type 2 slowly changing dimension
Extensively used DataStage stages like Row Generator, Column Generator, Head, and Peek for development and debugging purposes
Extensively worked with sequential file, dataset, file set and look up file set stages
Extensively worked with Join, Look up (Normal and Sparse) and Merge stages
Created a parallel job to implement business rules and transformations
Implemented de-duplication and duplicate-check logic to remove duplicates
Defined Stage variables for data validations and data filtering process
Parameterized DataStage jobs and also created multi-instance jobs to achieve reusability
Extensively worked with DataStage Job Sequences to control and execute DataStage Jobs and Job Sequences using various Activities and Triggers
Extensively worked with Job sequences using Job Activity, Email Notification, Sequencer, Wait for File activities to control and execute the DataStage Parallel jobs
Extensively wrote Routines and Transformer functions
Worked on performance tuning to address very critical and challenging issues
Used job monitor, score dump, Peek stage, performance analysis and resource estimation to tune DataStage parallel jobs
Implemented Audit, balance and control table for ETL process
Used DataStage Director and the runtime engine to schedule and run the server jobs, monitor scheduling and validate their components
Created multiple configuration files and defined logical nodes, scratch disk, Resource scratch disk and pools.
Wrote complex queries to extract and validate data
Extensive experience in writing Transact-SQL (DDL/DML) queries
Experience creating database objects like Tables, Indexes, Views, Triggers, Stored Procedures, User Defined Functions etc.
Extensively worked on writing complex SQL using joins, subqueries and functions
Experience creating Indexes to improve query performance
Experience in designing, creating, processing of cubes using SSAS
Experience setting up database connectivity from DataStage to source/target database servers
Wrote UNIX shell scripts for parsing datasets and transferring files over FTP
Defined UNIX shell scripts for file watcher and file archiving processes
Extensive experience in SVN
Developed a UNIX script to integrate IIS and SVN to commit ETL code to the SVN repository
Developed Autosys JIL to schedule DataStage jobs
Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis
Migrated jobs from development to QA to Production environments
Involved in component integration testing, system integration testing and UAT
Responsible for the creation and the maintenance of Analysis Services objects such as cubes, dimensions and Measures
Scheduled Cube Processing from Staging Database Tables using SQL Server Agent
Optimized cubes for better query performance
Identified the Measures and Dimensions from the Excel Sheet
Involved in release activity
Created detailed tasks for release activity
Provided 24/7 production support in time-pressured business environments

Environment: WebSphere DataStage 9.1, 8.5, DB2 UDB, Teradata, SQL Server 2008/2010, SSAS, MS Access, UNIX/AIX, MS Office Suite, Erwin 4.1, Kafka, MS Visio 2010, Toad, Autosys, SVN

Confidential, Columbus, OH

Sr. DataStage ETL Developer

Responsibilities:

Interacted with business analyst to understand the business requirements and in identifying data sources
Involved in understanding the scope of application, present schema, data model and defining relationship within and between the groups of data
Involved in creating specifications for ETL processes, finalized requirements and prepared specification document
Prepared source to target mapping documents with Transformation rules
Converted Logical Mapping Document into source to target Physical Mapping Document for ETL
Designed and Developed DataStage Jobs to Extract data from heterogeneous sources, Applied Transform Logics to extracted data and Loaded into Data Warehouse and Datamart
Extensively worked to load data into Teradata using Teradata utilities (BTEQ, FASTLOAD, FASTEXPORT, MULTILOAD, and TPUMP)
Implemented star schema and snowflake schema dimensional models
Developed DataStage jobs to implement slowly changing dimension
Imported Metadata from various Application Sources (Database tables, SalesForce.com, flat files, XML files) into DataStage
Defined stage variables for data validations and data filtering process
Extensively used parallel stages like Row Generator, Column Generator, Head, and Peek for development and debugging purposes
Extensively worked with Surrogate key generator to generate surrogate key
Created multiple configuration files with multiple nodes
Developed re-usable components using shared containers for local use or shared use
Parameterized DataStage jobs and also created multi-instance jobs to achieve reusability
Extensively wrote Routines and Transformer functions
Extensively worked with DataStage Job Sequences to control and execute DataStage Jobs and Job Sequences using various Activities and Triggers
Extensively used sequence job stages like job activity, Email Notification, Sequencer, Wait for File activities, start loop - end loop, execute command activity, user variable activity, routine activity etc.
Worked on performance tuning to address very critical and challenging issues
Created and modified database tables and indexes; also granted permissions for insert, update, delete and select statements
Developed UNIX Shell scripts to run DS jobs, transfer files over FTP and rename files
Developed complex stored procedures and queries using temp tables and joins
Designed and Developed data validation, load processes, test cases, and error control routines using PL/SQL
Performed Unit testing, Integration testing, System testing of DataStage Jobs and sequences. Also created test cases
Migrated jobs from development to QA to Production environments
Used Autosys job scheduler for automating delta run of DW cycle in both production and UAT environments
Coordinated with SIT and UAT teams to fix the Test Problem Reports
Excellent communication, interpersonal, analytical skills and strong ability to perform as part of a team

Environment: WebSphere DataStage 8.1, 8.0 (Parallel Extender), SalesForce.com, Teradata, DB2 UDB 9.0, Oracle 10g, MS Office Suite, UNIX/AIX, Erwin 4.1, MS Visio, Toad, MS Access

Confidential, Malvern, PA

Sr. DataStage Developer

Responsibilities:

Used IBM Datastage Designer to develop jobs for extracting, cleaning, transforming and loading data into data marts/data warehouse.
Developed several jobs to improve performance by reducing runtime using different partitioning techniques.
Used different stages of Datastage Designer like Lookup, Join, Merge, Funnel, Filter, Copy, Aggregator, and Sort etc.
Read complex flat files from mainframe machines by using the Complex Flat File Stage.
Sequential File, Aggregator, ODBC, Transformer, Hashed-File, Oracle OCI, XML, Folder, FTP Plug-in Stages were extensively used to develop the server jobs.
Used the EXPLAIN PLAN statement to determine execution plans in Oracle Database.
Worked on complex data coming from Mainframes (EBCDIC files), with knowledge of Job Control Language (JCL).
Used COBOL copybooks to import the Metadata information from mainframes.
Designed Datastage jobs using Quality Stage stages in 7.5 for data cleansing & data standardization Process. Implemented Survive stage & Match Stage for data patterns & data definitions.
Staged the data coming from various environments in a staging area before loading it into DataMarts.
Involved in writing Test Plans, Test Scenarios, Test Cases and Test Scripts and performed the Unit, Integration, system testing and User Acceptance Testing.
Used stage variables for source validations, to capture rejects and used Job Parameters for Automation of jobs.
Strong knowledge in creating procedures, functions, sequences, triggers.
Expertise in PL/SQL and SQL.
Performed debugging and unit testing and System Integrated testing of the jobs.
Wrote UNIX shell script according to the business requirements.
Wrote customized server/parallel routines according to complexity of the business requirements.
Designed strategies for archiving of legacy data.
Created shell scripts to perform validations and run jobs on different instances (DEV, TEST and PROD).
Created & Deployed SSIS (SQL Server Integration Services) Projects, Schemas and Configured Report Server to generate reports through SSRS SQL Server 2005.
Created ad hoc reports with MS SQL Server Reporting Services for the business users.
Used SQL Profiler to monitor the server performance, debug T-SQL and slow running queries.
Expertise in developing and debugging indexes, stored procedures, functions, triggers, cursors using T-SQL.
Wrote mapping documents for all the ETL Jobs (interfaces, Data Warehouse and Data Conversion activities).

Environment: IBM WebSphere DataStage and QualityStage 7.5, Ascential DataStage 7.5/EE (Parallel Extender), SQL Server 2005/2008, Linux, Teradata 12, Oracle 10g, Sybase, PL/SQL, Toad, UNIX (HP-UX), Cognos 8 BI

Confidential

DataStage Developer

Responsibilities:

Designed and developed mappings between sources and operational staging targets, using Star and Snowflake Schemas.
Provided data models and data maps (extract, transform and load analysis) of the data marts for systems in the aggregation effort.
Involved in Extracting, cleansing, transforming, integrating and loading data into data warehouse using Datastage Designer.
Developed various transformations based on customer last name, zip code for internal business analytical purposes, loaded warehouse based on customer credit card number with dynamic data re-partitioning.
Developed user-defined Routines and Transformations by using UniVerse BASIC.
Used Datastage Manager for importing metadata from repository, job categories and creating data elements.
Used the Datastage Director and the runtime engine to schedule and run the solution, test and debug its components and monitor the resulting executable versions (on an ad hoc or scheduled basis).
Developed and maintained programs for scheduling data loading and transformations using Datastage and Oracle
Developed Shell scripts to automate file manipulation and data loading procedures.

Environment: Datastage 5.2/6.0, Oracle 8i, SQL, TOAD, UNIX, Windows NT 4.0.
