Hadoop Developer / Data Engineer Resume
Chicago, IL
SUMMARY:
- Over 8 years of IT experience in Hadoop, Data Warehousing, Linux and JAVA.
- 2+ years of experience providing Big Data solutions using Hadoop 2.x, HDFS, MR2, YARN, Kafka, Spark, Scala, Pig, Hive, Tez, Sqoop, HBase, Cassandra, ZooKeeper, Oozie, UC4, Hue, CDH5 and HDP 2.x.
- Expertise in Sprint planning, story pointing, daily scrums, Sprint retrospectives and Sprint reviews.
- Good hands-on experience developing Hadoop applications on Spark using Scala as a functional and object-oriented programming language.
- Experienced in Big Data, Hadoop and NoSQL, and in components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode and the MapReduce2/YARN programming paradigm.
- Hands-on experience with job workflow scheduling and monitoring tools like Oozie and ZooKeeper.
- Experience in ETL, Data warehousing and Business intelligence.
- Implementation of Big data batch processes using Hadoop Map Reduce2, YARN, Tez, PIG and Hive.
- Strong experience working with real time streaming applications and batch style large scale distributed computing applications using tools like Spark Streaming, Kafka, Flume, MapReduce, and Hive.
- Good experience creating real-time data streaming solutions using Apache Spark/Spark Streaming, Apache Storm, Kafka and Flume.
- Proficient in using various IDEs such as Eclipse, MyEclipse and NetBeans.
- Used Cloudera Manager and Ambari for installation, management and monitoring of Hadoop clusters.
- Extensively worked on the ETL mappings, analysis and documentation of OLAP reports requirements.
- Good knowledge of Amazon AWS services such as EMR, EC2 and S3, which provide fast and efficient processing of Big Data.
- Worked with a trusted source for authentication in a Kerberos-enabled environment.
- Good knowledge of querying data from Cassandra for searching, grouping and sorting.
- Experience in importing and exporting data using Sqoop from HDFS/Hive/HBase to Relational Database Systems and vice-versa.
- Automated workflows and scheduled jobs using Oozie and UC4.
- Experience with the Kerberos authentication protocol, including enabling service accounts automatically.
- Solid understanding of OLAP concepts and challenges, especially with large data sets.
- Experience in integration of various data sources like Oracle, DB2, Sybase, SQL server and MS access and non-relational sources like flat files, CSV into staging area.
- Excellent programming skills at a higher level of abstraction using Scala and Spark.
- Good understanding of real-time data processing using Spark.
- Developed small distributed applications using ZooKeeper and scheduled the workflows using Oozie.
- Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Good understanding of NoSQL databases and hands-on experience writing applications on NoSQL databases such as Cassandra, HBase and MongoDB.
- Expertise in shell scripting on UNIX platform.
- Implemented Hive and Pig custom UDFs to transform large volumes of data per business requirements and achieve comprehensive data analysis.
- Implemented MapReduce jobs using Java.
- Experience in Java, J2EE technologies such as JDBC, JSP, Servlets, Hibernate, and AJAX.
- Good knowledge in Object Oriented Analysis and Design and solid understanding of Unified Modeling Language (UML).
- Experience in writing PIG scripts and Hive Queries for processing and analyzing large volumes of data.
- Experience in optimization of MapReduce algorithms using Combiners and Partitioners to deliver the best results.
- Highly proficient in Extract, Transform and Load the data into target systems using Informatica.
- Experience in managing and reviewing Hadoop log files.
- Experienced in writing complex shell scripts and schedule them using CRON to run on recurring basis.
- Hands on experience in application development using JAVA, RDBMS and Linux shell scripting.
- Strong knowledge of data warehousing, including Extract, Transform and Load Processes.
- Experience in Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment and Support).
- Used Spark Streaming to consume topics from the distributed messaging source Kafka and periodically push batches of data to Spark for near-real-time processing (see the sketch at the end of this summary).
- Strong knowledge of Massively Parallel Processing (MPP) databases, in which data is partitioned across multiple servers or nodes, each with its own memory and processors to process data locally.
- Worked with MPP databases such as Teradata and Netezza alongside HDFS storage.
- Expertise in development support activities including installation, configuration and successful deployment of changes across all environments.
- Used Impala for massively parallel processing of Hadoop jobs.
- Involved in Migration, Enhancement, Maintenance and Support of project.
- Ability to work independently as well as in a team and able to effectively communicate with customers, peers and management at all levels in and outside the organization.
- Good working experience with Agile/Scrum methodologies, including technical discussions with clients and daily scrum calls covering project analysis, specifications and development.
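Illustrative sketch of the Kafka-to-Spark-Streaming pattern referenced in the streaming bullet above. This is a minimal example assuming the Spark 2.x / Kafka 0.10 direct-stream integration; the broker list, topic and consumer group names are hypothetical placeholders, not actual project values.

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object SalesStream {
      def main(args: Array[String]): Unit = {
        // 30-second micro-batches; the interval is illustrative
        val ssc = new StreamingContext(new SparkConf().setAppName("SalesStream"), Seconds(30))

        // Broker list, topic and group id are placeholders
        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "broker1:9092,broker2:9092",
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "sales-stream",
          "auto.offset.reset"  -> "latest")

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc, PreferConsistent, Subscribe[String, String](Seq("sales-events"), kafkaParams))

        // Count records per micro-batch and log the result
        stream.map(_.value).count().foreachRDD { rdd =>
          rdd.collect().foreach(c => println(s"events in batch: $c"))
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }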
TECHNICAL SKILLS:
Big Data Technologies: Hadoop, HDFS, MR2, YARN, Pig, Hive, Tez, ZooKeeper, Sqoop, Kafka, Spark, HCatalog
NoSQL DBs: HBase, MongoDB, Cassandra
Schedulers: Oozie, UC4, Autosys
Hadoop Distributions: Cloudera CDH4, Hortonworks HDP 2.x
Programming Languages: Core Java, J2EE
Scripting Languages: JavaScript, Unix Shell Scripting, Python
Web Services: SOAP, Restful
Databases: Oracle 9i/10g, MySQL, Teradata
Technologies: J2EE, JDBC, Servlets, JSP, AJAX, XML, XSL
ETL and Reporting Tools: Informatica, BODS, BO Webi, Tableau
Tools: Putty, WinSCP, TOAD, GIT, FileZilla, SVN
IDE: Eclipse.
PROFESSIONAL EXPERIENCE:
Confidential - Chicago, IL
Hadoop Developer / Data Engineer
Responsibilities:
- Handled data imports and exports from various operational sources, performed transformations using Sqoop, Hive, Pig and MapReduce.
- Implemented partitioning, bucketing in Hive for better organization of the data.
- Deployed code into GIT version control and supported code validation after check-in.
- Performed various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Designed and implemented batch jobs using Sqoop, MR2, Pig, Hive and Tez.
- Generated extracts in HDFS synchronized with existing system reports.
- Implemented ETL jobs and applied suitable data modeling techniques.
- Implemented Hive custom UDFs to transform large volumes of data per business requirements and achieve comprehensive data analysis.
- Responsible for building scalable distributed data solutions using Hadoop.
- Enabled Kerberos user authentication on the Kerberized Hadoop cluster.
- Automated keytab management using Ambari, removing the need to manually manage keytabs during cluster configuration or topology changes.
- Developed ETL jobs using Spark-Scala to migrate data from Oracle to new Hive tables (a sketch follows this list).
- Developed Spark Streaming application for real time sales analytics.
- Data Cleansing and Processing through PIG and Hive.
- Data ingestion from Teradata to HDFS using automated Sqoop scripts.
- Worked with data serialization formats (Avro, Parquet, JSON, CSV) for converting complex objects into serialized byte sequences.
- Designed and implemented Map Reduce for distributed and parallel programming.
- Experienced in loading and transforming large sets of structured, semi-structured and unstructured data.
- Created and managed a Hive warehouse to store MapReduce results; wrote Pig scripts for data cleaning and ETL.
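A minimal sketch of the Spark-Scala Oracle-to-Hive migration referenced above, assuming Spark 2.x with Hive support on YARN; the JDBC URL, credentials, column and table names are hypothetical placeholders.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    object OracleToHive {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("OracleToHive")
          .enableHiveSupport()
          .getOrCreate()

        // Pull the source table over JDBC (connection details are placeholders)
        val orders = spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
          .option("dbtable", "SALES.ORDERS")
          .option("user", "etl_user")
          .option("password", sys.env.getOrElse("ORACLE_PWD", ""))
          .option("driver", "oracle.jdbc.OracleDriver")
          .load()

        // Land the data in a Hive table partitioned by order date so queries can prune partitions
        orders.write
          .mode(SaveMode.Overwrite)
          .partitionBy("ORDER_DATE")
          .saveAsTable("analytics.orders")

        spark.stop()
      }
    }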
Environment: Hadoop 2.x, HDFS, Spark, MR2, YARN, Pig, Hive, HDP 2.x, Tez, Ambari, ESP, Java, GIT, Eclipse, DataStage, Ab Initio, Teradata, TOAD. Cluster configuration: 84-node cluster, 2.2 PB of disk storage.
Confidential, Oaks, PA
Sr. Hadoop Data Engineer
Responsibilities:
- Handled data imports from various operational sources, performed transformations using Hive, Pig and MapReduce.
- Created Pig Latin scripts to support multiple data flows involving various data transformations on input data.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive
- Used Spark Streaming to consume topics from the distributed messaging source Kafka and periodically push batches of data to Spark for near-real-time processing.
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
- Deployed code into SVN version control and supported code validation after check-in.
- Involved in pre- and post-production deployment support for the code developed for each release, and fixed production deployment issues together with the support and configuration teams.
- Provided support to fix open production issues on a regular basis.
- Designed and implemented batch jobs using MR2, Pig, Hive and Tez.
- Experience working with job schedulers such as Autosys and the WFM tool.
- Created and supported code for historical and incremental data ingestion, data flattening, and curation logic for complex business scenarios; developed code and performed validation to support data movement from HDFS curation files into DB2 and Netezza databases for end-to-end business logic validation.
- Involved in creating code and supported unit testing for standardization of raw data from XML, Salesforce and JSON files with Pig.
- Used Kafka for log aggregation like collecting physical log files off servers and puts them in a central place like HDFS for processing.
- Used Kafka to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds.
- Implemented ad-hoc queries using HiveQL, created partitions to load data.
- Verified Hive incremental updates using a four-step strategy to load incremental data from RDBMS systems (a Spark SQL sketch of the reconcile step follows this list).
- Handled importing of data from various data sources, performed transformations using Hive, Map Reduce, Spark and loaded data into HDFS.
- Performed various data warehousing operations like de-normalization and aggregation on Hive using DML statements.
- Executed workflows in Autosys to validate automated tasks of pre-processing data with Pig, loading the data into HDFS and scheduling Hadoop tasks.
- Supported development of code for end-to-end creation of complex curation models and generated business reports from curated data, helping business users analyze day-to-day business.
- Performed unit testing support as per the standard development framework.
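The reconcile step of the four-step incremental-load strategy mentioned above can also be expressed with Spark SQL. A minimal sketch assuming Spark 2.x with Hive support; the table names, key column and timestamp column are hypothetical placeholders.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    object ReconcileIncremental {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ReconcileIncremental")
          .enableHiveSupport()
          .getOrCreate()

        // Base table plus the newly imported delta; schemas are assumed identical
        val base  = spark.table("stage.customer_base")
        val delta = spark.table("stage.customer_delta")

        // Keep only the most recent row per key across base and delta
        val latestFirst = Window.partitionBy("customer_id").orderBy(col("last_modified").desc)
        val reconciled = base.union(delta)
          .withColumn("rn", row_number().over(latestFirst))
          .filter(col("rn") === 1)
          .drop("rn")

        // Publish the reconciled result as the new reporting table
        reconciled.write.mode("overwrite").saveAsTable("stage.customer_reconciled")

        spark.stop()
      }
    }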
Environment: Hadoop 2.x, HDFS, MR2, YARN, PIG, HIVE, HDP2.x, Zookeeper, Tez, Ambari, Autosys, Java, SVN, Spark, Kafka, Eclipse, Informatica.
Confidential, Dallas, TX
GWRS Hadoop Consultant
Responsibilities:
- Developed data pipeline using Flume, Sqoop, Pig and Java map reduce to ingest customer behavioural data and financial histories into HDFS for analysis.
- Involved in writing MapReduce jobs.
- Used Sqoop and HDFS put/copyFromLocal to ingest data.
- Used Pig for transformations, event joins, bot-traffic filtering and pre-aggregations before storing the data in HDFS.
- Hands-on experience developing Pig UDFs for functionality not available out of the box in Apache Pig.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Good experience in developing Hive DDLs to create, alter and drop Hive TABLES.
- Developed Hive UDFs for functionality not available out of the box in Apache Hive.
- Along with the infrastructure team, designed and developed a Kafka and Storm based data pipeline; the pipeline also uses Amazon Web Services EMR, S3 and RDS (see the producer sketch after this list).
- Used HCatalog to access Hive table metadata from MapReduce and Pig code.
- Used Java MapReduce to compute metrics that define user experience, revenue, etc.
- Implemented Kafka-Storm topologies capable of handling and channeling high-volume data streams, and integrated the Storm topologies with Esper to filter and process that data across multiple clusters for complex event processing.
- Involved in implementing and integrating NoSQL databases such as HBase and Cassandra.
- Queried and analyzed data from Cassandra for quick searching, sorting and grouping through CQL.
- Responsible for developing a data pipeline using Flume, Sqoop and Pig to extract data from weblogs and store it in HDFS. Designed and implemented various metrics that can statistically signify the success of an experiment.
- Worked on AWS cloud to create EC2 instance and installed Java, Zookeeper and Kafka on those instances.
- Worked on S3 buckets on AWS to store CloudFormation templates.
- Used Eclipse and Ant to build the application.
- Used Sqoop for importing and exporting data into HDFS and Hive.
- Responsible for processing ingested raw data using MapReduce, Apache Pig and Hive.
- Developed Pig scripts for change data capture and delta record processing between newly arrived data and data already existing in HDFS.
- Involved in pivoting HDFS data from rows to columns and columns to rows.
- Emitted processed data from Hadoop to relational databases or external file systems using Sqoop and HDFS get/copyToLocal.
- Involved in developing Shell scripts to orchestrate execution of all other scripts (Pig, Hive, and MapReduce) and move the data files within and outside of HDFS.
- Had a couple of workshops on Spark, RDD & spark-streaming.
- Developed ETL jobs using Spark-Scala to migrate data from Oracle to new Cassandra tables.
- Discussed implementation approaches for concurrent programming in Spark using Python with message passing.
- Involved in discussions of Spark SQL and Spark MLlib.
- Followed the Agile Scrum methodology.
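A minimal sketch of the producer side of the Kafka-fed pipeline mentioned above; the broker address, topic name and log path are hypothetical placeholders, and the downstream Storm/HDFS consumers are assumed to exist separately.

    import java.util.Properties
    import scala.io.Source
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

    object LogShipper {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)
        try {
          // Ship each log line to a central topic; downstream consumers (Storm, HDFS sinks) take it from there
          for (line <- Source.fromFile("/var/log/app/events.log").getLines())
            producer.send(new ProducerRecord[String, String]("weblog-events", line))
        } finally {
          producer.close()
        }
      }
    }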
Environment: Hadoop, HDFS, MapReduce, Pig, Hive, HBase, Sqoop, Flume, Oozie, ZooKeeper, Kafka, Spark, Storm, AWS EMR, HDP, Java, JUnit, Python, JavaScript, Oracle, MySQL, NoSQL, Teradata, MongoDB, Cassandra, Tableau, Linux and Windows.
Confidential - Minneapolis, MN
Hadoop Consultant
Responsibilities:
- Coordinated between multiple cluster teams for business queries and migration.
- Evaluated the Hadoop platform and its ecosystem tools for batch processing.
- Responsible for building scalable distributed data solutions using Hadoop .
- Designed the system workflow from data extraction to reaching customers.
- Data ingestion from Teradata to HDFS using automated Sqoop scripts.
- Designed and implemented Map Reduce for distributed and parallel programming.
- Designed and implemented a rules engine using regular expressions to identify partners with high confidence (a sketch follows this list).
- Created and managed a Hive warehouse to store MapReduce results; wrote Pig scripts for data cleaning and ETL.
- Used UC4 and Oozie Scheduler to automate the workflows based on time and data availability.
- Involved in moving the final results into Cassandra data base for transactional and activation needs.
- Implemented email marketing using SendGrid with the required partner activation documents.
- Experienced in managing and reviewing Hadoop log file.
- Worked on installing clusters, commissioning and decommissioning of data nodes, NameNode recovery, capacity planning, and slots configuration.
- Used Hortonworks Data Platform and the eBay crawler.
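A minimal sketch of the regex-based rules engine mentioned above; the rule patterns and partner names are illustrative placeholders, not the actual production rules.

    import scala.util.matching.Regex

    object PartnerMatcher {
      // Each rule pairs a partner id with a pattern that signals that partner in the raw record
      private val rules: Seq[(String, Regex)] = Seq(
        "PARTNER_A" -> """(?i)\bpartner-?a\.com\b""".r,
        "PARTNER_B" -> """(?i)\bpartner\s*b\b""".r)

      // Return the first partner whose pattern matches the record, if any
      def identify(record: String): Option[String] =
        rules.collectFirst { case (partner, pattern) if pattern.findFirstIn(record).isDefined => partner }
    }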
Environment: Hadoop 2.x - HDP 2.1, HDFS, MR, PIG, Hive, Yarn, Apache Sqoop, Oozie, UC4, Cassandra, eBay Crawler, Java, Java Mail, Rest API, Teradata, Shell Script, GIT, Rally.
Confidential
Sr Engineer (ETL Developer)
Responsibilities:
- Installed and configured Business Objects Data Services 4.1 with SAP BI and ECC, handling SAP DS admin activities and server configuration.
- Involved in writing validation rules and generating scorecards, Data Insights, Metapedia and Cleansing Package Builder content using Information Steward 4.x.
- Configured different repositories (Local, Central, and Profiler) and job server.
- Involved in meetings with functional users, to determine the flat file, Excel layouts, data types and naming conventions of the column and table names.
- Prepared mapping documents capturing all the rules from the business.
- Created SAP BW connection to interact with SAP BODS using RFC connection.
- Created InfoObjects, InfoSources and InfoAreas for SAP BW.
- Created multiple datastore configurations in the Data Services local object library with different databases to create a unified datastore.
- Using Data Services, created batch and incremental loads (change data capture) and wrote initialization scripts that control workflows and data flows.
- Created Data Integrator mappings to load the data warehouse; the mappings involved extensive use of simple and complex transformations such as Key Generator, Table Comparison, Case, Validation, Merge and Lookup in data flows.
- Created Data Flows for dimension and fact tables and loaded the data into targets in SAP BW
- Tuned Data Integrator mappings and transformations for better job performance in different ways, such as indexing the source tables and using the Data Transfer transform.
- Scheduled jobs to run daily.
Environment: SAP BODS 4.1, Oracle11g, SAP ECC, SAP BW 7.3,SAP BO 4.0, Windows.