Hadoop Developer Resume
SUMMARY
- Around 6+ years of experience in the IT industry covering the complete software development life cycle (SDLC), including business requirements gathering, system analysis and design, data modeling, development, testing, and implementation of projects.
- Experience in configuring, deploying, and managing different Hadoop distributions such as Cloudera (CDH4 & CDH5) and Hortonworks (HDP).
- Experience importing and exporting data with Sqoop between the Hadoop Distributed File System (HDFS) and relational database systems.
- Good understanding of MapReduce programs.
- Experience across the Hadoop ecosystem in the ingestion, storage, querying, processing, and analysis of big data.
- Strong skills in querying using Hive, Pig, HBase, Spark SQL, and MongoDB.
- Hands-on experience designing ETL operations including data extraction, data cleansing, data transformation, and data loading.
- Experience with optimization techniques for the sort and shuffle phases of MapReduce programs; implemented optimized joins that combine data from different data sources.
- Experience in defining job flows and in managing and reviewing Hadoop log files.
- Created and maintained tables, views, procedures, functions, packages, database triggers, and indexes.
- Used Sqoop to import data from RDBMS sources into Hive tables.
- Developed MapReduce jobs in Java to preprocess data.
- Solid understanding of all phases of development using multiple methodologies, e.g., Agile with JIRA and Kanban boards, along with the ticketing tools Remedy and ServiceNow.
- Involved in HDFS maintenance and loading of structured and unstructured data.
- Created Hive internal/external tables and worked on them using HiveQL.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data, and managed data coming from different sources.
- Experience handling various file formats such as Avro, SequenceFile, text, XML, JSON, and Parquet, with compression codecs such as gzip, LZO, and Snappy.
- Imported data from HDFS into Spark DataFrames for in-memory computation to generate optimized output and better visualizations (see the sketch at the end of this summary).
- Experience collecting real-time streaming data and building pipelines that ingest raw data from different sources with Kafka, storing it in HDFS and NoSQL databases using Spark.
- Implemented a POC using Impala for data processing on top of Hive for better resource utilization.
- Knowledge of NoSQL databases (HBase, Cassandra) and their integration with Hadoop clusters.
- Experienced with Oozie for automating data movement between different Hadoop systems.
- Good understanding of Hadoop security requirements and integration with Kerberos authentication and authorization infrastructure; mentored analysts and the test team in writing Hive queries.
- Experience in writing Hive queries for processing and analyzing large volumes of data.
- Interacted effectively with members of Business Engineering, Quality Assurance, and other teams involved in the System Development Life Cycle.
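A minimal Scala sketch of the HDFS-to-DataFrame pattern described above; the paths, column names, and aggregation are hypothetical placeholders rather than project specifics.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HdfsToDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HdfsToDataFrameSketch")
      .getOrCreate()

    // Read Parquet files from a hypothetical HDFS landing directory.
    val events = spark.read.parquet("hdfs:///data/landing/events")

    // In-memory aggregation: daily event counts per source system.
    val dailyCounts = events
      .groupBy(col("source_system"), to_date(col("event_ts")).as("event_date"))
      .count()

    // Write the optimized output back to HDFS as Snappy-compressed Parquet.
    dailyCounts.write
      .mode("overwrite")
      .option("compression", "snappy")
      .parquet("hdfs:///data/curated/daily_event_counts")

    spark.stop()
  }
}
```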
PROFESSIONAL EXPERIENCE
Hadoop Developer
Confidential
Responsibilities:
- Involved in the complete big data flow of the application: data ingestion from upstream systems into HDFS, processing the data in HDFS, and analyzing it with several tools.
- Performed requirements gathering, analysis, design, development, and testing of the application using Agile/Scrum methodology.
- Imported data in various formats such as text, CSV, Avro, and Parquet into the HDFS cluster, applying compression for optimization.
- Ingested data from RDBMS sources such as Oracle, SQL Server, and Teradata into HDFS using Sqoop.
- Configured Hive and wrote Hive UDFs and UDAFs; created static and dynamic partitions with bucketing (dynamic partitioning is sketched at the end of this role).
- Imported and exported data into HDFS and Hive using Sqoop (batch) and Kafka (streaming).
- Used Hive join queries to join multiple tables of a source system and load the results into the data lake.
- Experience in managing and reviewing huge Hadoop log files.
- Involved in HDFS maintenance and loading of structured and unstructured data.
- Implemented Data Integrity and Data Quality checks in Hadoop using Hive and Linux scripts.
- Involved in migration of the data from Oracle to Hadoop data lake using Sqoop import.
- Implemented schema extraction for Parquet and Avro file Formats in Hive.
- Created Apache Oozie workflows and coordinators to schedule and monitor various jobs, including Sqoop, Hive, and shell script actions.
- Designed and implemented MapReduce jobs to support distributed processing using Java, Hive, Spark SQL, Apache Pig, and Oozie.
- Implemented Spark RDD transformations and actions to migrate MapReduce algorithms to Scala and Spark SQL for faster testing and processing of data.
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Automated deployments of ETL applications that reside on top of the Hadoop cluster.
- Used Pig as an ETL tool for transformations, event joins, and pre-aggregations before storing the data in HDFS.
- Created data pipelines per business requirements and scheduled them using Oozie coordinators.
- Maintained technical documentation for each step of the development environment, including HLD and LLD.
- Involved in developing, building, testing, and deploying to the Hadoop cluster in distributed mode.
- Gathered the business requirements from the Business Partners and Subject Matter Experts.
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
- Extensively used ESP workstation to schedule the Oozie jobs.
- Understood the security requirements for Hadoop and integrated with the Kerberos authentication and authorization infrastructure.
- Built the automated build and deployment framework using GitHub and Maven.
- Worked with BI tools such as Tableau to create weekly, monthly, and daily dashboards and reports in Tableau Desktop, publishing them against data in the HDFS cluster.
- Created reports in Tableau for business data visualization.
Environment: Hadoop, HDFS, Hive, Oozie, Sqoop, Spark, ETL, ESP Workstation, Shell Scripting, HBase, GitHub, Tableau, Oracle, MySQL, Agile/Scrum
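A rough Scala/Spark SQL sketch of the dynamic-partition Hive loading pattern described in this role; the database, tables, columns, and partition key are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

object HiveDynamicPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveDynamicPartitionSketch")
      .enableHiveSupport()
      .getOrCreate()

    // Allow fully dynamic partition inserts.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Hypothetical target table, partitioned by load date.
    spark.sql("""
      CREATE TABLE IF NOT EXISTS sales.orders_part (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(12,2)
      )
      PARTITIONED BY (load_date STRING)
      STORED AS PARQUET
    """)

    // Dynamic-partition load from a hypothetical staging table;
    // the partition column must be the last column in the SELECT.
    spark.sql("""
      INSERT OVERWRITE TABLE sales.orders_part PARTITION (load_date)
      SELECT order_id, customer_id, amount, load_date
      FROM sales.orders_staging
    """)

    spark.stop()
  }
}
```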
Hadoop Developer
Confidential, Herndon, VA
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop and Spark with Scala.
- Developed solutions to process data into HDFS, analyzed the data using MapReduce, Pig, and Hive, and produced summary results from Hadoop for downstream systems.
- Used Kettle extensively to import data from various sources such as MySQL into HDFS.
- Applied various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Involved in creating Hive tables, and then applied HiveQL on those tables for data validation.
- Moved data from Hive tables into MongoDB collections.
- Used Zookeeper for various types of centralized configurations.
- Involved in loading and transforming large sets of structured, semi-structured, and unstructured data and analyzing them by running Hive queries and Pig scripts.
- Managed and reviewed Hadoop log files.
- Tested raw data and executed performance scripts.
- Shared responsibility for administration of Hadoop, Hive and Pig.
- Implemented Spark applications in Scala, using DataFrames and the Spark SQL API for faster processing of batch and real-time streaming data.
- Developed scripts to perform business transformations on the data using Hive and Impala for downstream applications.
- Handled large datasets using partitioning, Spark broadcast variables, and effective, efficient joins and transformations during the ingestion process itself.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames, Spark RDDs, and Scala.
- Created a common data lake for migrated data to be used by other members of the team.
- Used built-in Spark operators such as map, flatMap, filter, reduceByKey, groupByKey, aggregateByKey, and combineByKey (see the sketch at the end of this role).
- Worked with different file formats (SequenceFile, Avro, RCFile, Parquet, and ORC) and compression codecs (gzip, Snappy, LZO).
- Developed complex ETL transformations and performed performance tuning for loads from flat files into databases.
- Developed complex ETL jobs from various sources such as SQL Server, PostgreSQL, and flat files, and loaded them into target databases using the Talend Open Studio ETL tool.
- Created Talend jobs using the dynamic schema feature.
- Imported and exported data using Sqoop between HDFS and relational databases (Oracle and Netezza).
- Implemented partitioning, dynamic partitions, and buckets in Hive.
- Worked extensively with Hive, SQL, Scala, Spark, and shell scripting.
- Developed a data pipeline using Kafka to store data into HDFS.
- Wrote Spark RDD transformations and actions on input data, and Spark SQL queries over DataFrames, to import data from data sources, perform transformations and read/write operations with Spark Core, and save the results to an output directory in HDFS.
- Responsible for the design and development of Spark applications in Scala that interact with Hive and MySQL databases.
- Used Oozie workflows to automate and schedule daily jobs.
- Experience with job control tools like Autosys.
- Scheduled and managed cron jobs and wrote shell scripts to generate alerts.
- Hands-on experience installing, configuring, and using ecosystem components such as Hadoop, MapReduce, and HDFS.
Environment: Linux, Eclipse, JDK 1.8.0, Hadoop 2.9.0, HDFS, MapReduce, Hive 2.3, Kafka 2.11.2, CDH 5.4.0, Oozie 4.3.0, Sqoop 1.4.7, Tableau, Talend Open Studio (6.1.1, 6.2.1), Shell Scripting, RabbitMQ, Scala 2.12, Spark 2, Python 3.6/3.5/3.4, Maven
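A small self-contained Scala sketch of the Spark RDD operators listed in this role (map, filter, reduceByKey); the input path and record layout are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object RddOperatorsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddOperatorsSketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical delimited input: "user_id|event_type|payload"
    val lines = sc.textFile("hdfs:///data/raw/events/*.txt")

    val eventCountsByUser = lines
      .map(_.split('|'))              // parse each record on the pipe delimiter
      .filter(_.length >= 2)          // drop malformed rows
      .map(fields => (fields(0), 1))  // key by user_id
      .reduceByKey(_ + _)             // count events per user

    // Action: materialize a small sample on the driver.
    eventCountsByUser.take(10).foreach(println)

    spark.stop()
  }
}
```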
Hadoop Developer
Confidential, Washington D.C.
Responsibilities:
- Involved in all phases of the Software Development Life Cycle (SDLC) and worked on all activities related to the development, implementation, and support of Hadoop.
- Installed and configured Apache Hadoop clusters for application development along with Hadoop tools such as Hive, Pig, HBase, ZooKeeper, and Sqoop.
- Primarily involved in the data migration process on Azure, integrating with a GitHub repository and Jenkins.
- Played a key role in installation and configuration of the various Hadoop ecosystem tools such as Solr, Kafka, Pig, HBase and Cassandra.
- Implemented multiple MapReduce jobs in Java for data cleansing and pre-processing.
- Wrote complex Hive queries and UDFs in Java and Python.
- Extensive experience with both the Hadoop 1 architecture (master-slave) and the Hadoop 2 architecture (YARN).
- Extensive knowledge of and experience with real-time data streaming technologies such as Kafka, Storm, and Spark Streaming.
- Involved in creating a Spark cluster in HDInsight by provisioning Azure compute resources with Spark installed and configured.
- Involved in implementing HDInsight version 3.3 clusters, which are based on Spark version 1.5.1.
- Good knowledge of the components used in the cluster, such as the Spark stack (Spark Core, Spark SQL, and Spark Streaming APIs).
- Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Pig, and Hive.
- Implemented Amazon EMR for processing big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Worked with different data sources such as Avro data files, XML files, JSON files, SQL Server, and Oracle to load data into Hive tables.
- Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data into DataFrames, and loaded it into Cassandra (see the sketch at the end of this role).
- Involved in data acquisition, data pre-processing, and data exploration for a telecommunication project in Scala.
- Installed the Oozie workflow engine to run multiple MapReduce, Hive HQL, and Pig jobs.
- Loaded huge amounts of data into HDFS using Apache Kafka.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Performed transformations, cleaning, and filtering on imported data using Hive, MapReduce, and Impala, and loaded the final data into HDFS.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts; managed and reviewed Hadoop log files.
- Worked with cloud services such as Amazon Web Services (AWS) and was involved in ETL, data integration, and migration.
- Converted all the vamp processing from Netezza and re-implemented it using Spark DataFrames and RDDs.
- Implemented a proof of concept (POC) using Kafka, Storm, and HBase for processing streaming data.
- Implemented a script to transmit sysprin information from Oracle to HBase using Sqoop.
Environment: Hadoop, MapReduce, Spark, Shark, Kafka, Cloudera, AWS, HDFS, ZooKeeper, Hive, Pig, Oozie, Core Java, Eclipse, HBase, Sqoop, Netezza, EMR, Apache NiFi, Flume, Scala, Oracle 11g, Cassandra, SQL, Python, SharePoint, Azure 2015, Git, UNIX Shell Scripting, Linux, Jenkins, and Maven.
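A condensed Scala sketch of the Kafka to Spark Streaming to Cassandra flow described in this role; the broker, topic, keyspace, table, and record layout are hypothetical, and the code assumes the spark-streaming-kafka-0-10 and spark-cassandra-connector libraries are on the classpath.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import com.datastax.spark.connector._

object KafkaToCassandraSketch {
  def main(args: Array[String]): Unit = {
    // Cassandra contact point is a hypothetical host name.
    val conf = new SparkConf()
      .setAppName("KafkaToCassandraSketch")
      .set("spark.cassandra.connection.host", "cassandra-host")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka-broker:9092", // hypothetical broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "feed-consumer",
      "auto.offset.reset" -> "latest"
    )

    // Subscribe to a hypothetical topic carrying CSV records: "sensor_id,ts,value"
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("sensor-feed"), kafkaParams))

    stream
      .map(record => record.value().split(","))
      .filter(_.length == 3)
      .map(f => (f(0), f(1), f(2).toDouble))
      // Keyspace/table/columns are hypothetical; saveToCassandra comes from the connector.
      .foreachRDD(_.saveToCassandra("telemetry", "readings",
        SomeColumns("sensor_id", "event_ts", "reading")))

    ssc.start()
    ssc.awaitTermination()
  }
}
```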