Hadoop Developer Resume

Kansas City, MO

SUMMARY

  • Hadoop Developer with 7+ years of experience building big data applications and data pipelines, creating data lakes to manage structured and semi-structured data, and implementing workflows using big data ecosystems such as Hadoop, Spark, and Kafka.
  • Expertise in developing applications using Java, Scala and Python.
  • Expertise in working with Hadoop Distributions like EMR, Hortonworks, and Cloudera.
  • In-depth understanding of Hadoop architecture and its components, such as ResourceManager, NodeManager, ApplicationMaster, NameNode, and DataNode.
  • Worked extensively on real-time streaming data pipelines using Spark Streaming and Kafka.
  • Extensive experience writing end-to-end Spark applications in both Scala and Python using Spark RDDs, Spark DataFrames, Spark SQL, and Spark Streaming.
  • Experienced in troubleshooting long-running Spark jobs and resolving performance bottlenecks.
  • Expertise in writing DDL and DML scripts for analytics applications in Hive.
  • Experienced in Python development for various ETL and data analytics applications, as well as working with Python libraries like Matplotlib, NumPy, SciPy, and pandas for data analysis.
  • Expertise in working with AWS cloud services like EC2, S3, Redshift, EMR, Lambda, DynamoDB, RDS, Glue, and Athena for big data development.
  • Expertise in Hive optimization techniques such as partitioning, bucketing, vectorization, map-side joins, bucket map joins, and skew joins.
  • Expertise in debugging and tuning failed and long-running Spark applications using executor tuning, memory management, serialization, broadcasting, and persistence to ensure optimal application performance.
  • Experience working with batch processing and operational data sources and migration of data from traditional databases to Hadoop and NoSQL databases.
  • Experienced with different file formats like Parquet, ORC, CSV, Text, XML, JSON, and Avro files.
  • Expertise in data ingestion using Flume, Sqoop, and NiFi.
  • Experience in orchestrating workflows using Oozie and Airflow.
  • Good knowledge of building and maintaining highly scalable, fault-tolerant infrastructure in AWS spanning multiple Availability Zones.
  • Passionate about gleaning insightful information from massive datasets and developing a culture of sound, data-driven decision making.
  • Good team player who likes to take initiative and seek out new challenges.
  • Excellent communication skills; self-motivated, enthusiastic learner who can work in a fast-paced, multitasking environment both independently and in a collaborative team.
  • Involved in all phases of the Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and in Agile methodologies.

TECHNICAL SKILLS

Big Data Technologies: Spark, Hive, HDFS, Apache NiFi, MapReduce, Sqoop, HBase, Oozie, Impala, and Kafka

Hadoop Distributions: Cloudera, HDP and EMR

Languages: Java, Scala, Python and SQL

NoSQL Databases: HBase, Cassandra, and MongoDB

AWS Services: EC2, EMR, Redshift, RDS, S3, AWS Lambda, CloudWatch, Glue, Athena

Databases: MySQL, Teradata, Oracle

Other tools: JIRA, GitHub, Jenkins

PROFESSIONAL EXPERIENCE

Confidential, Kansas City, MO

Hadoop Developer

Responsibilities:

  • Worked on building a centralized data lake on AWS Cloud using primary services such as S3, EMR, Redshift, Athena, and Glue.
  • Hands-on experience building and deploying Spark applications for ETL workloads on large datasets.
  • Built a series of Spark applications and Hive scripts to produce analytical datasets needed by digital marketing teams.
  • Worked extensively on building and automating data ingestion pipelines and moving terabytes of data from existing data warehouses to the cloud.
  • Worked extensively on fine-tuning Spark applications and providing production support for pipelines running in production.
  • Worked closely with business and data science teams to ensure all requirements were translated accurately into our data pipelines.
  • Developed PySpark-based pipelines using Spark DataFrame operations to load data into the EDL, with EMR for job execution and AWS S3 as the storage layer.
  • Worked on the full spectrum of data engineering pipelines: data ingestion, data transformation, and data analysis/consumption.
  • Developed AWS Lambda functions in Python and used Step Functions to orchestrate data pipelines.
  • Worked on automating infrastructure setup, including launching and terminating EMR clusters.
  • Created Hive external tables on top of datasets loaded in S3 buckets and wrote Hive scripts to produce a series of aggregated datasets for downstream analysis.
  • Built a real-time streaming pipeline using Kafka, Spark Streaming, and Redshift.
  • Created Kafka producers using the Kafka Java Producer API to connect to an external REST live-stream application and publish messages to a Kafka topic.
  • Implemented a continuous delivery pipeline with Maven, GitHub, and Jenkins.
  • Documented operational problems in Jira, following established standards and procedures.

Environment: AWS S3, EMR, Lambda, Redshift, Athena, Glue, Spark, Scala, Python, Java, Hive, Kafka, PySpark, GitHub, Jira.

Confidential, Deerfield, IL

Hadoop Developer

Responsibilities:

  • Responsible for ingesting large volumes of user behavioral data and customer profile data into the analytics data store.
  • Developed custom multi-threaded Java-based ingestion jobs as well as Sqoop jobs for ingesting data from FTP servers and data warehouses.
  • Developed PySpark- and Scala-based Spark applications for the data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed by machine learning and reporting teams.
  • Troubleshot Spark applications to make them more error-tolerant.
  • Worked on fine-tuning Spark applications to improve overall processing time for the pipelines.
  • Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
  • Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase and MongoDB.
  • Handled large datasets using Spark's in-memory capabilities, broadcast variables, efficient joins, and transformations.
  • Worked extensively with Sqoop for importing data from Oracle.
  • Designed and customized data models for a data warehouse supporting data from multiple sources in real time.
  • Worked with EMR clusters in the AWS cloud, along with S3, Redshift, and Snowflake.
  • Wrote Glue jobs to migrate data from HDFS to the S3 data lake.
  • Involved in creating Hive tables and loading and analyzing data using Hive scripts.
  • Implemented partitioning, dynamic partitions, and buckets in Hive.
  • Used Bamboo for continuous integration of applications.
  • Used reporting tools like Tableau, connected to Impala, to generate daily reports.
  • Collaborated with the infrastructure, network, database, application and BA teams to ensure data quality and availability.
  • Documented operational problems in JIRA, following established standards and procedures.

Environment: Hadoop, Spark, Scala, Python, Hive, HBase, MongoDB, Sqoop, Oozie, Kafka, Snowflake, Amazon EMR, Glue, YARN, JIRA, Amazon AWS, Shell Scripting, SBT, GitHub, Maven.

Confidential

Hadoop Developer

Responsibilities:

  • Wrote Spark applications in Scala to perform data cleansing, validation, transformation, and summarization according to requirements.
  • Loaded data into Spark RDDs and performed in-memory computation to generate the required output.
  • Developed data pipelines using Spark, Hive, and Sqoop to ingest, transform, and analyze operational data.
  • Developed Spark and Hive jobs to summarize and transform data.
  • Worked on performance tuning of Spark applications to reduce job execution times.
  • Tuned Spark jobs by changing configuration properties and using broadcast variables.
  • Streamed data in real time using Spark with Kafka; responsible for handling streaming data from web server console logs.
  • Worked with different file formats, including text files, Avro, Parquet, JSON, XML, and flat files, using MapReduce programs.
  • Developed a daily process for incremental import of data from DB2 and Teradata into Hive tables using Sqoop.
  • Wrote Pig scripts to generate transformations and performed ETL procedures on data in HDFS.
  • Resolved performance issues in Hive and Pig scripts by understanding how joins, grouping, and aggregation translate to MapReduce jobs.
  • Worked with cross-functional consulting teams within the data science and analytics team to design, develop, and execute solutions that derive business insights and solve clients' operational and strategic problems.
  • Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Extensively used HiveQL to query data in Hive tables and loaded data into HBase tables.
  • Worked extensively with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables; and optimized Hive queries.
  • Involved in collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis.
  • Assisted analytics team by writing Pig and Hive scripts to perform further detailed analysis of the data.
  • Designed Oozie workflows for job scheduling and batch processing.

Environment: Java, Scala, Apache Spark, MySQL, CDH, IntelliJ IDEA, Hive, HDFS, YARN, MapReduce, Sqoop, Pig, Flume, Unix Shell Scripting, Python, Apache Kafka.
