- Hadoop Developer with over 7 years of experience building big data applications and data pipelines, creating data lakes to manage structured and semi-structured data, and implementing workflows using big data technologies such as Hadoop, Spark, and Kafka.
- Expertise in developing applications using Java, Scala and Python.
- Expertise in working with Hadoop Distributions like EMR, Hortonworks, and Cloudera.
- In-depth understanding of Hadoop architecture and its components, such as the Resource Manager, Node Manager, Application Master, NameNode, and DataNode.
- Worked extensively on real-time streaming data pipelines using Spark Streaming and Kafka.
- Extensive experience writing end-to-end Spark applications in both Scala and Python, using Spark RDDs, DataFrames, Spark SQL, and Spark Streaming.
- Experienced in troubleshooting long-running Spark jobs and resolving performance bottlenecks.
- Expertise in writing DDL and DML scripts for analytics applications in Hive.
- Experienced in Python development for ETL and data analytics applications, including Python libraries such as Matplotlib, NumPy, SciPy, and pandas for data analysis.
- Expertise in working with AWS cloud services like EC2, S3, Redshift, EMR, Lambda, DynamoDB, RDS, Glue, and Athena for big data development.
- Expertise in Hive optimization techniques such as partitioning, bucketing, vectorization, map-side joins, bucket map joins, and skew joins.
- Expertise in debugging and tuning failed and long-running Spark applications through executor tuning, memory management, serialization, broadcasting, and persistence to ensure optimal application performance.
- Experience working with batch processing and operational data sources and migration of data from traditional databases to Hadoop and NoSQL databases.
- Experienced with different file formats like Parquet, ORC, CSV, Text, XML, JSON, and Avro files.
- Expertise in data ingestion using Flume, Sqoop, and NiFi.
- Experience in orchestrating workflows using Oozie and Airflow.
- Good knowledge of building and maintaining highly scalable, fault-tolerant infrastructure in AWS spanning multiple availability zones.
- Passionate about gleaning insightful information from massive datasets and developing a culture of sound, data-driven decision making.
- Good team player who takes initiative and seeks out new challenges.
- Excellent communication skills; able to work in a fast-paced, multitasking environment both independently and in a collaborative team; self-motivated, enthusiastic learner.
- Involved in all phases of the Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and experienced with Agile methodologies.
Big Data Technologies: Spark, Hive, HDFS, Apache NiFi, MapReduce, Sqoop, HBase, Oozie, Impala, and Kafka.
Hadoop Distributions: Cloudera, HDP and EMR
Languages: Java, Scala, Python and SQL
NoSQL Databases: HBase, Cassandra, and MongoDB
AWS Services: EC2, EMR, Redshift, RDS, S3, AWS Lambda, CloudWatch, Glue, Athena
Databases: MySQL, Teradata, Oracle
Other tools: JIRA, GitHub, Jenkins
Confidential, Kansas City, MO
- Built a centralized data lake on AWS using core services such as S3, EMR, Redshift, Athena, and Glue.
- Built and deployed Spark applications to perform ETL workloads on large datasets.
- Built a series of Spark applications and Hive scripts to produce analytical datasets for digital marketing teams.
- Worked extensively on building and automating data ingestion pipelines and moving terabytes of data from existing data warehouses to cloud.
- Worked extensively on fine-tuning Spark applications and providing support for production pipelines.
- Worked closely with business and data science teams to ensure all requirements were translated accurately into the data pipelines.
- Developed PySpark-based pipelines using Spark DataFrame operations to load data into the EDL, with EMR for job execution and AWS S3 as the storage layer.
- Worked on the full spectrum of data engineering pipelines: data ingestion, data transformation, and data analysis/consumption.
- Developed AWS Lambda functions in Python and used Step Functions to orchestrate data pipelines.
- Automated infrastructure setup, including launching and terminating EMR clusters.
- Created Hive external tables on top of datasets loaded into S3 buckets and wrote Hive scripts to produce a series of aggregated datasets for downstream analysis.
- Built a real-time streaming pipeline using Kafka, Spark Streaming, and Redshift.
- Created Kafka producers using the Kafka Java Producer API to connect to an external REST live-stream application and publish messages to a Kafka topic.
- Implemented a continuous delivery pipeline with Maven, GitHub, and Jenkins.
- Documented operational problems in Jira, following established standards and procedures.
Environment: AWS S3, EMR, Lambda, Redshift, Athena, Glue, Spark, Scala, Python, Java, Hive, Kafka, PySpark, GitHub, Jira.
Confidential, Deerfield, IL
- Responsible for ingesting large volumes of user behavioral data and customer profile data into the analytics data store.
- Developed custom multi-threaded Java-based ingestion jobs as well as Sqoop jobs for ingesting data from FTP servers and data warehouses.
- Developed PySpark- and Scala-based Spark applications for data cleansing, event enrichment, data aggregation, de-normalization, and data preparation for consumption by machine learning and reporting teams.
- Troubleshot Spark applications to make them more fault tolerant.
- Fine-tuned Spark applications to improve the overall processing time of the pipelines.
- Wrote Kafka producers to stream data from external REST APIs to Kafka topics.
- Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase and MongoDB.
- Handled large datasets using Spark's in-memory capabilities, broadcast variables, efficient joins, and transformations.
- Worked extensively with Sqoop for importing data from Oracle.
- Designed and customized data models for the data warehouse, supporting data from multiple sources in real time.
- Worked with EMR clusters in the AWS cloud, along with S3, Redshift, and Snowflake.
- Wrote Glue jobs to migrate data from HDFS to the S3 data lake.
- Created Hive tables and loaded and analyzed data using Hive scripts.
- Implemented partitioning, dynamic partitions, and bucketing in Hive.
- Used Bamboo for continuous integration of applications.
- Used reporting tools such as Tableau, connected to Impala, to generate daily reports.
- Collaborated with the infrastructure, network, database, application and BA teams to ensure data quality and availability.
- Documented operational problems in JIRA, following established standards and procedures.
Environment: Hadoop, Spark, Scala, Python, Hive, HBase, MongoDB, Sqoop, Oozie, Kafka, Snowflake, Amazon EMR, Glue, YARN, JIRA, Amazon AWS, Shell Scripting, SBT, GitHub, Maven
- Wrote Spark applications in Scala for data cleansing, validation, transformation, and summarization according to requirements.
- Loaded data into Spark RDDs and performed in-memory computation to generate output per requirements.
- Developed data pipelines using Spark, Hive and Sqoop to ingest, transform and analyse operational data.
- Developed Spark and Hive jobs to summarize and transform data.
- Worked on performance tuning of Spark applications to reduce job execution times.
- Tuned Spark job performance by changing configuration properties and using broadcast variables.
- Streamed data in real time using Spark with Kafka; responsible for handling streaming data from web server console logs.
- Worked with file formats such as text, Avro, Parquet, JSON, XML, and flat files using MapReduce programs.
- Developed a daily process for incremental import of data from DB2 and Teradata into Hive tables using Sqoop.
- Wrote Pig scripts to generate transformations and performed ETL procedures on data in HDFS.
- Resolved performance issues in Hive and Pig scripts by understanding how joins, grouping, and aggregation translate to MapReduce jobs.
- Worked with cross-functional consulting teams within the data science and analytics team to design, develop, and execute solutions that derive business insights and solve clients' operational and strategic problems.
- Exported the analysed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Extensively used HiveQL to query data in Hive tables and loaded data into HBase tables.
- Worked extensively with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables; and optimized Hive queries.
- Involved in collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis.
- Assisted analytics team by writing Pig and Hive scripts to perform further detailed analysis of the data.
- Designed Oozie workflows for job scheduling and batch processing.
Environment: Java, Scala, Apache Spark, MySQL, CDH, IntelliJ IDEA, Hive, HDFS, YARN, MapReduce, Sqoop, Pig, Flume, Unix Shell Scripting, Python, Apache Kafka.