Big Data / Spark Engineer Resume
San Francisco, CA
SUMMARY
- Around 7+ years of IT experience in project development, implementation, deployment, and maintenance using the Big Data Hadoop ecosystem and cloud technologies across various sectors, with multi-language programming expertise in Scala, Java, and Python.
- 5+ years of Hadoop developer experience in designing and implementing complete end-to-end Hadoop infrastructure using HDFS, MapReduce, HBase, Spark, YARN, Zookeeper, Pig, Hive, Sqoop, Oozie, Kudu, Flume, Kafka, and Kafka Connect.
- In-depth understanding of Hadoop architecture and its various components such as Job Tracker, Task Tracker, Name Node, Data Node, Resource Manager, and MapReduce concepts.
- Experience in importing and exporting data using Sqoop from Relational Database Systems to HDFS and vice-versa.
- Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Experience with different file formats like Avro, Parquet, ORC, JSON, and XML.
- Hands-on experience with NoSQL databases like HBase.
- Experience with Oozie workflows and the Oozie scheduler for scheduling jobs, as well as with Unix shell scripting.
- Experience in developing Spark applications using Scala and Python.
- Hands-on experience with different Spark APIs like Core, SQL, Streaming, and Structured Streaming using Scala and Python.
- Experience working with different file formats like Parquet, Avro, ORC, JSON, and XML using Spark APIs.
- Experience in developing generic frameworks for data ingestion, data processing, data cleansing and analytic frameworks using Spark.
- Experience in performance tuning of spark applications from various aspects.
- Experience with serializing and deserializing different file formats using Spark.
- Experience using accumulator variables, broadcast variables, and RDD caching for Spark Streaming (see the sketch after this summary).
- Experience in consuming data from various data sources like Kafka, S3, and SFTP servers, and storing it in various data stores like HBase, Kudu, Hive, Athena, and DynamoDB.
- Experience in developing data applications using AWS services like S3, EC2, EMR, Athena, Redshift Spectrum, Redshift, and DynamoDB.
- Experience working with serverless AWS services like Lambda, Glue, Data Pipeline, and Step Functions.
- Experience with other AWS services like CloudWatch, CloudTrail, CloudFormation, and SNS.
- Experience with the CI/CD process using Git, Jenkins, and other repository managers.
- Experience in writing unit test cases using ScalaTest for the code developed.
- Experience applying Scrum, Waterfall, and Agile methodologies; skilled in developing processes that facilitate continual progress and team achievement.
- Worked extensively with data migration, data cleansing, data profiling, and ETL processes for data warehouses.
- Experience in writing SQL queries, data integration, and performance tuning.
- Experience in deploying applications using Docker and Kubernetes.
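A minimal sketch of the Spark usage summarized above, assuming a hypothetical lookup map and S3 paths; it shows reading Parquet with the DataFrame API and enriching rows against a broadcast variable:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object EnrichmentSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("enrichment-sketch").getOrCreate()

    // Small reference data kept on the driver and shipped once to every executor.
    val countryCodes: Map[String, String] = Map("US" -> "United States", "IN" -> "India")
    val codesBc = spark.sparkContext.broadcast(countryCodes)

    // Hypothetical input path; Parquet is one of the formats listed above.
    val events = spark.read.parquet("s3a://example-bucket/raw/events/")

    // Look up the broadcast map from a UDF so the DataFrame API can use it.
    val toCountry = udf((code: String) => codesBc.value.getOrElse(code, "UNKNOWN"))

    val enriched = events.withColumn("country_name", toCountry(col("country_code")))
    enriched.write.mode("overwrite").parquet("s3a://example-bucket/curated/events/")

    spark.stop()
  }
}
```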
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop, MapReduce, YARN, Hive, Pig, Sqoop, HBase, Kafka, Oozie, Impala, Spark, Spark SQL (DataFrames and Datasets).
Cloud (AWS): S3, Glue, EMR, Step Functions, Data Pipeline, Athena, Redshift
Databases: Oracle, Teradata, Netezza and SQL Server
Languages: Java, Scala, Python, SQL, Shell Scripting.
Operating systems: UNIX/Linux, Windows and Mac OS.
Tools: Maven, SBT, Jenkins, IntelliJ, Eclipse, Git.
PROFESSIONAL EXPERIENCE
Confidential - San Francisco, CA
Big Data / Spark Engineer
Responsibilities:
- Designed and developed processes to convert existing ETL pipelines into Hadoop-based systems.
- Designed and developed ADF pipelines to load incremental data from Data Lake Gen1.
- Created JSON files and Databricks notebooks as inputs to ADF pipelines.
- Developed a one-time notebook to load historical data.
- Developed Spark SQL code to transform data from Parquet to Delta format (see the sketch after this section).
- Extensively worked on Spark with Scala to prepare data for building a prediction model consumed by the Data Science team.
- Responsible for developing Spark scripts to check for data quality issues in DataFrames.
- Developed a common framework to prepare data to feed the machine learning models.
- Designed and performance-tuned Hive tables and queries at the storage, file format, and query levels.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
Environment: Spark (Scala), Hive, Azure (ADF, Data Lake Gen1, Databricks), Snowflake
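A minimal sketch of the Parquet-to-Delta conversion mentioned above, assuming Databricks (or a cluster with the Delta Lake library available) and hypothetical storage paths:

```scala
import org.apache.spark.sql.SparkSession

object ParquetToDeltaSketch {
  def main(args: Array[String]): Unit = {
    // On Databricks the Delta format is available out of the box;
    // elsewhere it requires the delta-spark (formerly delta-core) dependency.
    val spark = SparkSession.builder().appName("parquet-to-delta").getOrCreate()

    // Hypothetical locations in the data lake / DBFS mount.
    val source = "/mnt/datalake/curated/orders_parquet"
    val target = "/mnt/datalake/delta/orders"

    spark.read.parquet(source)
      .write
      .format("delta")
      .mode("overwrite")
      .save(target)

    // Optionally expose the Delta files as a table for downstream SQL.
    spark.sql(s"CREATE TABLE IF NOT EXISTS orders_delta USING DELTA LOCATION '$target'")
  }
}
```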
Confidential - Virginia Beach, VA
Sr Bigdata Engineer (Spark/AWS)
Responsibilities:
- Coordinated with business and product teams to convert business requirements into technical requirements.
- Heavily involved in migrating on-premises Hadoop systems to the AWS cloud platform.
- Developed serverless data pipelines using S3, Lambda, Glue, and DynamoDB.
- Developed data pipelines for aggregated objects in DynamoDB using Kafka Connect, Lambda, and S3.
- Developed a configuration-driven generic framework to process data from the Stage layer to the Raw layer from different data sources.
- Worked extensively on Spark using Scala and deployed all applications using Data Pipeline and Step Functions.
- Automated workflows using manifest files to trigger Lambda jobs from S3 and feed data into Athena and Redshift Spectrum.
- Developed data pipelines consuming from microservices and then landing data into S3 using Confluent Kafka Connect.
- Integrated the Glue Data Catalog and Glue ETL using Scala and Python, writing to end datastores like Athena and DynamoDB.
- Imported tables from different RDBMS systems to HDFS using Sqoop, and used Kafka with Spark Streaming for near-real-time streaming.
- Experience deploying applications to Docker and Kubernetes using different tools and technologies.
- Worked with the Python boto3 library to interact with different AWS services.
- Developed a Spark framework to convert Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL, and Datasets, and performed aggregations on data stored in memory.
- Designed and implemented a generic Spark application for CDC using window functions, joins, and partitions to process complex data sets per business requirements (see the sketch after this section).
- Managed and scheduled jobs by defining Hive, Spark, and Sqoop actions on a Hadoop cluster using Oozie workflows and the Oozie Coordinator engine.
- Responsible for data extraction and data ingestion from different data sources into HDFS by creating ETL pipelines.
- Performed importing and exporting of data into HDFS and Hive using Sqoop.
- Resolved Spark and YARN resource management issues, including shuffle issues, out-of-memory and heap space errors, and schema compatibility problems.
- Monitored and troubleshot application performance, took corrective action in case of failures, and evaluated possible enhancements to meet SLAs.
Environment: Hadoop, Spark, Scala, Python, Java, Spark (Core, SQL, Streaming), AWS (Data Services), Oracle, SQL Server, Kafka, Kafka Connect, Bamboo, Git, Docker, Kubernetes, Oozie, and Linux.
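A minimal sketch of the CDC pattern referenced above (keep the latest record per key via a window function); the input path and column names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object CdcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cdc-sketch").getOrCreate()

    // Hypothetical incremental extract with one row per change event.
    val changes = spark.read.parquet("s3a://example-bucket/raw/customer_changes/")

    // Keep only the most recent change per business key.
    val latestPerKey = Window.partitionBy(col("customer_id")).orderBy(col("updated_at").desc)

    val current = changes
      .withColumn("rn", row_number().over(latestPerKey))
      .filter(col("rn") === 1)
      .drop("rn")

    // Persist the reconciled "current state" view for downstream consumers.
    current.write.mode("overwrite").parquet("s3a://example-bucket/curated/customer_current/")
  }
}
```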
Confidential - South Portland, ME
Hadoop Developer
Responsibilities:
- Responsible for designing, implementing and testing data pipelines on the cloud using AWS Services.
- Extensively used Spark to read data from S3, preprocess it, and store it back in S3 for creating tables using Athena (see the sketch after this section).
- Designed and developed a Spark application to read JSON data from REST APIs.
- Extensively used EMR, S3, Data Pipeline, and Step Functions for building data pipelines.
- Created partitioned tables in Athena, designed a data warehouse using Athena external tables, and created Athena queries for analysis.
- Responsible for designing and implementing the data pipeline using Big Data tools including Spark and Sqoop.
- Worked with different source and destination file formats like Parquet and ORC.
- Experience in performance tuning long-running Spark applications by looking into the Spark UI.
- Implemented Spark best practices to efficiently process data and meet ETAs by utilizing features like partitioning, resource tuning, memory management, and checkpointing.
- Used version control tools such as GitHub to pull changes from upstream to a local branch, check for conflicts, clean up, and review the code of other developers.
- Worked on POCs exploring cutting-edge open source Big Data tools to make existing processes more efficient.
Environment: Athena, EMR, S3, Data Pipeline, Step Functions, Sqoop, Spark, Scala, Linux, SQL Server, Data Warehouse and Tableau.
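A minimal sketch of the S3-to-Athena flow above: read raw JSON from S3, apply light preprocessing, then write partitioned Parquet that an Athena external table can point at (bucket names and columns are hypothetical):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

object S3ToAthenaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-to-athena-sketch").getOrCreate()

    // Raw JSON landed in S3 (hypothetical bucket/prefix).
    val raw = spark.read.json("s3a://example-bucket/raw/clickstream/")

    // Light preprocessing: drop malformed rows and derive a partition column.
    val cleaned = raw
      .filter(col("event_id").isNotNull)
      .withColumn("event_date", to_date(col("event_ts")))

    // Partitioned Parquet is a convenient layout for Athena external tables.
    cleaned.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("s3a://example-bucket/curated/clickstream/")
  }
}
```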
Confidential - Phoenix, AZ.
Hadoop Developer
Responsibilities:
- Responsible for developing solutions by working closely with Solution Architects and business teams.
- Document technical and business requirements and develop architectural diagrams.
- Developed code for importing and exporting data into HDFS, Hive, and Impala using Sqoop.
- Extensive experience working with Hive and Impala for designing tables.
- Worked on the data science project life cycle and was actively involved in the data acquisition, data cleansing, and data preparation phases.
- Developed an ingestion module to ingest data into HDFS from heterogeneous data sources.
- Built distributed in-memory applications using Spark and Spark SQL to run analytics efficiently on huge datasets.
- Developed Hive and Impala queries for data transformation and data analysis.
- Developed Oozie workflows and sub-workflows to orchestrate the Sqoop scripts and Hive queries; the Oozie workflows are scheduled through Autosys.
Environment: Hive, Sqoop, Spark, Python, Scala, Linux, Impala, SQL Server
Confidential -Fremont, CA
Research Graduate Assistant
Responsibilities:
- Handle the installation and configuration of a Hadoop cluster.
- Responsible for analyzing and cleansing raw data by performing Hive queries.
- Created Hive tables, loaded data, and wrote Hive queries that run within the MapReduce framework.
- Extracted the data from RDBMS into HDFS using Sqoop and vice versa.
- Implemented Partitioning, Dynamic Partitioning, and Bucketing in Hive (see the sketch below).
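A minimal sketch of the dynamic-partition pattern above, issued through Spark SQL so it stays in Scala; the table and column names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object HivePartitioningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-partitioning-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Allow partition values to be derived dynamically from the data itself.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // Partitioned target table (hypothetical schema). A bucketed variant would
    // add CLUSTERED BY (order_id) INTO 8 BUCKETS to the DDL.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS sales_partitioned (
        |  order_id BIGINT,
        |  amount   DOUBLE
        |) PARTITIONED BY (sale_date STRING)
        |STORED AS ORC""".stripMargin)

    // Dynamic-partition insert: the partition value comes from the last selected column.
    spark.sql(
      """INSERT OVERWRITE TABLE sales_partitioned PARTITION (sale_date)
        |SELECT order_id, amount, sale_date FROM sales_staging""".stripMargin)
  }
}
```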
Confidential
Software Engineer
Responsibilities:
- Developed an e-Recruitment website in Java consisting of many modules.
- Designed and reviewed the test scenarios and scripts for given functional requirements.
- Implemented Services using Core Java.
- Involved in the development of classes using Java.
- Designed and built the user interface using JavaScript and employed collection libraries.
- Designed and involved in preparing activity diagrams, use case diagrams, sequence diagrams as per the business requirement.
- Used JavaScript for client-side validation.
- Wrote test cases in JUnit for unit testing of classes.
- Involved in creating templates and screens in HTML and JavaScript.
Environment: Java, JSP, Servlets, JDBC, JavaScript, MySQL.