Big Data Developer Resume
Dearborn, MI
SUMMARY
- 6+ years of software development experience, including 4+ years of extensive experience in Data Engineering using Big Data/Spark technologies.
- Strong experience building data pipelines and deploying, monitoring, and maintaining them in production.
- Good experience with programming languages Scala, Java, and Python.
- Strong experience working with large datasets and designing highly scalable and optimized data modelling and data integration pipelines.
- Good understanding of distributed systems architecture and parallel computing paradigms.
- Strong experience working with the Spark processing framework for performing large-scale data transformations, cleansing, aggregations, etc.
- Good experience working with Spark Core, the Spark DataFrame API, Spark SQL, and the Spark Streaming APIs.
- Strong experience fine-tuning long-running Spark applications and troubleshooting common failures.
- Utilized various Spark features such as broadcast variables, accumulators, caching/persistence, and dynamic allocation (see the broadcast-join sketch after this list).
- Worked on real-time data integration using Kafka and Spark Streaming.
- Experience developing Kafka producers and Kafka consumers for streaming millions of events.
- Strong experience working with various Hadoop ecosystem components such as HDFS, Hive, HBase, Sqoop, Oozie, Impala, YARN, and Hue.
- Strong experience using Hive for creating centralized data warehouses and data modelling for efficient data access.
- Strong experience creating partitioned and bucketed tables in Hive to improve large join performance (see the partitioning/bucketing sketch after this list).
- Extensive experience utilizing AWS cloud services such as S3, EMR, Redshift, Athena, and the Glue metastore for managing and building data lakes natively on the cloud.
- Hands on experience in importing and exporting data into HDFS and Hive using Sqoop.
- Exposure to the NoSQL databases HBase and Cassandra.
- Extensive experience working with structured, semi-structured, and unstructured data by implementing complex MapReduce programs.
- Experience with design, development, and maintenance of ongoing metrics, reports, analyses, and dashboards using Tableau to drive key business decisions and communicate key concepts to readers.
- Experience using various Hadoop Distributions (Cloudera, Hortonworks, etc.) to fully implement and leverage new Hadoop features.
- Good exposure to other cloud providers GCP and Azure and utilized Azure Databricks for learning and experimentation.
- Strong experience as a Core Java developer building REST APIs and other integration applications.
- Strong experience working with databases such as Oracle, DB2, Teradata, and MySQL, and proficiency in writing complex SQL queries.
- Great team player and quick learner with effective communication, motivation, and organizational skills combined with attention to detail and business improvements.
- Experienced across the complete SDLC, including requirements gathering, design, development, testing, and production support.
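A minimal Scala sketch of the broadcast-variable and caching usage referenced in the summary above; the paths, table contents, and join key are placeholders, not details from any specific engagement.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast

    object BroadcastJoinSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("broadcast-join-sketch")
          .getOrCreate()

        // Large fact dataset; cache() keeps it in memory across multiple actions.
        val events = spark.read.parquet("s3://bucket/events/").cache()

        // Small dimension table; broadcast() hints Spark to ship it to every
        // executor instead of shuffling the large side of the join.
        val countries = spark.read.parquet("s3://bucket/dim_country/")
        val enriched = events.join(broadcast(countries), Seq("country_code"))

        enriched.write.mode("overwrite").parquet("s3://bucket/events_enriched/")
        spark.stop()
      }
    }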
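A minimal sketch of the partitioning and bucketing referenced above, shown through Spark's DataFrameWriter API (the HiveQL PARTITIONED BY / CLUSTERED BY DDL is the direct equivalent); the database, table, and column names are assumptions.

    import org.apache.spark.sql.SparkSession

    object PartitionedBucketedTableSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("partition-bucket-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Source path is a placeholder.
        val sales = spark.read.parquet("s3://bucket/sales_raw/")

        // Partitioning by load date prunes whole directories at read time;
        // bucketing by customer_id co-locates join keys to cut shuffle cost
        // on large joins.
        sales.write
          .partitionBy("load_date")
          .bucketBy(32, "customer_id")
          .sortBy("customer_id")
          .format("parquet")
          .mode("overwrite")
          .saveAsTable("analytics.sales_curated")

        spark.stop()
      }
    }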
TECHNICAL SKILLS:
Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, Spark, HBase, Cassandra, Parquet and Snappy.
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR, Azure Databricks, Azure Data Lake
Languages: Java, SQL, Scala, Python
NoSQL Databases: HBase and Cassandra
Cloud Data Warehouse: Snowflake
Methodology: Agile, waterfall
Development / Build Tools: Eclipse, Maven, IntelliJ, JUnit, and Log4j
Databases / DB Languages: MySQL, PL/SQL, PostgreSQL, and Oracle
PROFESSIONAL EXPERIENCE:
Big Data Developer
Confidential
Responsibilities:
- Developed Spark applications to implement various data cleansing/validation and processing activities on large-scale datasets ingested from traditional data warehouse systems.
- Worked with both batch and real-time streaming data sources.
- Developed custom Kafka producers to write streaming messages from external REST applications to Kafka topics (see the producer sketch after this section).
- Developed Spark Streaming applications to consume streaming JSON messages from Kafka topics (see the consumer sketch after this section).
- Developed data transformation jobs using Spark DataFrames to flatten JSON documents.
- Worked on improving the performance and optimization of existing Spark transformations.
- Used the Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model, which gets the data from Kafka in near-real time and persists it to HBase.
- Worked with and learned a great deal from AWS cloud services such as EMR, S3, RDS, Redshift, Athena, and Glue.
- Migrated existing on-premises data pipelines to AWS.
- Worked on automating provisioning of AWS EMR clusters.
- Used HiveQL to analyze partitioned and bucketed data, executing Hive queries on Parquet tables stored in Hive to perform data analysis that met the business specification logic.
- Experienced in using Avro, Parquet, ORC, and JSON file formats; developed UDFs in Hive.
- Worked with the Log4j framework for logging debug, info, and error data.
- Used Jenkins for Continuous integration.
- Generated various kinds of reports using Tableau based on client specification.
- Used Jira for bug tracking and Git to check-in and checkout code changes.
- Responsible for generating actionable insights from complex data to drive real business results for various application teams; worked extensively on Agile Methodology projects.
- Worked with Scrum team in delivering agreed user stories on time for every Sprint.
Environment: AWS, S3, EMR, Spark, Kafka, Hive, Athena, Glue, Redshift, Teradata, Tableau
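A minimal sketch of the kind of custom Kafka producer described above; the broker address, topic name, and JSON payload are placeholders rather than project specifics.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object JsonEventProducerSketch {
      def main(args: Array[String]): Unit = {
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092") // placeholder broker list
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("acks", "all") // wait for all in-sync replicas to acknowledge

        val producer = new KafkaProducer[String, String](props)
        try {
          // In the real pipeline the payload comes from an external REST
          // application; a literal JSON string stands in for it here.
          val payload = """{"userId": "u123", "event": "click", "ts": 1700000000}"""
          producer.send(new ProducerRecord[String, String]("clickstream-events", "u123", payload))
        } finally {
          producer.flush()
          producer.close()
        }
      }
    }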
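A minimal sketch of consuming JSON messages from Kafka and flattening them with Spark DataFrames, written against the Structured Streaming API as one reasonable reading of the bullets above; the schema, topic, brokers, and sink paths are assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{LongType, StringType, StructType}

    object KafkaJsonConsumerSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-json-consumer-sketch")
          .getOrCreate()

        // Expected shape of the incoming JSON messages (fields are placeholders).
        val schema = new StructType()
          .add("userId", StringType)
          .add("event", StringType)
          .add("details", new StructType()
            .add("page", StringType)
            .add("durationMs", LongType))

        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092") // placeholder brokers
          .option("subscribe", "clickstream-events")          // placeholder topic
          .load()

        // Parse the Kafka value as JSON and flatten the nested struct into columns.
        val flattened = raw
          .select(from_json(col("value").cast("string"), schema).as("j"))
          .select(
            col("j.userId").as("user_id"),
            col("j.event").as("event"),
            col("j.details.page").as("page"),
            col("j.details.durationMs").as("duration_ms"))

        flattened.writeStream
          .format("parquet")
          .option("path", "s3://bucket/clickstream_flat/")              // placeholder sink
          .option("checkpointLocation", "s3://bucket/chk/clickstream/")
          .start()
          .awaitTermination()
      }
    }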
Data Engineer
Confidential, Dearborn, MI
Responsibilities:
- Ingested gigabytes of clickstream data daily from external sources such as FTP servers and S3 buckets using customized, home-grown input adapters.
- Created Spark JDBC ingestion jobs to import/export data between RDBMS and the S3 data store (see the JDBC ingestion sketch after this section).
- Developed various Spark applications using Scala to perform cleansing, transformation, and enrichment of this clickstream data.
- Involved in data cleansing, event enrichment, data aggregation, denormalization, and data preparation needed for machine learning and reporting.
- Troubleshot Spark applications to improve error tolerance and reliability.
- Fine-tuned Spark applications/jobs to improve efficiency and overall processing time for the pipelines.
- Created Kafka producer applications to send live-stream JSON data into various Kafka topics.
- Developed Spark Streaming applications to consume the data from Kafka topics and insert the processed streams into Snowflake (see the Snowflake sink sketch after this section).
- Utilized Spark's in-memory capabilities to handle large datasets.
- Used broadcast variables, effective and efficient joins, transformations, and other Spark capabilities for data processing.
- Experienced in working with EMR clusters and S3 in the AWS cloud.
- Created Hive tables and loaded and analyzed data using Hive scripts; implemented partitioning, dynamic partitions, and bucketing in Hive.
- Involved in continuous integration of the application using Jenkins.
- Interacted with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
- Followed Agile Methodologies while working on the project.
Environment: AWS EMR, Spark, Snowflake, Hive, HDFS, Sqoop, Kafka, Scala, Java, S3, CloudWatch, AWS Simple Workflow (SWF)
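A minimal sketch of a Spark JDBC ingestion job landing RDBMS data in S3 as Parquet, as described above; the JDBC URL, credentials, partition bounds, and paths are placeholders.

    import org.apache.spark.sql.SparkSession

    object JdbcToS3IngestSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("jdbc-ingest-sketch")
          .getOrCreate()

        // Read the source table in parallel by splitting on a numeric column.
        val orders = spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://rdbms-host:3306/sales") // placeholder URL
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
          .option("partitionColumn", "order_id")
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "8")
          .load()

        // Land the data in S3 as date-partitioned Parquet.
        orders.write
          .mode("overwrite")
          .partitionBy("order_date")
          .parquet("s3://bucket/raw/orders/")

        spark.stop()
      }
    }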
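A minimal sketch of a streaming path from Kafka into Snowflake, assuming the Snowflake Spark connector (net.snowflake.spark.snowflake) is on the classpath; because Snowflake is not a native streaming sink, each micro-batch is written through foreachBatch. All connection values, schema fields, and table names are placeholders.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{StringType, StructType}

    object KafkaToSnowflakeSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-to-snowflake-sketch")
          .getOrCreate()

        // Placeholder Snowflake connection options.
        val sfOptions = Map(
          "sfURL"       -> "account.snowflakecomputing.com",
          "sfUser"      -> "etl_user",
          "sfPassword"  -> sys.env.getOrElse("SF_PASSWORD", ""),
          "sfDatabase"  -> "ANALYTICS",
          "sfSchema"    -> "PUBLIC",
          "sfWarehouse" -> "ETL_WH")

        val schema = new StructType()
          .add("userId", StringType)
          .add("event", StringType)

        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092") // placeholder brokers
          .option("subscribe", "clickstream-events")          // placeholder topic
          .load()
          .select(from_json(col("value").cast("string"), schema).as("j"))
          .select(col("j.userId").as("USER_ID"), col("j.event").as("EVENT"))

        // Write each micro-batch to Snowflake through the batch connector.
        events.writeStream
          .foreachBatch { (batch: DataFrame, batchId: Long) =>
            batch.write
              .format("net.snowflake.spark.snowflake")
              .options(sfOptions)
              .option("dbtable", "CLICKSTREAM_EVENTS")
              .mode("append")
              .save()
          }
          .option("checkpointLocation", "s3://bucket/chk/snowflake/")
          .start()
          .awaitTermination()
      }
    }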
Big Data/Hadoop Developer
Confidential
Responsibilities:
- Worked on installing Kafka on virtual machines and created topics for different users.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Responsible for importing real-time data from source systems into Kafka clusters.
- Worked with Spark performance improvement options such as broadcasting, caching, repartitioning, and modifying the Spark executor configurations for performance tuning.
- Implemented Spark RDD transformations to map business analysis logic and applied actions on top of the transformations.
- Involved in migrating MapReduce jobs into Spark jobs and used the Spark DataFrames API to load structured data into Spark clusters.
- Used the Spark API over Hadoop YARN as the execution engine for data analytics with Hive; after processing and analyzing the data in Spark SQL, submitted the results to the BI team for report generation.
- Performed SQL joins among Hive tables to get input for the Spark batch process.
- Worked with the data science team to build statistical models with Spark MLlib and PySpark.
- Involved in importing data from various sources into the Cassandra cluster using Sqoop.
- Worked on creating data models for Cassandra from the existing Oracle data model.
- Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra per the business requirements (see the Cassandra write sketch after this section).
- Used Sqoop import functionality to load historical data from RDBMS into HDFS.
- Designed workflows and coordinators in Oozie to automate and parallelize Hive jobs on a Hortonworks (HDP 2.2) Hadoop environment.
- Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.
- Developed Oozie workflow for scheduling & orchestrating the ETL process.
- Created Data Pipelines as per the business requirements and scheduled it using Oozie Coordinators.
- Wrote Python scripts to parse XML documents and load the data into the database.
- Worked extensively with Apache NiFi, building NiFi flows for the existing Oozie jobs to handle incremental loads, full loads, and semi-structured data, pull data from REST APIs into Hadoop, and run all the NiFi flows incrementally.
- Created NiFi flows to trigger Spark jobs and used PutEmail processors to get notifications if there were any failures.
- Worked extensively with importing metadata into Hive using Scala and migrated existing tables and applications to work on Hive and AWS cloud.
- Used version control tools like GitHub to share code among team members.
- Involved in daily SCRUM meetings to discuss the development/progress and was active in making scrum meetings more productive.
Environment: Hadoop, HDFS, Hive, Python, HBase, NiFi, Spark, MySQL, Oracle 12c, Linux, Hortonworks, Oozie, MapReduce, Sqoop, Shell Scripting, Apache Kafka, Scala, AWS.
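A minimal sketch of moving RDBMS data into a Cassandra table with Spark, assuming the DataStax spark-cassandra-connector is available on the cluster; the hosts, JDBC URL, keyspace, table, and column names are placeholders.

    import org.apache.spark.sql.SparkSession

    object RdbmsToCassandraSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("rdbms-to-cassandra-sketch")
          .config("spark.cassandra.connection.host", "cassandra-node1") // placeholder host
          .getOrCreate()

        // Pull the source table from the RDBMS over JDBC.
        val customers = spark.read
          .format("jdbc")
          .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL") // placeholder URL
          .option("dbtable", "CUSTOMERS")
          .option("user", "etl_user")
          .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
          .load()

        // Light transformation to match the Cassandra column family layout.
        val curated = customers.selectExpr(
          "CUSTOMER_ID as customer_id",
          "lower(EMAIL) as email",
          "REGION as region")

        curated.write
          .format("org.apache.spark.sql.cassandra")
          .option("keyspace", "crm")
          .option("table", "customers_by_id")
          .mode("append")
          .save()

        spark.stop()
      }
    }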
Hadoop Developer
Confidential
Responsibilities:
- Analyzed functional specifications based on project requirements.
- Ingested data from various data sources into Hadoop HDFS/Hive tables using Sqoop, Flume, and Kafka.
- Extended Hive core functionality by writing custom UDFs using Java (see the UDF sketch after this section).
- Developed Hive queries for user requirements.
- Worked on multiple POCs implementing a data lake for multiple data sources, ranging from Teamcenter, SAP, and Workday to machine logs.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Worked on the MS SQL Server PDW migration for the MSBI warehouse.
- Planned, scheduled, and implemented Oracle to MS SQL Server migrations for AMAT in-house applications and tools.
- Worked on the Solr search engine to index incident report data and developed dashboards in the Banana reporting tool.
- Integrated Tableau with the Hadoop data source to build dashboards providing various insights on the organization's sales.
- Worked on Spark in building BI reports using Tableau. Tableau was integrated with Spark using Spark-SQL.
- Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis.
- Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
- Developed workflows in LiveCompare to analyze SAP data and reporting.
- Worked on Java development, support, and tools support for in-house applications.
- Participated in daily scrum meetings and iterative development.
Environment: Hadoop, Hive, Sqoop, Spark, Kafka, Scala, MS SQL Server, Java.
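The custom Hive UDFs mentioned above were written in Java; the sketch below follows the same org.apache.hadoop.hive.ql.exec.UDF contract in Scala for consistency with the other examples here, using a made-up email-masking function and placeholder registration statements.

    import org.apache.hadoop.hive.ql.exec.UDF
    import org.apache.hadoop.io.Text

    // A Hive UDF that masks the local part of an email address, keeping only
    // the domain. Hive discovers the evaluate() method by reflection.
    class MaskEmail extends UDF {
      def evaluate(input: Text): Text = {
        if (input == null) {
          null
        } else {
          val value = input.toString
          val at = value.indexOf('@')
          if (at < 0) new Text("***") else new Text("***" + value.substring(at))
        }
      }
    }

    // Registering and using it from Hive (jar path and table are placeholders):
    //   ADD JAR /path/to/udfs.jar;
    //   CREATE TEMPORARY FUNCTION mask_email AS 'MaskEmail';
    //   SELECT mask_email(email) FROM customers LIMIT 10;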