Data Engineer Resume
Cincinnati
PROFESSIONAL SUMMARY:
- 8+ years of experience in the development, implementation, and configuration of Hadoop ecosystem tools such as HDFS, YARN, Spark, Hive, Sqoop, and NiFi.
- Experienced with Spark architecture, including SparkSession and Spark SQL.
- Experience applying Spark optimization techniques such as coalesce, repartition, caching, broadcast variables, and accumulators to improve job performance (see the sketch after this list).
- Proficient in importing and exporting data between relational database systems and HDFS/S3 using Sqoop.
- Experience with AWS services such as EMR, EC2, S3, and Athena, and with Redis, for fast and efficient processing of big data.
- Experience using columnar file formats such as RCFile, ORC, and Parquet.
- Experience using partitioning and bucketing optimization techniques in Hive and designing both managed and external Hive tables to improve performance.
- Hands-on knowledge of cleansing and analyzing data on the Hadoop platform using Hive and on relational databases such as Oracle, MS SQL, and Teradata.
- Experience designing and building data lake solutions based on an organization's needs and capabilities.
- Good understanding of installing and maintaining Linux servers.
- Developed ETL jobs for extracting, cleaning, transforming, and loading data into the data warehouse.
- Implemented Spark SQL to read data from Hive and process it in a distributed, highly scalable manner.
- Excellent analytical and communication skills, with solid teamwork capabilities.
- Expertise in installing and maintaining packages required for Python and PySpark.
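A minimal PySpark sketch of the optimization techniques listed above (broadcast joins, repartitioning, caching, and coalescing); the dataset paths and names such as orders, country_dim, and customer_id are hypothetical placeholders, not taken from any specific project.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("optimization-sketch").getOrCreate()

# Hypothetical inputs: a large fact table and a small dimension table.
orders = spark.read.parquet("s3://bucket/orders/")
country_dim = spark.read.parquet("s3://bucket/country_dim/")

# Broadcast the small dimension table so the join avoids a full shuffle.
enriched = orders.join(broadcast(country_dim), on="country_code")

# Repartition by a high-cardinality key before a wide aggregation,
# and cache because the result is reused by more than one downstream step.
enriched = enriched.repartition(200, "customer_id").cache()

order_counts = enriched.groupBy("customer_id").count()

# Coalesce to a small number of partitions when writing a small output
# to avoid producing many tiny files.
order_counts.coalesce(8).write.mode("overwrite").parquet("s3://bucket/reports/order_counts/")
```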
TECHNICAL SKILLS:
Big data technologies: HDFS, MapReduce, YARN, Hive, Sqoop, Kafka, Spark, Spark SQL, NiFi
Databases: Oracle, MySQL, Teradata, MS SQL, Redshift, Snowflake, Cassandra
Programming languages: Python, Java, PySpark
IDE & Tools: Eclipse, IntelliJ, Oracle (PL/SQL), Putty
Hadoop Platform: Hortonworks, Cloudera, AWS EMR
Cloud Technologies: AWS, GCP, Azure
Scripting Languages: HTML, CSS, JSON, UNIX Shell, JavaScript
Schedulers: Airflow, Oozie
PROFESSIONAL EXPERIENCE:
Confidential, Cincinnati
Data Engineer
Responsibilities:
- Built on-premises data pipelines using Kafka and Spark Streaming for real-time data analysis.
- Built real-time data pipelines by developing Kafka producers and Spark Streaming consumer applications (see the sketch following this role).
- Experience running EMR clusters in AWS and working with S3.
- Processed S3 data, created external Hive tables, and developed reusable scripts to ingest data and repair tables across the project.
- Experience loading data from Hive to AWS S3 and Redshift using the Spark API.
- Involved in designing Hive schemas and applying performance tuning techniques such as partitioning and bucketing.
- Experience in building ETL pipelines using Apache NiFi.
- Deep understanding of various NiFi Processors.
- Developed Spark SQL scripts using PySpark to perform transformations and actions on DataFrames and Datasets for faster data processing.
- Extensively used Sqoop to import/export data between RDBMS and Hive tables, including incremental imports, and created Sqoop jobs based on the last saved value.
- Implemented several batch ingestion jobs using Sqoop to migrate historical data from various relational databases and files.
- Experience managing and reviewing log files.
- Responsible for designing and building data flows using Apache Airflow to orchestrate and schedule workflows to capture data from different sources.
- Involved in iteration planning under the Agile Scrum methodology.
- Used JIRA for task/bug tracking.
Environment: Spark, Spark Streaming, Kafka, AWS, Sqoop, Hive, MySQL, Oracle, Snowflake, PySpark, Zookeeper.
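A minimal sketch of the kind of Kafka-to-Spark streaming consumer described in this role, written with Structured Streaming in PySpark; the broker addresses, the events topic, the JSON schema, and the S3 paths are hypothetical placeholders, and the spark-sql-kafka connector package is assumed to be available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-consumer-sketch").getOrCreate()

# Hypothetical schema for the JSON payloads published to the topic.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Read from Kafka as a streaming DataFrame.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # placeholder brokers
       .option("subscribe", "events")                                   # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Parse the Kafka value bytes into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Write the parsed stream to S3 as Parquet, with checkpointing for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://bucket/events/")                          # placeholder output path
         .option("checkpointLocation", "s3://bucket/checkpoints/events/")
         .outputMode("append")
         .start())

query.awaitTermination()
```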
Confidential, Boston
Data Engineer
Responsibilities:
- Experience running EMR clusters in AWS and working with S3.
- Developed ETL frameworks for data using PySpark.
- Performed fine-tuning of Spark applications/jobs to improve the efficiency and overall processing time for the pipelines.
- Worked with Spark to create structured data from pools of unstructured data.
- Experience using file formats such as Avro, Parquet, and ORC.
- Extensively used Sqoop to import/export data between RDBMS and Hive tables, including incremental imports, and created Sqoop jobs based on the last saved value.
- Involved in designing Hive schemas and applying performance tuning techniques such as partitioning and bucketing.
- Implemented Hive generic UDFs to incorporate business logic into Hive queries.
- Performed data pre-processing and cleaning for feature engineering, and applied imputation techniques in Python to handle missing values in the dataset (see the sketch following this role).
- Implemented several batch ingestion jobs using Sqoop to migrate historical data from various relational databases and files.
- Ingested data from various sources such as DB2 into Hive using Sqoop scripts.
- Built a proof of concept for ETL pipelines using Apache NiFi.
- Deep understanding of various NiFi Processors.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Experienced in end-to-end development activities in an Agile model using JIRA and Git.
Environment: Spark, Spark SQL, AWS, Sqoop, Hive, MySQL, Oracle, PySpark, Oozie, NiFi.
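A minimal PySpark sketch of the kind of missing-value imputation mentioned in this role; the input path and the age, segment, and customer_id columns are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean

spark = SparkSession.builder.appName("imputation-sketch").getOrCreate()

# Hypothetical input with numeric and categorical columns containing nulls.
df = spark.read.parquet("s3://bucket/raw/customers/")

# Impute a numeric column with its mean.
age_mean = df.select(mean(col("age"))).first()[0]
df = df.fillna({"age": age_mean})

# Impute a categorical column with a sentinel value.
df = df.fillna({"segment": "unknown"})

# Drop rows where a mandatory key is still missing.
df = df.dropna(subset=["customer_id"])

df.write.mode("overwrite").parquet("s3://bucket/clean/customers/")
```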
Confidential
Hadoop Developer
Responsibilities:
- Responsible for managing data from RDBMS source systems such as Oracle, MS SQL, and Teradata, and for maintaining structured data in HDFS in file formats such as Parquet and Avro for optimized storage.
- Improved data processing and storage throughput by using the Cloudera Hadoop framework for distributed computing across a cluster of up to seventeen nodes.
- Analyzed and transformed stored data by writing Spark jobs using window functions such as rank, row_number, lead, and lag to enable downstream reporting and analytics based on business requirements (see the sketch following this role).
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Optimized Hive queries using file formats such as Parquet, JSON, and Avro.
- Worked with HDFS file formats such as Avro and ORC and compression codecs such as Snappy.
- Performed performance tuning and troubleshooting of MapReduce jobs by analyzing and reviewing Hadoop log files.
- Used PySpark for extracting, cleaning, transforming and loading data into Hive data warehouse.
- Experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables.
- Experienced in handling large datasets during ingestion using partitioning, Spark in-memory capabilities, broadcast variables, and effective, efficient joins and transformations.
- Experience writing Sqoop jobs to move data from various RDBMS into HDFS and vice versa.
- Worked with Oozie workflow engine to schedule time-based jobs to perform multiple actions.
- Developed ETL pipelines to provide data to business intelligence teams for building visualizations.
- Involved in unit testing, interface testing, system testing, and user acceptance testing of the workflow tool.
- Involved in Agile methodologies, daily scrum meetings, and sprint planning.
Environment: Oracle, MySQL, Teradata, Cloudera, Hive, Spark, Sqoop, PySpark, Oozie.
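A minimal PySpark sketch of the window-function transformations described in this role (rank, row_number, lead, lag); the transactions table, its columns, and the output path are hypothetical placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, rank, row_number, lead, lag
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("window-functions-sketch").getOrCreate()

# Hypothetical transactions table, windowed per customer and ordered by date.
txns = spark.read.parquet("s3://bucket/transactions/")
w = Window.partitionBy("customer_id").orderBy(col("txn_date"))

result = (txns
          .withColumn("txn_rank", rank().over(w))                # rank within each customer
          .withColumn("txn_seq", row_number().over(w))           # sequence number within each customer
          .withColumn("next_amount", lead("amount", 1).over(w))  # following transaction's amount
          .withColumn("prev_amount", lag("amount", 1).over(w)))  # preceding transaction's amount

result.write.mode("overwrite").parquet("s3://bucket/reporting/transactions_enriched/")
```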