Data Engineer Resume
Cincinnati, Ohio
SUMMARY
- Around 6 years of experience in the IT industry, including hands-on experience with the Big Data ecosystem.
- Expertise in coding in multiple technologies such as Python and Unix shell scripting.
- Capable of processing large sets of structured, semi-structured and unstructured data and supporting systems application architecture.
- Good understanding of Hadoop architecture and its components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, Secondary NameNode, and MapReduce concepts.
- Strong knowledge of the architecture and components of Spark; efficient in working with Spark Core, Spark SQL, and Spark Streaming.
- Experience in importing and exporting data using Sqoop between HDFS and relational database systems.
- Extensive knowledge in programming with Resilient Distributed Datasets (RDDs).
- Experience in loading data files from HDFS into Hive for reporting (a brief PySpark sketch follows this summary).
- Knowledge of Google Cloud Platform components such as BigQuery, Cloud Storage, Cloud SQL, and Compute Engine.
- Proficient in working with Jira, Bitbucket, Git, and Jenkins.
- Experienced in working with Amazon Web Services (AWS) offerings such as S3, EC2, EMR, DynamoDB, Glue, Athena, and CloudFormation.
- Capable of using AWS utilities such as EMR, S3, Lambda, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
- Experience in building ETL jobs and developing and managing data pipelines.
- Experienced in working with Hadoop distributions and managed big data platforms, predominantly Cloudera (CDH), Databricks, Azure HDInsight, GCP Dataflow, and Amazon EMR.
- Strong knowledge of streaming tools such as Kafka, NiFi, and Spark Streaming.
- Good knowledge of Unix shell scripting for automating deployments and other routine tasks.
- Experienced in using integrated development environments such as Eclipse, NetBeans, and IntelliJ.
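A minimal PySpark sketch of the HDFS-to-Hive loading mentioned above; the HDFS path, database, and table names are hypothetical placeholders, not taken from an actual project.

    # Sketch: load delimited files from HDFS into a Hive table for reporting.
    # All paths and table names are illustrative placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hdfs_to_hive_load")      # hypothetical job name
        .enableHiveSupport()               # needed to write Hive tables
        .getOrCreate()
    )

    # Read raw delimited files from HDFS (placeholder path).
    raw_df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("hdfs:///data/landing/orders/")
    )

    # Write the data into a Hive table for downstream reporting queries.
    raw_df.write.mode("overwrite").saveAsTable("reporting.orders")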
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, MapReduce, Apache Spark, Hive, YARN, Oozie, Zookeeper, Sqoop, Pig, HBase, AWS EMR
Cloud Services: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, Snowflake
Versioning Tools: Git
CI/CD tools: Jenkins, Ansible
Databases: Oracle, SQL Server, Hive, HBase
Programming Languages: C, Python, SQL, Java, PL/SQL
Operating Systems: Linux, Unix, Windows
Other tools: Databricks, Tableau, MS Visual Studio, Eclipse, IntelliJ
PROFESSIONAL EXPERIENCE
Confidential, CINCINNATI, OHIO
Data Engineer
Responsibilities:
- Evaluated, extracted, and transformed data for analytical purposes within a big data environment.
- In-depth understanding of Spark architecture, including Spark Core, Spark SQL, and DataFrames.
- Wrote transformations and actions on DataFrames and used Spark SQL on DataFrames to load Hive tables into Spark for faster data processing.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
- Used Hive for transformations, joins, filters, and some pre-aggregations after storing the data in HDFS.
- Developed Spark applications using Python (PySpark) to transform data according to business rules.
- Created various Hive external tables and staging tables and joined them as required. Implemented static partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables (a PySpark/Hive sketch follows this section).
- Sourced data from various systems into the Hadoop ecosystem using Sqoop.
- Tuned Hive to improve performance and resolved performance issues in Hive with an understanding of joins, grouping, and aggregation and how they translate into MapReduce jobs.
- Implemented workflows using the Apache Oozie framework to automate tasks and used ZooKeeper to coordinate cluster services.
- Used Enterprise Data Warehouse (EDW) architecture and data modeling concepts such as star and snowflake schemas in the project.
- Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive; involved in creating Hive tables, loading them with data, and writing Hive queries that invoke MapReduce jobs in the backend.
- Worked with AWS services such as S3 for storage and EC2 and EMR for processing, and used Athena and Glue to analyze the data in S3 (a boto3 sketch follows this section).
- Created databases and provided tables for downstream teams using Snowflake.
- Designed ETL workflows in Tableau, deployed data from various sources to HDFS, and generated reports using Tableau.
- Worked with the Scrum team to deliver agreed user stories on time every sprint.
Environment: Apache Hadoop, AWS, Hive, Snowflake, Kafka, Sqoop, Spark, Python, Cloudera, Tableau, HDFS, Oozie
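A sketch of the dynamic partitioning pattern described above, expressed through Spark SQL against Hive; the database, table, column names, and location are hypothetical placeholders.

    # Sketch: dynamic-partition loading into a partitioned Hive table from Spark SQL.
    # Database, table, and column names are illustrative placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive_partitioned_load")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Allow dynamic partitioning for the insert below.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # External, partitioned target table; a bucketed variant would add a
    # CLUSTERED BY (customer_id) INTO 32 BUCKETS clause to this DDL.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales.transactions (
            txn_id      STRING,
            customer_id STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (txn_date STRING)
        STORED AS PARQUET
        LOCATION '/data/warehouse/sales/transactions'
    """)

    # Dynamic-partition insert from a staging table; the partition column
    # must be the last column in the SELECT list.
    spark.sql("""
        INSERT OVERWRITE TABLE sales.transactions PARTITION (txn_date)
        SELECT txn_id, customer_id, amount, txn_date
        FROM sales.stg_transactions
    """)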
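A hedged boto3 sketch of querying S3 data through Athena, as mentioned above; the region, Glue database, query, and result bucket are placeholders.

    # Sketch: run an Athena query against data catalogued by Glue and stored in S3.
    # Region, bucket, database, and table names are illustrative placeholders.
    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    response = athena.start_query_execution(
        QueryString="SELECT txn_date, SUM(amount) FROM transactions GROUP BY txn_date",
        QueryExecutionContext={"Database": "sales_db"},                  # Glue catalog database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    query_id = response["QueryExecutionId"]

    # Poll until the query finishes, then fetch the first page of results.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows:
            print([col.get("VarCharValue") for col in row["Data"]])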
Confidential, CHICAGO, IL
Data Engineer
Responsibilities:
- Designed and developed data integration/engineering workflows on big data technologies and platforms (Hadoop, Spark, MapReduce, Hive, HBase).
- Involved in importing data into HDFS and Hive using Sqoop and involved in creating Hive tables, loading with data and writing Hive queries.
- Handled importing of data from various data sources, performed transformations using Spark, and loaded data into S3.
- Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcast variables, and effective and efficient joins and transformations during the ingestion process itself.
- Tuned Hive and Spark with partitioning and bucketing of Parquet data and with executor/driver memory settings (a tuning sketch follows this section).
- Processed S3 data, created external tables using Hive, and developed scripts to ingest and repair tables that can be reused across the project.
- Developed dataflows and processes for data processing using SQL (Spark SQL and DataFrames).
- Experience with AWS cloud services such as EMR, EC2, S3, and Athena.
- Developed Spark programs using Python APIs to compare the performance of Spark with Hive and SQL, and generated reports on a daily and monthly basis.
- Created an end-to-end ETL pipeline in PySpark to process data for business dashboards.
- Involved in the iteration planning process under the Agile Scrum methodology.
- Worked closely with the data science team to understand requirements clearly and create Hive tables on HDFS.
- Developed Spark scripts using Python as per the requirements.
- Resolved performance issues in Spark with an understanding of grouping, joins, and aggregation.
- Proficient in using different columnar file formats (ORC, Parquet).
- Good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
- Scheduled Spark/Scala jobs using Oozie workflows in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
Environment: Hadoop, MapReduce, HDFS, Yarn, Hive, Sqoop, Oozie, Spark, Python, AWS, Tableau, Linux, Shell Scripting.
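A minimal sketch of the broadcast-join and partitioned Parquet tuning pattern referenced above; the S3 paths, column names, and memory settings are hypothetical examples.

    # Sketch: broadcast join of a small dimension table and a partitioned
    # Parquet write during ingestion. Paths and columns are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = (
        SparkSession.builder
        .appName("ingest_with_broadcast_join")
        .config("spark.executor.memory", "8g")    # example executor memory setting
        .config("spark.driver.memory", "4g")      # example driver memory setting
        .getOrCreate()
    )

    events = spark.read.parquet("s3a://example-bucket/raw/events/")
    dims = spark.read.parquet("s3a://example-bucket/ref/event_types/")

    # Broadcast the small lookup table to avoid shuffling the large side.
    enriched = events.join(broadcast(dims), on="event_type_id", how="left")

    # Write partitioned Parquet so downstream Hive/Spark queries can prune partitions.
    (enriched.repartition("event_date")
             .write.mode("overwrite")
             .partitionBy("event_date")
             .parquet("s3a://example-bucket/curated/events/"))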
Confidential
Hadoop Developer
Responsibilities:
- Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and AWS.
- Experience in troubleshooting the issues and failed jobs in the Hadoop cluster.
- Performed SQL querying and performance tuning and created backup tables.
- Implemented partitioning, dynamic partitioning, and bucketing in Hive.
- Handled importing of data from various data sources, performed transformations using Spark and Hive in AWS EMR, and loaded data into S3.
- Experience analyzing data, solving problems, and troubleshooting to provide solutions; identified data requirements, performed data mapping, and tested data integration.
- Created Hive Queries to process large sets of structured, semi-structured and unstructured data and store in Managed and External tables.
- Used AWS EMR and EC2 for cloud big data processing.
- Experience in importing data from relational databases such as Oracle into HDFS and exporting data from HDFS to relational databases using Sqoop.
- Deep knowledge and strong deployment experience in the Hadoop and Big Data ecosystem - Hadoop, Spark, Hive, HDFS and MapReduce.
- Worked on data pre-processing and cleaning to perform feature engineering, and applied data imputation techniques for missing values in the dataset using Python (a brief sketch follows this section).
Environment: Hadoop, Spark, Hive, Sqoop, Oracle 11g, SQL Server, Python, Git, Cloudera Distribution, AWS.
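A small pandas sketch of the missing-value imputation described above; the file name and column handling are hypothetical examples of the general cleaning approach.

    # Sketch: simple missing-value imputation during data cleaning.
    # The file path and columns are illustrative placeholders.
    import pandas as pd

    df = pd.read_csv("customer_features.csv")   # hypothetical extracted dataset

    # Numeric columns: fill missing values with the column median.
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

    # Categorical columns: fill missing values with an explicit 'unknown' label.
    categorical_cols = df.select_dtypes(include="object").columns
    df[categorical_cols] = df[categorical_cols].fillna("unknown")

    print(df.isna().sum())   # verify no missing values remain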
Confidential
SQL Developer
Responsibilities:
- Built extract, transform, and load (ETL) processes to migrate data from multiple types of data sources to the destination database server.
- Extensively used Oracle PL/SQL to develop complex stored packages, functions, triggers, text queries, and text indexes to process raw data and prepare it for statistical analysis.
- Identified and resolved issues with the database.
- Involved in data replication and high-availability design scenarios with Oracle Streams. Developed UNIX shell scripts to automate repetitive database processes.
- Used principles of normalization to improve performance. Involved in ETL code using PL/SQL to meet requirements for extraction, transformation, cleansing, and loading of data from source to target data structures.
- Involved in continuous enhancements and fixing of production problems. Designed, implemented, and tuned interfaces and batch jobs using PL/SQL.
- Handled errors extensively using exception handling for ease of debugging and for displaying error messages in the application (an error-handling sketch follows this section).
Environment: Oracle 11g, PL/SQL, Shell scripting, Linux, Unix.
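A sketch of the error-handling pattern described above, shown in Python with cx_Oracle rather than in PL/SQL itself, to match the other sketches in this document; the connection details, table, and columns are hypothetical placeholders.

    # Sketch: a small load step with explicit error handling and rollback,
    # mirroring the PL/SQL EXCEPTION blocks used in the original packages.
    # Connection details, table, and columns are illustrative placeholders.
    import cx_Oracle

    rows = [("1001", "2013-05-01", 250.0), ("1002", "2013-05-01", 99.5)]

    connection = cx_Oracle.connect(user="etl_user", password="secret",
                                   dsn="dbhost/ORCLPDB1")
    cursor = connection.cursor()

    try:
        cursor.executemany(
            "INSERT INTO stg_payments (payment_id, payment_date, amount) "
            "VALUES (:1, TO_DATE(:2, 'YYYY-MM-DD'), :3)",
            rows,
        )
        connection.commit()
    except cx_Oracle.DatabaseError as exc:
        # Roll back the failed batch and surface a readable error message.
        connection.rollback()
        print(f"Load failed, batch rolled back: {exc}")
    finally:
        cursor.close()
        connection.close()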