Data Engineer Resume
Pittsburgh, PA
SUMMARY
- 5+ years of experience in the data domain, spanning data engineering, data cleaning, data wrangling, data transformation, and data analysis.
- Hands-on experience in configuring and using Hadoop ecosystem components such as Spark, HDFS, MapReduce, Hive, Pig, Sqoop, Oozie, and Kafka.
- Hands-on experience with AWS (Amazon Web Services), including Elastic MapReduce (EMR), S3 storage, EC2 instances, Glue, Athena, and Redshift.
- Experience in transferring data from RDBMS to HDFS and Hive tables using Sqoop.
- Hands-on experience in exploratory data analysis using numerical summaries and relevant visualizations to inform feature engineering.
- Hands-on experience developing PySpark scripts to manage data transformations and data delivery for batch and streaming processes.
- Designed and developed streaming pipelines using Apache Kafka and PySpark from multiple sources, such as APIs and data lakes, to optimize the performance of monitoring products.
- Developed PySpark scripts in Databricks and implemented custom ETL jobs sourcing from S3 with Snowflake as the destination.
- Experience in creating Hive tables, partitions, buckets and queries using HiveQL to optimize performance.
- Responsible for developing data migration shell scripts in a Linux environment that load data from DB2 into Hive tables, performing the necessary data transformations with big data technologies.
- Hands-on experience in various big data application phases, including data ingestion, data analytics, and data visualization.
- Expertise in SQL, including window functions, CTEs, date and time manipulation, and conditional aggregations on extracted data.
- Experience in designing time-driven and data-driven automated workflows using Oozie.
- Created shell scripts to build pipeline architectures involving multiple Python, SQL, Syncsort ETL, and ThoughtSpot jobs.
- Developed Python scripts to consume data from APIs and transform it using packages such as pandas, NumPy, and pyjson.
- Hands-on experience with machine learning algorithms such as Decision Trees, Support Vector Machines, K-Nearest Neighbors, Linear Regression, Logistic Regression, Random Forest, Naïve Bayes Classifiers, and Ensemble Methods.
- Experienced in identifying business problems and solving them using various machine learning algorithms.
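The SQL expertise listed above (window functions, CTEs, conditional aggregations) can be sketched with a small self-contained example; the table and column names below are hypothetical, and Python's built-in sqlite3 module stands in for the actual warehouse purely for illustration.

```python
import sqlite3

# Hypothetical orders table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
  ('a', '2023-01-01', 10.0),
  ('a', '2023-01-05', 20.0),
  ('b', '2023-01-02', 15.0);
""")

# A CTE wrapping a window function: running total of spend per customer.
query = """
WITH ranked AS (
    SELECT customer,
           order_date,
           SUM(amount) OVER (
               PARTITION BY customer ORDER BY order_date
           ) AS running_total
    FROM orders
)
SELECT customer, order_date, running_total
FROM ranked
ORDER BY customer, order_date;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The same pattern carries over to Hive or Redshift dialects with minor syntax changes.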
TECHNICAL SKILLS
Big Data Ecosystem: PySpark, Spark, Kafka, Hive, Sqoop, Oozie, HDFS
Programming Languages: Python, R, Scala
Data Warehouses: Snowflake, Redshift, Hive
Cloud Services: Databricks, AWS S3, EC2, Lambda
Databases: Oracle 11g/10g/9i, MySQL, PostgreSQL
Version Control Tools: Git, GitHub
BI Tools: ThoughtSpot, Tableau, Performance Analytics
Scripting Languages: Shell Scripting
Operating Systems: Windows, Linux
ML Libraries: mlr, scikit-learn, Spark ML
Data Analysis Libraries: Tidyverse, ggplot2, dplyr, pandas, numpy
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential, Pittsburgh, PA
Responsibilities:
- Worked closely with the business analysts and System Engineers to convert the Business Requirements into Technical Requirements.
- Involved in requirements gathering, grooming, design, development, unit testing, and bug fixing.
- Created tables in Hive and integrated data between Hive and Spark.
- Developed Spark jobs to collect data from source systems and store it on HDFS to run Analytics.
- Created Hive partitioned and bucketed tables to improve performance.
- Created Hive tables using user-defined functions (UDFs).
- Created BigQuery authorized views for row-level security and for exposing data to other teams.
- Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing analysis of BigQuery usage.
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Loaded and transformed large data sets of structured, semi-structured, and unstructured data in formats such as TXT, ZIP, XML, and JSON.
- Designed and implemented data ingestion techniques for data coming from various source systems.
- Designed and developed Spark code using Python, PySpark, and Spark SQL for high-speed data processing.
- Involved in complete end to end code deployment process in Production.
- Maintained fully automated CI/CD pipelines for code deployment (Gitlab/Jenkins).
Environment: Hortonworks 2.5, HDFS 2.7.3, Spark 2.0.0, Hive 2.0.0, YARN
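The semi-structured formats mentioned above (XML, JSON) can be normalized to a common record shape with the standard library; this is a minimal sketch with hypothetical field names, not the project's actual ingestion code.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative payloads; the "id"/"status" fields are assumptions.
json_payload = '{"id": 1, "status": "open"}'
xml_payload = "<ticket><id>2</id><status>closed</status></ticket>"

def from_json(raw):
    """Parse a JSON record into the common {id, status} shape."""
    rec = json.loads(raw)
    return {"id": int(rec["id"]), "status": rec["status"]}

def from_xml(raw):
    """Parse an XML record into the same common shape."""
    root = ET.fromstring(raw)
    return {"id": int(root.findtext("id")), "status": root.findtext("status")}

records = [from_json(json_payload), from_xml(xml_payload)]
print(records)
```

Normalizing early like this keeps downstream Spark jobs format-agnostic.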
Data Engineer
Confidential
Responsibilities:
- Developed and implemented data pipelines to process 100 million records using PySpark, AWS Lambda, and APIs, and stored the processed data in Hive.
- Developed PySpark scripts in Databricks to collect, clean, and transform data from S3 into Hive.
- Developed Lambda scripts to trigger jobs that process and store data in multiple S3 buckets.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Developed Sqoop scripts to import, export and update the data between HDFS and PostgreSQL.
- Automated ETL jobs using Oozie to coordinate Python, Hive, and PySpark jobs on AWS EC2.
- Developed Hive scripts to parse raw data, populate staging tables, and store the refined data in partitioned tables in AWS S3.
- Developed Python scripts to perform sensitive data masking on ServiceNow-extracted data, applying different conditions to different views.
- Analyzed and converted SQL scripts into PySpark SQL scripts for optimized, faster performance.
- Used Kafka consumer APIs to consume data from a topic every 15 minutes and land it in S3.
- Created databases, tables and views in Hive with different access conditions to make data available for different users.
- Performed thorough quality checks to validate that the extracted data is in sync with the schema of the respective destination view.
- Developed Tableau dashboards for end customers to understand the incident and change-order tickets raised for a particular tech organization.
- Developed individual Tableau reports to track daily incident changes and resolution rates using snapshot data from Snowflake.
Environment: PySpark, Hive, SQL, Kafka, PySpark SQL, Sqoop, Oozie, shell scripting, Linux, Python, Tableau, AWS EC2, S3, Lambda, ServiceNow.
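The sensitive-data-masking work described above can be sketched as a small pure-Python routine; the field names and regex patterns are illustrative assumptions, not the actual ServiceNow extract schema.

```python
import re

# Hypothetical patterns for PII in ServiceNow-style incident extracts.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def mask_record(record, sensitive_fields=("caller_email", "caller_phone")):
    """Return a copy of the record with sensitive fields redacted."""
    masked = dict(record)
    for field in sensitive_fields:
        if masked.get(field):
            value = EMAIL_RE.sub("***@***", str(masked[field]))
            value = PHONE_RE.sub("***-***-****", value)
            masked[field] = value
    return masked

# Illustrative incident record; values are made up.
incident = {
    "number": "INC0012345",
    "caller_email": "jane.doe@example.com",
    "caller_phone": "412-555-0199",
}
print(mask_record(incident))
```

Driving the rules from a field list keeps the per-view masking conditions configurable rather than hard-coded.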
