Data Engineer Resume

Pittsburgh, PA

SUMMARY

  • 5+ years of experience in the data domain, covering not only data engineering but also data cleaning, data wrangling, data transformation, and data analysis.
  • Hands-on experience in configuring and using Hadoop ecosystem components like Spark, HDFS, MapReduce, Hive, Pig, Sqoop, Oozie, and Kafka.
  • Hands-on experience with AWS (Amazon Web Services): Elastic MapReduce (EMR), S3 storage, EC2 instances, Glue, Athena, and Redshift.
  • Experience in transferring data from RDBMS to HDFS and Hive tables using Sqoop.
  • Hands-on experience in exploratory data analysis using numerical summaries and relevant visualizations to guide feature engineering.
  • Hands-on experience developing PySpark scripts to manage data transformations and data delivery for batch and streaming processes.
  • Designed and developed streaming pipelines using Apache Kafka and PySpark from multiple sources such as APIs and data lakes to improve the performance of monitoring products (a minimal streaming sketch follows this summary).
  • Developed PySpark scripts in Databricks and implemented custom ETL jobs sourcing from S3 with Snowflake as the destination.
  • Experience in creating Hive tables, partitions, and buckets, and writing HiveQL queries to optimize performance (see the table-creation sketch below).
  • Responsible for developing data migration shell scripts in a Linux environment that load data from DB2 into Hive tables, performing the necessary transformations with big data technologies.
  • Hands-on experience across the big data application phases: data ingestion, data analytics, and data visualization.
  • Expertise in SQL, including window functions, CTEs, date/time manipulation, and conditional aggregations on extracted data (an example query follows this summary).
  • Experience in designing time-driven and data-driven automated workflows using Oozie.
  • Created shell scripts to build pipeline architectures involving multiple Python, SQL, Syncsort ETL, and ThoughtSpot jobs.
  • Developed Python scripts to consume data from APIs and transform it using packages such as pandas, numpy, and pyjson (sketched below).
  • Hands-on experience with machine learning algorithms such as Decision Trees, Support Vector Machines, K-Nearest Neighbors, Linear Regression, Logistic Regression, Random Forest, Naïve Bayes Classifier, and Ensemble Methods.
  • Experienced in identifying business problems and solving them with various machine learning algorithms.
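
The streaming work summarized above follows a common Kafka-to-S3 pattern. Below is a minimal PySpark Structured Streaming sketch of that pattern; the broker address, topic, payload schema, and S3 paths are hypothetical placeholders, not details from the original pipelines.

    # Requires the spark-sql-kafka connector on the classpath.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("monitoring-stream").getOrCreate()

    # Assumed JSON payload schema for the monitoring events.
    schema = StructType([
        StructField("metric", StringType()),
        StructField("value", DoubleType()),
        StructField("ts", StringType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
           .option("subscribe", "monitoring-events")          # placeholder topic
           .load())

    # Kafka delivers bytes; decode the value column and parse the JSON.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", schema).alias("e"))
              .select("e.*"))

    # Land the parsed events in S3 as Parquet, with checkpointing.
    query = (events.writeStream
             .format("parquet")
             .option("path", "s3a://example-bucket/monitoring/")
             .option("checkpointLocation", "s3a://example-bucket/checkpoints/")
             .start())
    query.awaitTermination()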
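
For the Hive partitioning and bucketing mentioned above, a minimal sketch of the DDL issued through Spark SQL follows; the database, table, and column names are illustrative only.

    from pyspark.sql import SparkSession

    # Hive support is needed so the table lands in the Hive metastore.
    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    spark.sql("""
        CREATE TABLE IF NOT EXISTS sales_db.orders (
            order_id    BIGINT,
            customer_id BIGINT,
            amount      DECIMAL(10, 2)
        )
        PARTITIONED BY (order_date STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
    """)

    # Partition pruning: filtering on the partition column limits the scan
    # to a single partition directory.
    spark.sql("""
        SELECT customer_id, SUM(amount) AS total
        FROM sales_db.orders
        WHERE order_date = '2021-01-01'
        GROUP BY customer_id
    """).show()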
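
The SQL bullet calls out window functions, CTEs, and conditional aggregation; here is an illustrative query combining all three, run through Spark SQL on a tiny in-memory DataFrame with made-up names and values.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    spark.createDataFrame(
        [(1, "2021-01-01", 120.0, "OK"),
         (1, "2021-01-01", 30.0, "RETURNED"),
         (2, "2021-01-01", 200.0, "OK")],
        ["customer_id", "order_date", "amount", "status"],
    ).createOrReplaceTempView("orders")

    spark.sql("""
        WITH daily AS (
            SELECT customer_id,
                   order_date,
                   SUM(amount) AS daily_total,
                   -- conditional aggregation: only returned orders
                   SUM(CASE WHEN status = 'RETURNED' THEN amount ELSE 0 END)
                       AS returned_total
            FROM orders
            GROUP BY customer_id, order_date
        )
        SELECT *,
               RANK() OVER (PARTITION BY order_date
                            ORDER BY daily_total DESC) AS day_rank
        FROM daily
    """).show()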
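
For the API-consumption bullet, a minimal sketch using requests and pandas is shown below; the endpoint, response shape, and output file are assumptions for illustration.

    import requests
    import pandas as pd

    # Hypothetical endpoint; real pipelines would add auth and paging.
    resp = requests.get("https://api.example.com/v1/records", timeout=30)
    resp.raise_for_status()

    # Flatten the (assumed) nested JSON payload into a tabular frame.
    df = pd.json_normalize(resp.json()["results"])
    df["loaded_at"] = pd.Timestamp.now(tz="UTC")
    df.to_csv("records.csv", index=False)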

TECHNICAL SKILLS

Big Data Ecosystem: PySpark, Spark, Kafka, Hive, Sqoop, Oozie, HDFS

Programming Languages: Python, R, Scala

Data Warehouses: Snowflake, Redshift, Hive

Cloud Services: Databricks, AWS S3, EC2, Lambda

Databases: Oracle 11g/10g/9i, MySQL, PostgreSQL

Version Control Tools: Git, GitHub

BI Tools: ThoughtSpot, Tableau, Performance Analytics

Scripting Languages: Shell Scripting

Operating Systems: Windows, Linux

ML Libraries: mlr, scikit-learn, Spark ML

Data Analysis Libraries: Tidyverse, ggplot2, dplyr, pandas, numpy

PROFESSIONAL EXPERIENCE

Data Engineer

Confidential, Pittsburgh, PA

Responsibilities:

  • Worked closely with business analysts and system engineers to translate business requirements into technical requirements.
  • Involved in requirements gathering, grooming, design, development, unit testing, and bug fixing.
  • Created tables in Hive and integrated data between Hive and Spark.
  • Developed Spark jobs to collect data from source systems and store it on HDFS to run analytics.
  • Created partitioned and bucketed Hive tables to improve query performance.
  • Created Hive tables using user-defined functions (UDFs).
  • Created BigQuery authorized views for row-level security and for exposing data to other teams (a sketch follows this list).
  • Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing analysis of BigQuery usage.
  • Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
  • Loaded and transformed large data sets of structured, semi-structured, and unstructured data in formats such as txt, zip, XML, and JSON (see the format-loading sketch below).
  • Designed and implemented data ingestion techniques for data coming from various source systems.
  • Designed and developed Spark code using Python, PySpark, and Spark SQL for high-speed data processing.
  • Involved in the complete end-to-end code deployment process in production.
  • Maintained fully automated CI/CD pipelines for code deployment (GitLab/Jenkins).
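
The authorized-view bullet above corresponds to a documented google-cloud-bigquery pattern: create the view, then grant it access to the private source dataset. A sketch follows, with the project, dataset, and query all as placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Create the view in a shareable dataset (names are placeholders).
    view = bigquery.Table("my-project.shared_views.orders_filtered")
    view.view_query = """
        SELECT order_id, amount
        FROM `my-project.private_data.orders`
        WHERE region = 'US'
    """
    view = client.create_table(view)

    # Authorize the view against the private dataset so readers of the
    # view never need direct access to the underlying tables.
    source = client.get_dataset("my-project.private_data")
    entries = list(source.access_entries)
    entries.append(
        bigquery.AccessEntry(None, "view", view.reference.to_api_repr())
    )
    source.access_entries = entries
    client.update_dataset(source, ["access_entries"])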
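
For the mixed-format loads listed above, a short Spark sketch follows; the paths are placeholders, and the XML read assumes the third-party spark-xml package, which is not part of core Spark.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    txt_df = spark.read.text("hdfs:///landing/raw/*.txt")
    json_df = spark.read.json("hdfs:///landing/raw/*.json")

    # Assumes com.databricks:spark-xml is on the classpath.
    xml_df = (spark.read.format("xml")
              .option("rowTag", "record")
              .load("hdfs:///landing/raw/*.xml"))

    # Gzip-compressed text decompresses transparently; true .zip archives
    # generally need custom handling before they can be read this way.
    gz_rdd = spark.sparkContext.textFile("hdfs:///landing/raw/*.gz")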

Environment: Hortonworks 2.5, HDFS 2.7.3, Spark 2.0.0, Hive 2.0.0, YARN

Data Engineer

Confidential

Responsibilities:

  • Developed and implemented data pipelines to process 100 million records using PySpark, AWS Lambda, and APIs, and stored the processed data in Hive.
  • Developed PySpark scripts in Databricks to collect, clean, and transform data from S3 to Hive.
  • Developed Lambda scripts to trigger jobs that process and store data in multiple S3 buckets (a handler sketch follows this list).
  • Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
  • Developed Sqoop scripts to import, export, and update data between HDFS and PostgreSQL.
  • Automated ETL jobs using Oozie to coordinate Python, Hive, and PySpark jobs on AWS EC2.
  • Developed Hive scripts to parse raw data, populate staging tables, and store the refined data in partitioned tables in AWS S3.
  • Developed Python scripts to mask sensitive fields in ServiceNow-extracted data, applying different conditions to different views (sketched below).
  • Analyzed and converted SQL scripts into PySpark SQL scripts for optimized, faster performance.
  • Used Kafka consumer APIs to consume data from a topic every 15 minutes and land it in S3 (see the consumer sketch after this list).
  • Created databases, tables, and views in Hive with different access conditions to make data available to different users.
  • Performed thorough quality checks to validate that the extracted data stayed in sync with the destination view schema.
  • Developed Tableau dashboards that help end customers understand the incident and change-order tickets raised for a particular tech organization.
  • Developed individual Tableau reports to track daily incident changes and resolution rates using snapshot data from Snowflake.
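
The Lambda bullet above describes S3-triggered processing; below is a minimal handler sketch, with the destination buckets and the copy logic as assumptions rather than the original jobs.

    import boto3

    s3 = boto3.client("s3")
    DESTINATIONS = ["example-curated", "example-archive"]  # hypothetical buckets

    def handler(event, context):
        """Triggered by an S3 PUT event; fans the object out to each bucket."""
        for record in event["Records"]:
            src_bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            for dest in DESTINATIONS:
                s3.copy_object(
                    Bucket=dest,
                    Key=key,
                    CopySource={"Bucket": src_bucket, "Key": key},
                )
        return {"copied": len(event["Records"]) * len(DESTINATIONS)}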
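
For the masking bullet, here is a small pandas sketch of per-view masking; the column names and view rules are invented for illustration, not the original conditions.

    import hashlib
    import pandas as pd

    # Hypothetical sensitive columns in the ServiceNow extract.
    SENSITIVE = {"caller_email", "caller_phone"}

    def mask(value: str) -> str:
        """Replace a sensitive value with a short, stable hash."""
        return hashlib.sha256(value.encode()).hexdigest()[:10]

    def mask_view(df: pd.DataFrame, view: str) -> pd.DataFrame:
        out = df.copy()
        # Example per-view condition: the public view masks everything,
        # the internal view masks only email addresses.
        cols = SENSITIVE if view == "public" else {"caller_email"}
        for col in cols & set(out.columns):
            out[col] = out[col].astype(str).map(mask)
        return out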
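
The 15-minute Kafka consumption could be scheduled (for example by cron or Oozie) around a short batch consumer like the sketch below, using kafka-python and boto3; the topic, consumer group, and bucket are placeholders.

    import json
    import time
    import boto3
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "incidents",                    # placeholder topic
        bootstrap_servers="broker:9092",
        group_id="s3-lander",
        auto_offset_reset="earliest",
        consumer_timeout_ms=10_000,     # stop iterating once the topic is drained
    )

    records = [msg.value.decode("utf-8") for msg in consumer]
    consumer.close()

    # Land the batch in S3 under a timestamped key.
    if records:
        boto3.client("s3").put_object(
            Bucket="example-landing",
            Key=f"kafka/incidents/{int(time.time())}.json",
            Body=json.dumps(records).encode("utf-8"),
        )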

Environment: PySpark, Hive, SQL, Kafka, PySpark SQL, Sqoop, Oozie, shell scripting, Linux, Python, Tableau, AWS EC2, S3, Lambda, ServiceNow.
