
Senior Data Engineer Resume


Plano, TX

SUMMARY

  • Around 6 years of IT experience across a variety of industries, including hands-on experience with Hadoop, HDFS, Hive, Spark (PySpark), Sqoop, Redshift, Lambda, Athena, and Snowflake.
  • Experience in building ETL pipelines, visualizations, and analytics-based quality solutions in-house using AWS and open-source frameworks.
  • Expertise in coding across multiple technologies, i.e., Python, Java, and Linux shell scripting.
  • Experience in the Big Data ecosystem and its various components such as PySpark, MapReduce, HDFS, Hive, Sqoop.
  • Experience using Spark to improve the performance of and optimize existing algorithms in Hadoop, working with SparkContext, PySpark, Spark SQL, DataFrames, Spark on YARN, the Resource Manager, memory tuning, data serialization, and memory management.
  • Well versed in Amazon Web Services (AWS) cloud services such as EC2, S3, Lambda, Redshift, Athena, and SNS, as well as Snowflake.
  • Expert in writing complex SQL queries and performing database analysis for good performance.
  • Experienced with SQL Server and Oracle databases and in writing queries against them.
  • Experienced in performing analytics on structured and unstructured data using Hive queries.
  • In-depth understanding of the MapReduce framework and the Spark execution model.
  • Good understanding of partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance (a minimal example follows this list).
  • Experience in AWS Redshift query tuning and performance optimization.
  • Experience in loading data files from HDFS to Hive for reporting.
  • Experience in writing Sqoop commands to import data from Relational databases to Hive.
  • Worked with the Agile (Scrum) team and participated in sprint planning.
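
A minimal example of the partitioned external Hive table design referred to above, created through Spark SQL with Hive support; the table, columns, and S3 location are hypothetical placeholders, not taken from any actual project.

    from pyspark.sql import SparkSession

    # Hive support lets Spark SQL create and query Hive managed and external tables.
    spark = (SparkSession.builder
             .appName("hive-table-design")
             .enableHiveSupport()
             .getOrCreate())

    # External table: the data stays at the given location and survives DROP TABLE.
    # Partitioning by event_date lets queries prune whole partitions at read time.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS customer_events (
            customer_id BIGINT,
            event_type  STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (event_date STRING)
        STORED AS PARQUET
        LOCATION 's3://example-bucket/warehouse/customer_events/'
    """)

    # A filter on the partition column scans only the matching partitions.
    daily = spark.sql(
        "SELECT event_type, COUNT(*) AS cnt FROM customer_events "
        "WHERE event_date = '2021-01-01' GROUP BY event_type")
    daily.show()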

TECHNICAL SKILLS

Big Data Technologies: Hadoop, HDFS, PySpark, Hive, Sqoop, AWS, EC2, S3, EMR, Lambda, Athena, Airflow, SQL, YARN

Database: Oracle 11g/12c, SQL Server, RDS

Datawarehouse Technologies: Redshift, Snowflake

NoSQL Databases: HBase, MongoDB, Cassandra

Programming Languages: Python, Java, C, JavaScript

Scripting Languages: HTML, XML, XSL, CSS, JSON, Shell Scripting

Operating System: Windows, Linux

Libraries: xml, Selenium, NumPy, Pandas, TensorFlow, Keras, scikit-learn, Matplotlib

PROFESSIONAL EXPERIENCE

Confidential, Plano, TX

Senior Data Engineer

Environment: AWS EMR, EC2, S3, IAM, PySpark, YARN, Airflow, Python & Shell scripting, GIT, Jenkins, Athena, and Snowflake

Responsibilities:

  • Worked on large data files using PySpark (Parquet format).
  • Stored data in S3 buckets and performed reads, transformations, and actions on S3 data using Spark DataFrames and the Spark SQL context in PySpark.
  • Developed a rejection framework to filter data according to business rules (a minimal sketch follows this list).
  • Developed an alert mechanism to notify stakeholders about data processing results in Slack.
  • Wrote Python scripts to parse XML and JSON data.
  • Wrote Python scripts to read data from APIs.
  • Created Hive external tables, applied partitioning and bucketing techniques, and performed joins on Hive tables.
  • Involved in designing Hive schemas, using performance tuning techniques such as partitioning and bucketing.
  • Built common scripts to load data into the final objects in Snowflake by creating external and managed tables, enabling product analysts and reporting developers to read and extract data from Snowflake.
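
A minimal sketch of the S3-to-PySpark rejection pattern described in this list; the bucket paths, column names, and rules are hypothetical and stand in for the actual business rules.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("rejection-framework").getOrCreate()

    # Read the Parquet files that land in S3 (placeholder path).
    orders = spark.read.parquet("s3://example-bucket/incoming/orders/")

    # Business rules expressed as one boolean column; the rules here are illustrative.
    is_valid = (F.col("order_id").isNotNull()
                & F.col("amount").isNotNull()
                & (F.col("amount") > 0)
                & F.col("order_date").isNotNull())

    accepted = orders.filter(is_valid)
    rejected = (orders.filter(~is_valid)
                .withColumn("reject_reason",
                            F.when(F.col("order_id").isNull(), "missing order_id")
                             .when(F.col("amount").isNull() | (F.col("amount") <= 0),
                                   "missing or non-positive amount")
                             .otherwise("missing order_date")))

    # Write both sides back to S3 so downstream jobs (and the Slack alerting step)
    # can pick up the counts and the rejected records.
    accepted.write.mode("overwrite").parquet("s3://example-bucket/curated/orders/")
    rejected.write.mode("overwrite").parquet("s3://example-bucket/rejected/orders/")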

Confidential, Irvine, CA

Data Engineer

Environment: AWS EMR, EC2, S3, IAM, PySpark, YARN, Airflow, Python, Redshift, Selenium, Git.

Responsibilities:

  • Experienced in working with Elastic MapReduce (EMR).
  • Worked on large data files using PySpark (Parquet format).
  • Created Redshift tables with different distribution styles to improve query performance.
  • Created an end-to-end ETL pipeline in PySpark for data processing for downstream users.
  • Used Spark DataFrame operations to perform the required data validations (a minimal sketch follows this list).
  • Extracted files from various databases through Sqoop and placed them in S3 for further processing.
  • Experienced with batch processing of data sources using PySpark.
  • Experienced in designing databases and data warehouses.
  • Experience in data modelling and table design.
  • Imported data from AWS S3 into DataFrames and performed transformations and actions on them.
  • Worked with the xml, lxml, pandas, selenium, nltk, and etree libraries in Python for data validation, reading, and scraping.
  • Performed Spark join optimizations; troubleshot, monitored, and wrote efficient code using PySpark.
  • Trained and led a team of associates and colleagues in understanding the functionality of the framework and upskilled them in Hadoop, Spark, and big data technologies.
  • Used the Selenium library to build a Python application that automates manual work.
  • Used GitHub as a common repository for code sharing.
  • Maintained timely delivery in every sprint.
  • Built dashboards in Power BI and deployed them in Power BI report server.
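
A minimal sketch of the kind of DataFrame-level validation referred to in this list, run before data is pushed downstream (for example, to Redshift); the dataset, key column, and checks are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("batch-validations").getOrCreate()

    # Placeholder staging path for the batch being validated.
    df = spark.read.parquet("s3://example-bucket/staging/transactions/")

    # Basic checks: non-empty batch, unique business key, no nulls in a required column.
    total_rows = df.count()
    duplicate_keys = (df.groupBy("transaction_id").count()
                        .filter(F.col("count") > 1)
                        .count())
    null_amounts = df.filter(F.col("amount").isNull()).count()

    if total_rows == 0 or duplicate_keys > 0 or null_amounts > 0:
        raise ValueError(
            f"Validation failed: rows={total_rows}, "
            f"duplicate_keys={duplicate_keys}, null_amounts={null_amounts}")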

Confidential, Bridgewater, NJ

Data Engineer

Environment: Hadoop, Map Reduce, HDFS, Hive, HBase, Sqoop, SQL, PySpark, AWS S3, AWS EMR, JIRA, Git.

Responsibilities:

  • Developed the project from scratch using Python and PySpark.
  • Organized the folder structure in HDFS per the requirements of the incoming source files in the staging area.
  • Developed a framework to load and transform large sets of unstructured data into Hive tables.
  • Applied Hive partitioning and bucketing and performed joins on Hive tables.
  • Worked with the data ingestion team to create data pipeline solutions for loading data from different sources (RDBMS, CSV, and XML) to UNIX and then to HDFS.
  • Developed complex logic for the framework, which automatically generates multiple reports based on metadata.
  • Audited report generation status using Hive-HBase tables.
  • Trained and led team associates and colleagues in understanding the functionality of the framework and upskilled them in Hadoop, Spark, and big data technologies.
  • Implemented SCD logic while loading Hive stage tables (a minimal sketch follows this list).
  • Analyzed requirements and prepared design documents.
  • Stored received files in the HDFS landing directory and processed them using PySpark.
  • Built a monitoring dashboard to continuously monitor batch jobs.
  • Played a major role in the continuous build, test, and integration process.
  • Actively participated in code reviews and meetings, and troubleshot and resolved technical issues.
  • Attended training related to Basics of Spark RDD, Spark DataFrame and Spark-SQL.
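
A minimal sketch of a Type 2-style SCD load of the sort mentioned above, written with PySpark DataFrames; the table names, the change-detection column, and the control columns (is_current, end_date) are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("scd-load")
             .enableHiveSupport()
             .getOrCreate())

    current = spark.table("dim_customer")   # existing dimension (placeholder name)
    stage = spark.table("stg_customer")     # freshly loaded stage table

    # Keys whose tracked attribute changed between the stage load and the dimension.
    changed_keys = (stage.alias("s")
                    .join(current.alias("c"), "customer_id")
                    .filter(F.col("s.address") != F.col("c.address"))
                    .select("customer_id"))

    # Close out the old version of each changed record...
    expired = (current.join(changed_keys, "customer_id")
               .withColumn("is_current", F.lit(False))
               .withColumn("end_date", F.current_date()))

    # ...and append the new version from the stage data.
    new_versions = (stage.join(changed_keys, "customer_id")
                    .withColumn("is_current", F.lit(True))
                    .withColumn("end_date", F.lit(None).cast("date")))

    unchanged = current.join(changed_keys, "customer_id", "left_anti")

    # Write to a new table rather than overwriting the table being read in the same job.
    result = (unchanged
              .unionByName(expired, allowMissingColumns=True)        # Spark 3.1+
              .unionByName(new_versions, allowMissingColumns=True))
    result.write.mode("overwrite").saveAsTable("dim_customer_updated")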

Confidential

Application Development Associate

Environment: SQL, SQL Server, SSMS, Tableau, Python, Selenium.

Responsibilities:

  • Built out the data and reporting infrastructure using Tableau and SQL to provide real-time insights regarding the performance of interfaces and business KPIs.
  • Worked as the liaison between developers and client executives to implement technical migrations of the interface and for the requirement gathering and analysis of new interfaces.
  • Analysed performances of existing interfaces and performed code enhancements, bug fixes, and SQL query modifications to improve performance of interfaces by 40%.
  • Automated web browser and Windows application activities using Python and Selenium to eliminate manual tasks performed by the team, saving about 6 hours of manual effort per day (a minimal sketch follows this list).
  • Spearheaded weekly Scrum meetings and client coordination calls, and functioned as Change Manager for several interface deployments into the production environment.
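
A minimal sketch of the kind of browser automation described above, using Python with Selenium in headless Chrome; the URL and element locators are hypothetical.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Headless Chrome so the job can run unattended on a schedule.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    try:
        driver.get("https://example.com/interface-status")  # placeholder URL
        # Wait for the status table instead of sleeping for a fixed time.
        table = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.ID, "status-table")))
        # Collect each row's text so it can be written to a report or checked for failures.
        rows = [row.text for row in table.find_elements(By.TAG_NAME, "tr")]
        print("\n".join(rows))
    finally:
        driver.quit()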
