
Senior Data Engineer Resume

Plano, TX

SUMMARY

  • Around 6 years of IT experience across a variety of industries, including hands-on experience in Hadoop, HDFS, Hive, Spark (PySpark), Sqoop, Redshift, Lambda, Athena, and Snowflake.
  • Experience in building ETL pipelines, visualizations, and analytics-based quality solutions in-house using AWS and open-source frameworks.
  • Expertise in coding in multiple technologies, i.e. Python, Java, and Linux shell scripting.
  • Experience in the Big Data ecosystem and its various components such as PySpark, MapReduce, HDFS, Hive, and Sqoop.
  • Experience using Spark to improve the performance and optimization of existing algorithms in Hadoop, working with SparkContext, PySpark, Spark SQL, DataFrames, Spark on YARN, Resource Manager, memory tuning, data serialization, and memory management.
  • Well versed in Amazon Web Services (AWS) cloud services such as EC2, S3, Lambda, Redshift, Athena, SNS, and Snowflake.
  • Expert in writing complex SQL queries and performing database analysis for good performance.
  • Experience with SQL Server and Oracle databases and in writing queries against them.
  • Experienced in performing analytics on structured and unstructured data using Hive queries.
  • In-depth understanding of the MapReduce framework and the Spark execution model.
  • Good understanding of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance (see the sketch after this list).
  • Experience in AWS Redshift query tuning and performance optimization.
  • Experience in loading data files from HDFS to Hive for reporting.
  • Experience in writing Sqoop commands to import data from Relational databases to Hive.
  • Worked with the Agile (Scrum) team and participated in sprint planning.
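
The Hive partitioning and table-design work above can be illustrated with a short PySpark sketch. The database, table, columns, and HDFS location below are assumed placeholders, not actual project objects; the sketch only shows how an external, partitioned Hive table is registered and then queried so partition pruning can apply.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning-sketch")   # hypothetical app name
             .enableHiveSupport()
             .getOrCreate())

    # External table: Hive owns only the metadata; the data files stay where they landed.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.orders (
            order_id BIGINT,
            amount   DOUBLE
        )
        PARTITIONED BY (order_date STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///data/staging/orders'      -- hypothetical location
    """)

    # Register partitions already present on disk, then filter on the partition column
    # so only the matching directories are read.
    spark.sql("MSCK REPAIR TABLE sales_db.orders")
    spark.sql("SELECT * FROM sales_db.orders WHERE order_date >= '2020-01-01'").show()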

TECHNICAL SKILLS

Big Data Technologies: Hadoop, HDFS, PySpark, Hive, Sqoop, AWS, EC2, S3, EMR, Lambda, Athena, Airflow, SQL, YARN

Database: Oracle 11g/12c, SQL Server, RDS

Datawarehouse Technologies: Redshift, Snowflake

NoSQL Databases: HBase, MongoDB, Cassandra

Programming Languages: Python, Java, C, JavaScript

Scripting Languages: HTML, XML, XSL, CSS, JSON, Shell Scripting

Operating System: Windows, LINUX

Libraries: xml, Selenium, NumPy, Pandas, TensorFlow, Keras, SkLearn, Matplotlib

PROFESSIONAL EXPERIENCE

Confidential, Plano, TX

Senior Data Engineer

Environment: AWS EMR, EC2, S3, IAM, PySpark, YARN, Airflow, Python & shell scripting, Git, Jenkins, Athena, and Snowflake

Responsibilities:

  • Worked on large data files (Parquet format) using PySpark.
  • Stored data in S3 buckets and performed reads, transformations, and actions on S3 data using Spark DataFrames and the Spark SQL context in PySpark.
  • Developed a rejection framework to filter the data according to the business rules (a minimal sketch follows this list).
  • Developed an alert mechanism to notify stakeholders about data processing results in Slack.
  • Wrote Python scripts to parse XML and JSON data.
  • Wrote Python scripts to read data from APIs.
  • Created Hive external tables; applied partitioning and bucketing techniques and performed joins on Hive tables.
  • Involved in designing Hive schemas, using performance tuning techniques such as partitioning and bucketing.
  • Built common scripts to load data into the final objects in Snowflake by creating external and managed tables, enabling product analysts and reporting developers to read and extract data from Snowflake.
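
A condensed sketch of the rejection-framework idea above, under assumed bucket names, column names, and a single hard-coded business rule: read Parquet from S3, split records into accepted and rejected sets, and write both back to S3 for downstream use. Keeping the rejects with a reason column is what makes downstream alerting and auditing possible.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("rejection-framework-sketch").getOrCreate()

    orders = spark.read.parquet("s3://example-raw-bucket/orders/")   # hypothetical bucket

    # Example rule: a record needs a positive amount and a customer id.
    is_valid = (F.col("amount") > 0) & F.col("customer_id").isNotNull()

    accepted = orders.filter(is_valid)
    rejected = (orders.filter(~is_valid)
                      .withColumn("reject_reason",
                                  F.when(F.col("amount") <= 0, "non_positive_amount")
                                   .otherwise("missing_customer_id")))

    accepted.write.mode("overwrite").parquet("s3://example-curated-bucket/orders/")
    rejected.write.mode("overwrite").parquet("s3://example-rejects-bucket/orders/")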

Confidential, Irvine, CA

Data Engineer

Environment: AWS EMR, EC2, S3, IAM, PySpark, YARN, Airflow, Python, Redshift, Selenium, Git.

Responsibilities:

  • Experienced in working with Elastic MapReduce (EMR).
  • Worked on large data files (Parquet format) using PySpark.
  • Created Redshift tables with different distribution styles to improve query performance.
  • Created an end-to-end ETL pipeline for data processing for downstream users using PySpark.
  • Used Spark DataFrame operations to perform the required validations on the data.
  • Extracted files from various databases through Sqoop and placed them in S3 for further processing.
  • Experienced with batch processing of data sources using PySpark.
  • Experienced in designing databases and data warehouses.
  • Experience in data modelling and table design.
  • Imported data from AWS S3 into DataFrames and performed transformations and actions on them.
  • Worked with the xml, lxml, etree, pandas, selenium, and nltk libraries in Python for data validation, reading, and scraping.
  • Performed Spark join optimizations; troubleshot, monitored, and wrote efficient code using PySpark (see the broadcast-join sketch after this list).
  • Trained and led a team of associates and colleagues in understanding the functionality of the framework and upskilled them in Hadoop, Spark, and big data technologies.
  • Used the Selenium library to build a Python application that automates manual work.
  • Used GitHub as the common repository for code sharing.
  • Maintained timely delivery in every sprint.
  • Built dashboards in Power BI and deployed them in Power BI report server.
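
The join-optimization bullet above can be illustrated with a broadcast join, a standard PySpark technique for avoiding a shuffle when one side of the join is small. The table names and S3 paths here are placeholders rather than the actual datasets.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("join-optimization-sketch").getOrCreate()

    transactions = spark.read.parquet("s3://example-bucket/transactions/")  # large fact data
    stores       = spark.read.parquet("s3://example-bucket/stores/")        # small dimension

    # Broadcasting the small side turns a shuffle join into a map-side join.
    enriched = transactions.join(F.broadcast(stores), on="store_id", how="left")

    enriched.write.mode("overwrite").parquet("s3://example-bucket/enriched_transactions/")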

Confidential, Bridgewater, NJ

Data Engineer

Environment: Hadoop, Map Reduce, HDFS, Hive, HBase, Sqoop, SQL, PySpark, AWS S3, AWS EMR, JIRA, Git.

Responsibilities:

  • Developed the project from scratch using Python and PySpark.
  • Organized the folder structure in HDFS per the requirements of the incoming source files at the staging area.
  • Developed a framework to load and transform large sets of unstructured data into Hive tables.
  • Applied Hive partitioning and bucketing and performed joins on Hive tables.
  • Worked with the data ingestion team to create data pipeline solutions for loading data from different sources (RDBMS, CSV, and XML) to UNIX and onward to HDFS.
  • Developed complex logic for the framework, which automatically generates multiple reports based on metadata.
  • Audited report generation status using Hive-HBase tables.
  • Trained and led team associates and colleagues in understanding the functionality of the framework and upskilled them in Hadoop, Spark, and big data technologies.
  • Implemented SCD logic while loading Hive stage tables.
  • Analyzed requirements and prepared the design documents.
  • Stored the files received in the HDFS landing directory and processed them using PySpark (a minimal sketch follows this list).
  • Built a monitoring dashboard to continuously monitor batch jobs.
  • Played a major role in the continuous build, test, and integration process.
  • Actively participated in code reviews and meetings, troubleshooting and resolving technical issues.
  • Attended training on the basics of Spark RDDs, Spark DataFrames, and Spark SQL.
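
A minimal sketch of the landing-to-stage step referenced above, with assumed paths, columns, and table names: pick up files from the HDFS landing directory, stamp them with a load date, and append them to a partitioned Hive stage table.

    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("landing-to-stage-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Hypothetical landing path; actual sources included RDBMS extracts, CSV, and XML.
    landed = spark.read.option("header", True).csv("hdfs:///landing/customers/")

    staged = landed.withColumn("load_date", F.current_date())

    (staged.write
           .mode("append")
           .partitionBy("load_date")
           .saveAsTable("stage_db.customers_stg"))   # hypothetical Hive stage table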

Confidential

Application Development Associate

Environment: SQL, SQL Server, SSMS, Tableau, Python, Selenium.

Responsibilities:

  • Built out the data and reporting infrastructure using Tableau and SQL to provide real-time insights into the performance of interfaces and business KPIs.
  • Worked as the liaison between developers and client executives to implement technical migrations of the interface and to gather and analyze requirements for new interfaces.
  • Analyzed the performance of existing interfaces and performed code enhancements, bug fixes, and SQL query modifications that improved interface performance by 40%.
  • Automated web browser and Windows application activities using Python and Selenium to eliminate manual tasks for the team, saving 6 hours of manual effort per day (a minimal sketch follows this list).
  • Spearheaded weekly Scrum meetings and client coordination calls, and functioned as Change Manager for several interface deployments into the production environment.
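
A minimal Selenium sketch of the kind of browser automation described above; the URL and element locators are hypothetical placeholders, not the client's actual pages.

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()                       # assumes a local ChromeDriver setup
    try:
        driver.get("https://example.com/reports")     # hypothetical internal page
        driver.find_element(By.ID, "report-date").send_keys("2020-01-31")
        driver.find_element(By.ID, "download-btn").click()
    finally:
        driver.quit()                                 # always release the browser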
