
Sr Data Engineer Resume


CA

SUMMARY

  • 8 years of professional experience in Hadoop, AWS/Azure data engineering, data science, and big data implementation, using PySpark for ingestion, storage, querying, processing, and analysis of big data.
  • Expertise in programming with Python, Spark, and SQL; good understanding of data wrangling using Pandas and NumPy.
  • Experience with the Azure data platform stack: Azure Data Lake, Data Factory, and Databricks.
  • Practical experience with AWS technologies such as EC2, Lambda, EBS, EKS, ELB, VPC, IAM, Route 53, Auto Scaling, load balancing, GuardDuty, AWS Shield, AWS Web Application Firewall (WAF), network access control lists (NACLs), S3, SES, SQS, SNS, AWS Glue, QuickSight, SageMaker, Kinesis, Redshift, RDS, DynamoDB, Datadog, and ElastiCache (Memcached & Redis).
  • Extracted metadata from Amazon Redshift and Elasticsearch on AWS using SQL queries to create reports.
  • Developed and implemented data solutions utilizing Azure services such as Event Hubs, Azure Data Factory, ADLS, Databricks, Azure Web Apps, and Azure SQL DB instances.
  • Working knowledge of AWS CI/CD services such as CodeCommit, CodeBuild, CodePipeline, and CodeDeploy, and of creating CloudFormation templates for infrastructure as code; used Control Tower to create and administer a multi-account AWS environment following best practices.
  • Implemented AWS Lambda functions to drive real-time monitoring dashboards from system logs.
  • Experienced in running Spark jobs on AWS EMR and in selecting EMR cluster and EC2 instance types based on requirements.
  • Developed PySpark scripts interacting with data sources such as AWS RDS, S3, and Kinesis and with distributed file formats such as ORC, Parquet, and Avro (an illustrative sketch follows this summary).
  • Experience with AWS Multi-Factor Authentication (MFA) for RDP/SSO logon; worked with teams to lock down security groups and build group-specific IAM profiles, using recently released APIs to restrict resources within AWS by group or user.
  • Configured Jenkins CI/CD pipelines for various projects, including automated builds, automatic promotion of builds between environments, code analysis, and automatic versioning.
  • Worked in a highly collaborative operations team to streamline the process of implementing security in the Confidential Azure cloud environment and introduced best practices for remediation.
  • Hands-on experience with Azure Data Lake, Azure Data Factory, Azure Blob Storage, and Azure Storage Explorer.
  • Created Splunk dashboards for CloudWatch logs, monitored the whole environment using glass tables, and maintained regular alerts.
  • Experience using various Amazon Web Services (AWS) components such as EC2 for virtual servers, S3 and Glacier for object storage, and EBS, CloudFront, ElastiCache, and DynamoDB for storing data.
  • Experienced in building automated regression scripts in Python to validate ETL processes across multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Involved in troubleshooting production related bugs and issues.
  • Experience in configuration management, setting up company versioning policies, and build schedules using SVN and Git.
  • Good experience with use-case development and with software methodologies such as Agile and Waterfall.
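A minimal, illustrative PySpark sketch of the S3 and columnar-format work described in this summary; the bucket names, paths, and column names below are hypothetical placeholders rather than details from the actual projects:

    # Illustrative sketch only: read raw Parquet from S3, apply a light
    # transformation, and write curated ORC back to S3. Bucket names, paths,
    # and columns are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-ingest-sketch").getOrCreate()

    # Assumes the Spark/EMR cluster already has S3 credentials configured,
    # e.g. through an EC2 instance profile.
    raw_df = spark.read.parquet("s3://example-raw-bucket/events/")

    clean_df = (raw_df
                .filter(F.col("event_ts").isNotNull())
                .withColumn("event_date", F.to_date("event_ts")))

    # Partitioning by date keeps downstream queries from scanning everything.
    (clean_df.write
             .mode("overwrite")
             .partitionBy("event_date")
             .orc("s3://example-curated-bucket/events/"))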

PROFESSIONAL EXPERIENCE

Sr Data Engineer

Confidential, CA

Responsibilities:

  • Implemented AWS Lambda functions to drive real-time monitoring dashboards for Kinesis streams.
  • Involved in data warehouse design, data integration, and data transformation using Apache Spark and Python.
  • Created and set up EMR clusters to run data engineering workloads and to support data scientists.
  • Experience in data warehouse modelling techniques such as Kimball modelling.
  • Experience in conceptual, logical, and physical data modelling.
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Utilized Azure Synapse and Azure Databricks to create data pipelines in Azure.
  • Developed and implemented data solutions utilizing Azure services such as Event Hubs, Azure Data Factory, ADLS, Databricks, Azure Web Apps, and Azure SQL DB instances.
  • Involved in setting up automated jobs and deploying machine learning model using Azure DevOps pipelines.
  • Involved in the design and deployment of a multitude of cloud services on the AWS stack, such as Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM, EC2, EMR, and Redshift, while focusing on high availability, fault tolerance, and auto-scaling in AWS CloudFormation.
  • Worked with Athena, AWS Glue, and QuickSight for querying and visualization purposes.
  • Created data pipelines using Data Factory and Databricks for ETL processing.
  • Retrieved data from DBFS into Spark DataFrames for running predictive analytics on the data.
  • Used HiveContext, which provides a superset of the functionality provided by SQLContext, and preferred to write queries using the HiveQL parser to read data from Hive tables.
  • Modelled Hive partitions extensively for data separation and faster data processing and followed Hive best practices for tuning.
  • Hands-on data engineering experience with Scala, Hadoop, EMR, Spark, and Kafka.
  • Knowledge of AWS, Kubernetes, and production support/troubleshooting.
  • Experience in Exploratory Data Analysis (EDA), feature engineering, and data visualization.
  • Cached RDDs for better performance and performed actions on each RDD.
  • Developed highly complex yet maintainable and easy-to-use Python code that satisfies application requirements for data processing and analytics using built-in libraries.
  • Involved in designing and optimizing Spark SQL queries and DataFrames: importing data from data sources, performing transformations and read/write operations, and saving the results to output directories in HDFS.
  • Worked with the Kafka REST API to collect and load data onto the Hadoop file system and used Sqoop to load data from relational databases.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS (a minimal sketch follows this list).
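A minimal sketch of the Kafka-to-Parquet flow described in the last bullet, assuming Spark 2.x with the spark-streaming-kafka-0-8 connector; the broker list, topic, and JSON payload are hypothetical:

    # Illustrative sketch: consume a real-time Kafka feed with Spark Streaming,
    # convert each micro-batch RDD to a DataFrame, and append it to HDFS as
    # Parquet. Assumes Spark 2.x with the kafka-0-8 package on the classpath;
    # brokers, topic, and payload format are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    spark = SparkSession.builder.appName("kafka-to-parquet-sketch").getOrCreate()
    ssc = StreamingContext(spark.sparkContext, batchDuration=30)

    stream = KafkaUtils.createDirectStream(
        ssc, ["events"], {"metadata.broker.list": "broker1:9092,broker2:9092"})

    def save_batch(rdd):
        # Each record is a (key, value) pair; the value is a JSON string here.
        if not rdd.isEmpty():
            df = spark.read.json(rdd.map(lambda kv: kv[1]))
            df.write.mode("append").parquet("hdfs:///data/events_parquet/")

    stream.foreachRDD(save_batch)

    ssc.start()
    ssc.awaitTermination()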

Environment: PySpark, Hive, Sqoop, Kafka, Python, Spark Streaming, DBFS, SQLContext, Spark RDD, REST API, Spark SQL, Hadoop, Parquet files, Oracle, SQL Server.

Data Engineer

Confidential

Responsibilities:

  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Developed a PySpark script to mask raw data by applying hashing algorithms to client-specified columns (a minimal sketch follows this list).
  • Responsible for the design, development, and testing of the database; developed stored procedures and views.
  • Developed a Python-based API (RESTful web service) to track revenue and perform revenue analysis.
  • Compiled and validated data from all departments and presented it to the Director of Operations.
  • Built a KPI (Key Performance Indicator) calculator sheet and maintained it within SharePoint.
  • Created reports with complex calculations, designed dashboards for analysing POS data, developed visualizations, and worked on ad-hoc reporting using Tableau.
  • Created a data model that correlates all the metrics and produces valuable output.
  • Designed Spark-based real-time data ingestion and real-time analytics; created a Kafka producer in Python to synthesize alarms; used Spark SQL to load JSON data, create SchemaRDDs, and load them into Hive tables; and handled structured data using Spark SQL.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python.
  • Developed data pipelines using Spark, Hive, Pig, and Python to ingest customer data.
  • Loaded and transformed large sets of structured, semi structured, and unstructured data using Hadoop/Big Data concepts. Developed Hive and MapReduce tools to design and manage HDFS data blocks and data distribution methods.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Responsible for building scalable distributed data solutions using Amazon EMR cluster environments.
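A minimal sketch of the column-hashing approach in the first bullet of this list; the paths and column list are hypothetical, and SHA-256 stands in for whichever hashing algorithm the client specified:

    # Illustrative sketch: hash client-specified columns of a raw dataset.
    # Paths and the column list are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("column-hashing-sketch").getOrCreate()

    # Columns the client asked to protect (illustrative names only).
    sensitive_columns = ["ssn", "email", "phone_number"]

    raw_df = spark.read.parquet("s3://example-raw-bucket/customers/")

    masked_df = raw_df
    for column in sensitive_columns:
        # sha2() returns the hex digest of the value; 256 selects SHA-256.
        masked_df = masked_df.withColumn(
            column, F.sha2(F.col(column).cast("string"), 256))

    masked_df.write.mode("overwrite").parquet("s3://example-masked-bucket/customers/")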

Environment: Spark SQL, PySpark, SQL, RESTful web services, Tableau, Kafka, JSON, Hive, Pig, Hadoop, HDFS, MapReduce, S3, Redshift, AWS Data Pipeline, Amazon EMR.

Data Engineer

Confidential, Evansville, IN

Responsibilities:

  • Defined data contracts and specifications, including REST APIs.
  • Worked on relational database modelling concepts in SQL and performed query performance tuning.
  • Worked on Hive metastore backups and on partitioning and bucketing techniques in Hive to improve performance, and tuned Spark jobs (a minimal sketch follows this list).
  • Responsible for building and running resilient data pipelines in production; implemented ETL/ELT to load a multi-terabyte enterprise data warehouse.
  • Worked closely with the data science team to understand requirements clearly and created Hive tables on HDFS.
  • Developed Spark scripts using Python as per the requirements.
  • Solved performance issues in Spark through an understanding of groupings, joins, and aggregations.
  • Scheduled Spark jobs in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
  • Experience in using the EMR cluster and various EC2 instance types based on requirements.
  • Responsible for loading data from UNIX file systems to HDFS; installed and configured Hive and wrote Hive UDFs.
  • Responsible for creating on-demand tables on S3 files with Lambda functions written in Python and PySpark.
  • Designed and developed a MapReduce program to analyse and evaluate multiple solutions, considering multiple cost factors across the business as well as the operational impact, on historical flight data.
  • Created an end-to-end ETL pipeline in PySpark for data processing that fed business dashboards.
  • Developed Spark programs using Python APIs to compare the performance of Spark with Hive and SQL, and generated reports on a daily and monthly basis.
  • Developed dataflows and processes for data processing using SQL (Spark SQL and DataFrames).
  • Understood business requirements and prepared design documents; handled coding, testing, and go-live in the production environment.
  • Implemented analytics applications using multiple database technologies, such as relational, multidimensional (OLAP), key-value, document, and graph.
  • Built cloud-native applications using supporting technologies and practices including AWS, Docker, CI/CD, and microservices.
  • Involved in planning process of iterations under the Agile Scrum methodology.
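A minimal sketch of the Hive partitioning work noted earlier in this list, run through Spark with Hive support; the database, table, and column names are hypothetical, and bucketed tables of this era were typically loaded through Hive itself rather than through Spark:

    # Illustrative sketch: create a date-partitioned Hive table and load it
    # from a staging table with dynamic partitioning. Database, table, and
    # column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Partitioning by event_date lets queries prune whole directories.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS analytics.events_curated (
            customer_id BIGINT,
            event_type  STRING,
            amount      DOUBLE
        )
        PARTITIONED BY (event_date STRING)
        STORED AS ORC
    """)

    # Dynamic partitioning derives event_date values from the data itself.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    spark.sql("""
        INSERT OVERWRITE TABLE analytics.events_curated PARTITION (event_date)
        SELECT customer_id, event_type, amount, event_date
        FROM analytics.events_staging
    """)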

Environment: Hive, PySpark, HDFS, Python, EMR, EC2, UNIX, S3 files, SQL, MapReduce, ETL/ELT, Docker, REST API, Agile Scrum, OLAP (Online Analytical Processing).

Hadoop Developer

Confidential

Responsibilities:

  • Set up and built AWS infrastructure across various services by writing CloudFormation templates (CFTs) in JSON and YAML.
  • Developed CloudFormation scripts to build EC2 instances on demand.
  • Created IAM roles, users, and groups and attached policies to provide least-privilege access to resources.
  • Updated bucket policies with IAM roles to restrict user access.
  • Configured AWS Identity and Access Management (IAM) groups and users for improved login authentication.
  • Created topics in SNS to send notifications to subscribers as per the requirement.
  • Involved in the full project life cycle, from design and analysis through logical and physical architecture modeling, development, implementation, and testing.
  • Moved data from Oracle to HDFS using Sqoop.
  • Performed periodic data profiling on critical tables to check for abnormalities.
  • Created Hive tables, loaded transactional data from Oracle using Sqoop, and worked with highly unstructured and semi-structured data.
  • Developed MapReduce (YARN) jobs for cleaning, accessing, and validating the data.
  • Created and ran Sqoop jobs with incremental loads to populate Hive external tables.
  • Wrote scripts to distribute queries for performance-test jobs in the Amazon data lake.
  • Developed optimal strategies for distributing web log data over the cluster, importing and exporting the stored web log data into HDFS and Hive using Sqoop.
  • Installed and configured Apache Hadoop across multiple nodes on AWS EC2.
  • Developed Pig Latin scripts to replace the existing legacy process on Hadoop; the resulting data is fed to AWS S3.
  • Worked on CDC (Change Data Capture) tables using a Spark application to load data into dynamic-partition-enabled Hive tables (a minimal sketch follows this list).
  • Designed and developed automation test scripts using Python.
  • Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data from Kafka to HDFS.
  • Analyzed the SQL scripts and designed the solution to be implemented using PySpark.
  • Implemented Hive generic UDFs to incorporate business logic into Hive queries.
  • Responsible for developing a data pipeline on AWS to extract data from web logs and store it in HDFS.
  • Uploaded streaming data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
  • Supported data analysis projects using Elastic MapReduce on the Amazon Web Services (AWS) cloud; performed export and import of data to and from S3.
  • Involved in designing HBase row keys to store text and JSON as key values in HBase tables, structuring the row key so that data can be retrieved and scanned in sorted order.
  • Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Map-Reduce, Pig, Hive, and Sqoop) as well as system specific jobs (such as Java programs and shell scripts).
  • Created Hive tables and worked on them using HiveQL.
  • Designed and implemented static and dynamic partitioning and bucketing in Hive.
  • Developed multiple POCs using PySpark, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
  • Developed syllabus/curriculum data pipelines from syllabus/curriculum web services to HBase and Hive tables.
  • Monitored workload, job performance and capacity planning using Cloudera Manager.
  • Involved in building applications using Maven and integrating with CI servers such as Jenkins to run build jobs.
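A minimal sketch of the CDC-to-Hive load referenced in the CDC bullet above; the staging path, table names, and change-flag column are hypothetical:

    # Illustrative sketch: load change-data-capture (CDC) records into a
    # dynamic-partition-enabled Hive table with Spark. Paths, table names,
    # and columns are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("cdc-hive-load-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Allow partitions to be created from the incoming data.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    # CDC extract staged on HDFS; op_type marks inserts/updates/deletes.
    cdc_df = spark.read.parquet("hdfs:///staging/orders_cdc/")

    upserts_df = (cdc_df
                  .filter(F.col("op_type").isin("I", "U"))
                  .withColumn("load_date", F.to_date("change_ts")))

    # insertInto resolves columns by position, so the partition column
    # (load_date) must come last to match the Hive table definition.
    (upserts_df
     .select("order_id", "status", "amount", "load_date")
     .write
     .mode("append")
     .insertInto("sales.orders_history"))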

SQL Developer/Hadoop Analyst

Confidential

Responsibilities:

  • Participated in SDLC Requirements gathering, Analysis, Design, Development and Testing of application developed using AGILE methodology.
  • Developed managed, external, and partitioned tables as per the requirements.
  • Ingested structured data into appropriate schemas and tables to support the rules and analytics.
  • Developed custom user-defined functions (UDFs) in Hive to transform large volumes of data per business requirements.
  • Gathered business requirements and converted them into new T-SQL stored procedures in Visual Studio for a database project.
  • Performed unit tests on all code and packages.
  • Analyzed requirements and impact by participating in online Joint Application Development sessions with the business client.
  • Performed and automated SQL Server version upgrades, patch installs and maintained relational databases.
  • Performed front line code reviews for other development teams.
  • Modified and maintained SQL Server stored procedures, views, ad-hoc queries, and SSIS packages used in the search engine optimization process.
  • Updated existing and created new reports using Microsoft SQL Server Reporting Services. Team consisted of 2 developers.
  • Created files, views, tables, and data sets to support the Sales Operations and Analytics teams.
  • Monitored and tuned database resources and activities for SQL Server databases.
  • Developed Pig scripts, Pig UDFs, Hive scripts, and Hive UDFs to load data files.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Involved in loading data from edge nodes to HDFS using shell scripting.
  • Implemented scripts for loading data from the UNIX file system to HDFS (a minimal sketch follows this list).
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data.
  • Analyzed large data sets to determine the optimal way to aggregate and report on them.
  • Actively participated in object-oriented analysis and design sessions for the project, which is based on MVC architecture using the Spring Framework.
  • Developed the presentation layer using HTML, CSS, JSPs, Bootstrap, and AngularJS.
  • Adopted J2EE design patterns like DTO, DAO, Command and Singleton.
  • Implemented object-relational mapping in the persistence layer using the Hibernate framework in conjunction with Spring functionality.
  • Generated POJO classes to map to the database table.
  • Configured Hibernate's second-level cache using EHCache to reduce the number of hits to the configuration table data.
  • Used the ORM tool Hibernate to represent entities and define fetching strategies for optimization.
  • Implemented transaction management in the application by applying Spring Transaction and Spring AOP methodologies.
  • Wrote SQL queries and stored procedures for the application to communicate with the database.
  • Used the JUnit framework for unit testing of the application.
  • Used Maven to build and deploy the application.
  • Involved in client meetings, explaining the views to support requirements gathering.
  • Worked in an agile methodology, understanding the requirements of the user stories.
  • Prepared high-level design documentation for approval.
  • Used data visualization software such as Tableau, QuickSight, and Kibana to bring new insights from the extracted data and to better represent the data.
  • Designed data models for dynamic and real-time data, intended to be used by various applications with OLAP and OLTP.
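A minimal sketch of the UNIX-to-HDFS load scripts mentioned earlier in this list, assuming the Hadoop hdfs client is on the PATH; the local and HDFS paths are hypothetical:

    # Illustrative sketch: push files from a local UNIX directory into HDFS by
    # shelling out to the Hadoop CLI. Assumes the hdfs client is installed and
    # permissions are in place; paths are hypothetical placeholders.
    import glob
    import subprocess

    LOCAL_DIR = "/data/exports/daily"
    HDFS_DIR = "/landing/daily"

    def load_to_hdfs(local_dir, hdfs_dir):
        # Create the target directory if it does not already exist.
        subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)

        for path in sorted(glob.glob(local_dir + "/*.csv")):
            # -f overwrites a same-named file left by a previous failed run.
            subprocess.run(["hdfs", "dfs", "-put", "-f", path, hdfs_dir], check=True)
            print("Loaded " + path + " -> " + hdfs_dir)

    if __name__ == "__main__":
        load_to_hdfs(LOCAL_DIR, HDFS_DIR)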
