
Data Engineer Resume


Columbus, OH

SUMMARY

  • Around 8 years of professional Hadoop experience and 3+ years in AWS/Azure data engineering, data science, and big data implementation, utilizing PySpark for ingestion, storage, querying, processing, and analysis of big data.
  • Expertise in programming across multiple technologies, i.e., Python, Java, and SQL. Good understanding of data wrangling concepts using Pandas and NumPy.
  • Experienced in a fast-paced SAFe Agile development environment including Test-Driven Development (TDD) and Scrum.
  • Practical experience with AWS technologies such as EC2, Lambda, EBS, EKS, ELB, VPC, IAM, Route 53, Auto Scaling, Load Balancing, GuardDuty, AWS Shield, AWS Web Application Firewall (WAF), Network Access Control List (NACL), S3, SES, SQS, SNS, AWS Glue, QuickSight, SageMaker, Kinesis, Redshift, RDS, DynamoDB, Datadog, and ElastiCache (Memcached & Redis).
  • Extracted metadata from Amazon Redshift and Elasticsearch using SQL queries to create reports.
  • Working knowledge of AWS CI/CD services such as CodeCommit, CodeBuild, CodePipeline, and CodeDeploy, and of creating CloudFormation templates for infrastructure as code. Used AWS Control Tower to create and administer our multi-account AWS infrastructure following best practices.
  • Developed and implemented data solutions utilizing Azure services such as Event Hub, Azure Data Factory, ADLS, Databricks, Azure Web Apps, and Azure SQL DB instances.
  • Implemented AWS Lambda functions to drive real-time monitoring dashboards from system logs.
  • Experienced in running Spark jobs on AWS EMR, using the EMR cluster and various EC2 instance types based on requirements.
  • Developed PySpark scripts interacting with various data sources such as AWS RDS, other relational databases, S3, and Kinesis, and with distributed file formats such as ORC, Parquet, and Avro (see the ingestion sketch after this list).
  • Experience with AWS Multi-Factor Authentication (MFA) for RDP/SSO logon, working with teams to lock down security groups and build specific IAM profiles per group using recently released APIs for restricting resources within AWS depending on group or user.
  • Configured Jenkins CI/CD pipelines that trigger automated builds, promote builds from one environment to another, run code analysis, auto-version artifacts, etc. for various projects.
  • Worked in a highly collaborative operations team to streamline the process of implementing security in the Confidential Azure cloud environment and introduced best practices for remediation.
  • Created Splunk dashboards for CloudWatch logs, monitored the whole environment using glass tables, and worked on regular alerts.
  • Experience performing the data analysis required to troubleshoot data-related issues and assist in their resolution.
  • Experience in using various Amazon Web Services (AWS) components such as EC2 for virtual servers, S3 and Glacier for storing objects, and EBS, CloudFront, ElastiCache, and DynamoDB for storing data.
  • Hands on experience with Azure Data Lake, Azure Data Factory, Azure Blob and Azure Storage Explorer.
  • Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
  • Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Experience in configuration management, setting up company versioning policies and build schedules using SVN and Git.
  • Good experience with use-case development and with software methodologies such as Agile and Waterfall.
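
A minimal PySpark ingestion sketch of the kind described above, reading Parquet from S3 and writing partitioned ORC back out. The bucket names, paths, and column names are hypothetical placeholders, not taken from any actual project.

```python
from pyspark.sql import SparkSession

# Minimal ingestion sketch; bucket names, paths, and column names are hypothetical.
spark = (
    SparkSession.builder
    .appName("s3-ingestion-sketch")
    .getOrCreate()
)

# Read a Parquet dataset landed in S3.
orders = spark.read.parquet("s3a://example-raw-bucket/orders/")

# Light wrangling with the DataFrame API, then write the result back as partitioned ORC.
daily = (
    orders
    .filter(orders["status"] == "COMPLETE")   # 'status' is an assumed column
    .groupBy("order_date")                    # 'order_date' is an assumed column
    .count()
)

(
    daily.write
    .mode("overwrite")
    .partitionBy("order_date")
    .orc("s3a://example-curated-bucket/daily_orders/")
)
```

On an EMR cluster the S3 paths resolve through the instance profile, so no credentials need to appear in the script.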

TECHNICAL SKILLS

Cloud: AWS, Azure (Azure Databricks)

Testing methods: Selenium 2.0, HP QTP 11.0, and SOAP UI.

RDBMS: Oracle, SQL Server, DB2, MySQL, PostgreSQL (pgAdmin), Redshift, Cosmos DB

Languages: Apache Spark, Python, SQL, PL/SQL, HTML, DHTML, UML

Version tools: SVN, GIT

Automation Tools: Jenkins, Azure DevOps, CodePipeline

Scripting languages: Python, Shell scripting, PowerShell scripting, YAML, JSON

Agile Tool: JIRA

Infrastructure as Code: CloudFormation, Terraform

PROFESSIONAL EXPERIENCE

Confidential - Columbus, OH

Data Engineer

Responsibilities:

  • Implemented AWS Lambda functions to drive a real-time monitoring dashboard of Kinesis streams (see the Lambda sketch after this list).
  • Involved in data warehouse design, data integration, and data transformation using Apache Spark and Python.
  • Created and set up EMR clusters for running data engineering workloads and supporting data scientists.
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Involved in setting up automated jobs and deploying machine learning model using Azure DevOps pipelines.
  • Worked on developing, building, and maintaining a highly available, secure, multi-zone AWS cloud environment.
  • Involved in the design and deployment of a multitude of cloud services on the AWS stack, such as Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM, while focusing on high availability, fault tolerance, and auto-scaling using AWS CloudFormation.
  • Excellent knowledge and experience in OLTP/OLAP system study with a focus on the Oracle Hyperion suite of technology, developing database schemas such as star schema and snowflake schema (fact tables, dimension tables) used in relational, dimensional, and multidimensional modelling, and in physical and logical data modelling.
  • Experience with Online Analytical Processing (OLAP) tools, designing data marts and data warehouses using star and snowflake schemas to implement decision support systems, with fact and dimension table modelling of data at all three levels: view, logical, and physical.
  • Worked extensively on importing metadata into Hive, migrated existing tables and applications to Hive and the AWS cloud, and made the data available in Athena and Snowflake.
  • Extensively used Stash (Bitbucket) for code control and worked on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
  • Extensive experience in big data analytics with hands-on experience writing MapReduce jobs on the Hadoop ecosystem, including Hive, Pig, HBase, Sqoop, Impala, Oozie, Airflow, Zookeeper, Spark, Kafka, Cassandra, and Flume.
  • Designed SSIS Packages to extract, transfer, and load (ETL) existing data into SQL Server from different environments for the SSAS cubes (OLAP)
  • Built SQL Server Reporting Services (SSRS) reports; created and formatted cross-tab, conditional, drill-down, top-N, summary, form, OLAP, sub-reports, ad-hoc, parameterized, interactive, and custom reports.
  • Implemented tokenization using DTAAS and Privateer to support REST-level, file-level, and field-level encryption.
  • Migrated data from on-premises to the cloud using command-line tools such as AzCopy, Azure PowerShell, and Azure CLI.
  • Built Kafka Connect integrations: AWS S3 sink, Salesforce, Splunk, and HDFS connectors.
  • Integrated CMP (Marketplace) with the Kafka Admin APIs.
  • Worked on a data lake in AWS S3, copied data to Redshift, and implemented business logic with custom SQL orchestrated by Unix and Python scripts for analytics solutions.
  • Developed Schema Registry APIs and implemented them through to production.
  • Developed and implemented data solutions utilizing Azure services such as Event Hub, Azure Data Factory, ADLS, Databricks, Azure Web Apps, and Azure SQL DB instances.
  • Worked in Athena, AWS Glue, and QuickSight for visualization purposes.
  • Retrieved data from DBFS into Spark DataFrames for running predictive analytics on the data.
  • Used HiveContext, which provides a superset of the functionality of SQLContext, and preferred writing queries using the HiveQL parser to read data from Hive tables.
  • Modelled Hive partitions extensively for data separation and faster data processing, and followed Hive best practices for tuning.
  • Developed Spark scripts by writing custom RDDs in Python for data transformations and performed actions on the RDDs.
  • Cached RDDs for better performance and performed actions on each RDD.
  • Developed highly complex Python code that is maintainable, easy to use, and satisfies application requirements for data processing and analytics using built-in libraries.
  • Involved in designing and optimizing Spark SQL queries and DataFrames: importing data from data sources, performing transformations, performing read/write operations, and saving the results to an output directory in HDFS.
  • Worked on the Kafka REST API to collect and load data onto the Hadoop file system and used Sqoop to load data from relational databases.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS (see the streaming sketch after this list).
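
A minimal sketch of the Lambda referenced earlier in this list: a function triggered by a Kinesis stream that publishes a custom CloudWatch metric a dashboard can chart. The metric namespace and the latency_ms field are assumptions made for illustration.

```python
import base64
import json

import boto3

cloudwatch = boto3.client("cloudwatch")


def handler(event, context):
    """Triggered by a Kinesis stream; publishes one custom CloudWatch metric per batch.

    The 'Example/Streaming' namespace and the 'latency_ms' field are assumptions.
    """
    latencies = []
    for record in event["Records"]:
        # Kinesis delivers record data base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if "latency_ms" in payload:
            latencies.append(float(payload["latency_ms"]))

    if latencies:
        cloudwatch.put_metric_data(
            Namespace="Example/Streaming",
            MetricData=[{
                "MetricName": "AvgLatencyMs",
                "Value": sum(latencies) / len(latencies),
                "Unit": "Milliseconds",
            }],
        )
    return {"records_processed": len(event["Records"])}
```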
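And a sketch of the Kafka-to-Parquet flow from the final bullet above, shown here with Structured Streaming rather than the older DStream/RDD API; the broker address, topic name, event schema, and HDFS paths are all assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# Requires the spark-sql-kafka connector on the classpath.
# Broker address, topic name, event schema, and HDFS paths are assumptions.
spark = SparkSession.builder.appName("kafka-to-parquet-sketch").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka delivers bytes; parse the value column into typed fields.
events = (
    raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Continuously append the parsed events to HDFS as Parquet.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/events_parquet")
    .option("checkpointLocation", "hdfs:///checkpoints/events_parquet")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```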

Environment: PySpark, Hive, Sqoop, Kafka, Python, Spark Streaming, DBFS, SQLContext, Spark RDD, REST API, Spark SQL, Hadoop, Parquet files, Oracle, SQL Server.

Confidential - Purchase, NY

Data Engineer

Responsibilities:

  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Worked on developing PySpark scripts to encrypt raw data by applying hashing algorithms to client-specified columns (a sketch follows this list).
  • Responsible for Design, Development, and testing of the database and Developed Stored Procedures and Views.
  • Developed a Python-based API (RESTful web service) to track revenue and perform revenue analysis.
  • Compiled and validated data from all departments and presented it to the Director of Operations.
  • Built a KPI (Key Performance Indicator) calculator sheet and maintained it within SharePoint.
  • Created reports with complex calculations, designed dashboards for analysing POS data, developed visualizations, and worked on ad-hoc reporting using Tableau.
  • Created a data model that correlates all the metrics and produces valuable output.
  • Designed Spark-based real-time data ingestion and real-time analytics; created a Kafka producer in Python to synthesize alarms; and used Spark SQL to load JSON data, create a SchemaRDD, load it into Hive tables, and handle structured data (see the JSON-to-Hive sketch after this list).
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
  • Developed data pipelines using Spark, Hive, Pig, and Python to ingest customer data.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/big data concepts. Developed Hive and MapReduce tools to design and manage HDFS data blocks and data distribution methods.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Responsible for building scalable distributed data solutions using Amazon EMR cluster environments.
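
A minimal sketch of the column-hashing approach mentioned earlier in this list, using Spark's built-in SHA-256 function; the input/output paths and the list of sensitive columns are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sha2

# Column-hashing sketch; the paths and the list of sensitive columns are hypothetical.
spark = SparkSession.builder.appName("column-hashing-sketch").getOrCreate()

raw = spark.read.parquet("s3a://example-raw-bucket/customers/")

# Replace each client-specified column with its SHA-256 digest so the raw values
# never leave the curated zone in clear text.
sensitive_columns = ["ssn", "email", "phone_number"]
hashed = raw
for name in sensitive_columns:
    hashed = hashed.withColumn(name, sha2(col(name).cast("string"), 256))

hashed.write.mode("overwrite").parquet("s3a://example-curated-bucket/customers_hashed/")
```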
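And a sketch of the JSON-to-Hive step: the original work used SchemaRDDs, so this is shown with the DataFrame API that replaced them; the input path and the database/table names are assumptions.

```python
from pyspark.sql import SparkSession

# JSON-to-Hive sketch using the DataFrame API (the successor to SchemaRDD).
# The input path and the database/table names are assumptions.
spark = (
    SparkSession.builder
    .appName("json-to-hive-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Spark infers the schema from the JSON records.
alarms = spark.read.json("hdfs:///data/raw/alarms/")

# Persist the structured result as a Hive table so it can be queried with Spark SQL / HiveQL.
alarms.write.mode("overwrite").saveAsTable("analytics.alarms")
```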

Environment: Spark SQL, PySpark, SQL, RESTful web services, Tableau, Kafka, JSON, Hive, Pig, Hadoop, HDFS, MapReduce, S3, Redshift, AWS Data Pipeline, Amazon EMR.

Confidential

Data Engineer

Responsibilities:

  • Defined data contracts, and specifications including REST APIs.
  • Worked on relational database modelling concepts in SQL, performed query performance tuning.
  • Worked on Hive metastore backups and on partitioning and bucketing techniques in Hive to improve performance and tune Spark jobs.
  • Responsible for building and running resilient data pipelines in production and implementing ETL/ELT to load a multi-terabyte enterprise data warehouse.
  • Worked closely with the data science team to understand requirements clearly and created Hive tables on HDFS.
  • Developed Spark scripts using Python as per the requirements.
  • Solved performance issues in Spark with an understanding of group-bys, joins, and aggregations.
  • Scheduled Spark jobs in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
  • Experience in using the EMR cluster and various EC2 instance types based on requirements.
  • Responsible for loading data from UNIX file systems to HDFS. Installed and configured Hive and wrote Hive UDFs.
  • Responsible for creating on-demand tables on S3 files using Lambda functions written in Python and PySpark (see the sketch after this list).
  • Worked on the development of SQL and stored procedures on MySQL.
  • Designed and developed horizontally scalable APIs using Python Flask (a minimal example follows this list).
  • Assisted with the development of web services using SOAP for sending data to and receiving data from the external interface in XML format.
  • Developed the required XML Schema documents and implemented the framework for parsing XML documents. Worked on performance tuning of SQL and PL/SQL and analyzed tables in Oracle.
  • Created UNIX scripts and Python code to generate timely e-receipts and invoices in a well-formatted manner.
  • Developed source code and executed bug fixes for several resource-intensive modules in Django.
  • Involved with the QA team during the testing phase.
  • Worked efficiently on building the user interface.
  • Designed and developed MapReduce programs to analyse and evaluate multiple solutions on historical flight data, considering multiple cost factors across the business as well as operational impact.
  • Created an end-to-end ETL pipeline in PySpark to process data for business dashboards.
  • Developed Spark programs using Python APIs to compare the performance of Spark with Hive and SQL, and generated reports on a monthly and daily basis.
  • Developed data flows and processes for data processing using SQL (Spark SQL & DataFrames).
  • Understood business requirements, prepared design documents, and handled coding, testing, and go-live in the production environment.
  • Implemented analytics applications using multiple database technologies, such as relational, multidimensional (OLAP), key-value, document, or graph.
  • Built cloud-native applications using supporting technologies and practices including AWS, Docker, CI/CD, and microservices.
  • Involved in planning process of iterations under the Agile Scrum methodology.
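
One way the on-demand tables over S3 files mentioned above could look as a Lambda function, using the Athena API to register an external table. The database, table, columns, bucket names, and results location are hypothetical; this is a sketch of the approach, not the project's exact implementation.

```python
import boto3

athena = boto3.client("athena")

# External table over files already landed in S3.
# Database, table, columns, bucket names, and the results location are all assumptions.
CREATE_TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.daily_orders (
    order_id   string,
    order_date string,
    amount     double
)
STORED AS PARQUET
LOCATION 's3://example-curated-bucket/daily_orders/'
"""


def handler(event, context):
    """Triggered, for example, by an S3 put event; registers the table so it can be queried on demand."""
    response = athena.start_query_execution(
        QueryString=CREATE_TABLE_DDL,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    return {"query_execution_id": response["QueryExecutionId"]}
```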
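And a minimal Flask sketch in the spirit of the horizontally scalable APIs mentioned above; the endpoints and fields are invented for illustration.

```python
from flask import Flask, jsonify, request

# Minimal Flask API sketch; the endpoints and fields are invented for illustration.
app = Flask(__name__)

_receipts = {}  # in-memory stand-in for an external data store


@app.route("/receipts", methods=["POST"])
def create_receipt():
    body = request.get_json(force=True)
    receipt_id = str(len(_receipts) + 1)
    _receipts[receipt_id] = body
    return jsonify({"id": receipt_id}), 201


@app.route("/receipts/<receipt_id>", methods=["GET"])
def get_receipt(receipt_id):
    receipt = _receipts.get(receipt_id)
    if receipt is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(receipt)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

In practice the in-memory dictionary would be an external store (RDS, DynamoDB, etc.) so each instance stays stateless and can scale out behind a load balancer.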

Environment: Hive, PySpark, HDFS, Python, EMR, EC2, UNIX, S3 files, SQL, MapReduce, ETL/ELT, Docker, REST API, Agile Scrum, OLAP (Online Analytical Processing).

Confidential

Data Engineer

Responsibilities:

  • Involved in all phases of the SDLC (Software Development Life Cycle), including requirement collection, design and analysis, development, and deployment of the application.
  • Architected and designed solutions for business requirements, created Visio diagrams for the design, and developed and deployed the application in various environments.
  • Developed Spark 2.1/2.4 Scala components to process the business logic and store the computation results of 10 TB of data in an HBase database, accessed by downstream web apps through the Big SQL DB2 database.
  • Uploaded and processed more than 10 terabytes of data from various structured and unstructured sources into HDFS using Sqoop and Flume.
  • Tested the developed modules of the application using the JUnit library and testing framework.
  • Analysed structured, unstructured, and file system data; loaded the data into HBase tables per project requirements using IBM Big SQL with the Sqoop mechanism; processed the data using Spark SQL in-memory computation; and wrote the results to Hive and HBase.
  • Handled importing other enterprise data from different data sources into HDFS using JDBC, loaded it into Big SQL on Hadoop, and performed the necessary transformations and actions on the fly with the Spark API to build the common learner data model, which receives data from upstream in near real time and persists it into HBase.
  • Worked with different Hive file formats such as text, sequence, ORC, Parquet, and Avro to analyse the data and build the data model, reading them from HDFS, processing them as Parquet files, and loading them into HBase tables.
  • Developed batch jobs using the Scala programming language to process data from files and tables, transform the data with the business logic, and deliver it to the user.
  • Worked on the continuous deployment module, used to create new tables or update existing table structures in different environments, along with DDL (Data Definition Language) creation for the tables.
  • Wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS/Blob storage.
  • Created pipelines in ADF (Azure Data Factory) using linked services, datasets, and pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and the write-back tool.
  • Developed JSON scripts for deploying the pipelines in Azure Data Factory (ADF) that process the data.
  • Loaded data from the Linux/Unix file system into HDFS and worked with PuTTY for communication between UNIX and Windows systems and for accessing data files in the Hadoop environment.
  • Developed and implemented HBase capabilities for large de-normalized data sets and then applied transformations on the de-normalized data using Spark/Scala.
  • Involved in Spark tuning to improve job performance based on Pepperdata monitoring tool metrics.
  • Worked on building application platforms in the cloud by leveraging Azure Databricks.
  • Developed shell scripts for configuration checks and file transformations to be performed before loading the data into the Hadoop landing area in HDFS.
  • Developed and implemented a custom Spark ETL component to extract data from upstream systems, push it to HDFS, and finally store it in HBase in a wide-row format.
  • Worked with the Apache Hadoop environment from Hortonworks.
  • Enhanced the application with new features and made performance improvements across all modules of the application.
  • Exposure to Microsoft Azure while moving on-prem data to the Azure cloud.
  • Explored and applied Spark techniques such as partitioning the data by keys and writing it to Parquet files, which improved performance (see the partitioned-write sketch after this list).
  • Understood the mapping documents and existing source data, prepared load strategies for different source systems, and implemented them using Hadoop technology.
  • Worked with continuous integration tools such as Maven, TeamCity, and IntelliJ, and scheduled jobs with the TWS (Tivoli Workload Scheduler) tool.
  • Created and cloned jobs and job streams in the TWS tool and promoted them to higher environments.
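
A PySpark equivalent of the key-partitioned Parquet write described above (the project itself used Spark/Scala); the input path, partition keys, and output path are assumptions.

```python
from pyspark.sql import SparkSession

# PySpark equivalent of the key-partitioned Parquet write described above
# (the project itself used Spark/Scala). Paths and partition keys are assumptions.
spark = SparkSession.builder.appName("partitioned-write-sketch").getOrCreate()

events = spark.read.orc("hdfs:///data/landing/events/")

# Repartition by the same keys used in partitionBy so each task writes few, larger files,
# then lay the data out as key-partitioned Parquet for faster, pruned downstream reads.
(
    events
    .repartition("region", "event_date")
    .write
    .mode("overwrite")
    .partitionBy("region", "event_date")
    .parquet("hdfs:///data/curated/events_parquet/")
)
```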

Environment: Scala, Java, Spark framework, Linux, Jira, Bitbucket, IBM Big SQL, Hive, HBase, Maven, DB2 Visualizer, ETL, Windows, Azure Data Factory.
