Senior Data Engineer Resume

Phoenix, AZ

SUMMARY

  • Around 7 years of professional Hadoop experience and 3+ years in AWS/Azure Data Engineering, Data Science, and Big Data implementation, utilizing PySpark for ingestion, storage, querying, processing, and analysis of big data.
  • Expertise in programming across multiple technologies, i.e., Python, Java, and SQL. Good understanding of data wrangling concepts using Pandas and NumPy.
  • Experience with the Azure Data Platform stack: Azure Data Lake, Data Factory, and Databricks.
  • Practical experience with AWS technologies such as EC2, Lambda, EBS, EKS, ELB, VPC, IAM, Route 53, Auto Scaling, Load Balancing, GuardDuty, AWS Shield, AWS Web Application Firewall (WAF), Network Access Control Lists (NACLs), S3, SES, SQS, SNS, AWS Glue, QuickSight, SageMaker, Kinesis, Redshift, RDS, DynamoDB, Datadog, and ElastiCache (Memcached & Redis).
  • Extracted metadata from Amazon Redshift and Elasticsearch using SQL queries to create reports.
  • Developed and implemented data solutions utilizing Azure services like Event Hub, Azure Data Factory, ADLS, Databricks, Azure Web Apps, and Azure SQL DB instances.
  • Working knowledge of AWS CI/CD services such as CodeCommit, CodeBuild, CodePipeline, and CodeDeploy, and of creating CloudFormation templates for infrastructure as code. Used Control Tower to create and administer our multi-account AWS infrastructure following best practices.
  • Implemented AWS Lambda functions to drive real-time monitoring dashboards from system logs.
  • Experienced in running Spark jobs on AWS EMR, selecting EMR cluster configurations and EC2 instance types based on requirements.
  • Developed PySpark scripts interacting with data sources such as AWS RDS, S3, and Kinesis and file formats such as ORC, Parquet, and Avro (a brief sketch follows this summary).
  • Experience with AWS Multi-Factor Authentication (MFA) for RDP/SSO logon; worked with teams to lock down security groups and build group-specific IAM profiles, using recently released APIs to restrict AWS resources by group or user.
  • Configured Jenkins CI/CD pipelines for various projects, including automated build triggers, promotion of builds between environments, code analysis, and automatic versioning.
  • Worked in a highly collaborative operations team to streamline the implementation of security in the Confidential Azure cloud environment and introduced best practices for remediation.
  • Hands-on experience with Azure Data Lake, Azure Data Factory, Azure Blob Storage, and Azure Storage Explorer.
  • Created Splunk dashboards for CloudWatch logs, monitored the whole environment using Glass Tables, and maintained regular alerts.
  • Experience using various Amazon Web Services (AWS) components such as EC2 for virtual servers, S3 and Glacier for object storage, and EBS, CloudFront, ElastiCache, and DynamoDB for data storage and delivery.
  • Experienced in building automated regression scripts in Python to validate ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
  • Well experienced in normalization and denormalization techniques for optimum performance in relational and dimensional database environments.
  • Experience in configuration management, setting up company versioning policies and build schedules using SVN and Git.
  • Good experience with use-case development and software methodologies such as Agile and Waterfall.
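
Below is a minimal, illustrative PySpark sketch of the kind of S3 ingestion script referenced in this summary. The bucket names, paths, and column names are hypothetical placeholders, not details from any specific engagement.

```python
# Illustrative PySpark ingestion sketch: read ORC from S3, clean, write partitioned Parquet.
# All paths and column names below are assumed examples.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-ingestion-example").getOrCreate()

# Read raw ORC data landed in S3 (bucket and prefix are placeholders)
raw_df = spark.read.orc("s3a://example-raw-bucket/orders/")

# Light cleanup: drop records without a timestamp and derive a partition column
clean_df = (
    raw_df
    .filter(F.col("order_timestamp").isNotNull())          # assumed column
    .withColumn("order_date", F.to_date(F.col("order_timestamp")))
)

# Write curated data back to S3 as date-partitioned Parquet
(
    clean_df.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3a://example-curated-bucket/orders/")
)
```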

TECHNICAL SKILLS

Cloud: AWS, Azure (Azure Databricks)

Testing tools: Selenium 2.0, HP QTP 11.0, SoapUI

Databases: Oracle, SQL Server, DB2, MySQL, pgAdmin, Redshift, Cosmos DB

Languages & Frameworks: Apache Spark, Python, SQL, PL/SQL, HTML, DHTML, UML

Version control: SVN, Git

Automation Tools: Jenkins, Azure DevOps, CodePipeline

Scripting languages: Python, Shell scripting, PowerShell scripting, YAML, JSON

Agile Tool: JIRA

Infrastructure as Code: CloudFormation, Terraform

PROFESSIONAL EXPERIENCE

Confidential, Phoenix, AZ

Senior Data Engineer

Responsibilities:

  • Implemented AWS Lambda functions to drive a real-time monitoring dashboard of Kinesis streams.
  • Involved in data warehouse design, data integration, and data transformation using Apache Spark and Python.
  • Created and set up EMR clusters for running data engineering workloads and supporting data scientists.
  • Experience in data warehouse modelling techniques such as Kimball dimensional modelling.
  • Experience in conceptual, logical, and physical data modelling.
  • Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Utilized Azure Synapse and Azure Databricks to create data pipelines in Azure.
  • Developed and implemented data solutions utilizing Azure services like Event Hub, Azure Data Factory, ADLS, Databricks, Azure Web Apps, and Azure SQL DB instances.
  • Involved in setting up automated jobs and deploying machine learning models using Azure DevOps pipelines.
  • Involved in the design and deployment of a multitude of cloud services on the AWS stack, such as Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM, focusing on high availability, fault tolerance, and auto-scaling via AWS CloudFormation.
  • Worked with Athena, AWS Glue, and QuickSight for querying and visualization purposes.
  • Created data pipelines using Data Factory and Databricks for ETL processing.
  • Retrieved data from DBFS into Spark DataFrames for running predictive analytics.
  • Used HiveContext, which provides a superset of the functionality of SQLContext, and preferred writing queries with the HiveQL parser to read data from Hive tables.
  • Modelled Hive partitions extensively for data separation and faster processing, and followed Hive best practices for tuning.
  • Developed Spark scripts in Python, writing custom RDD transformations and performing actions on RDDs.
  • Experience in exploratory data analysis (EDA), feature engineering, and data visualisation.
  • Cached RDDs for better performance when performing repeated actions on them.
  • Developed complex yet maintainable Python code that satisfies application requirements for data processing and analytics using built-in libraries.
  • Involved in designing and optimizing Spark SQL queries and DataFrames: importing data from sources, performing transformations and read/write operations, and saving results to output directories in HDFS.
  • Worked with the Kafka REST API to collect and load data onto the Hadoop file system, and used Sqoop to load data from relational databases.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS (see the sketch after this list).
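
The following is a hedged sketch of the Kafka-to-Parquet flow described in the last bullet. The bullet describes Spark Streaming with RDDs; for brevity this sketch uses Spark Structured Streaming instead, and the broker address, topic name, event schema, and HDFS paths are assumed placeholders.

```python
# Assumed sketch: consume a Kafka topic with Structured Streaming and persist it as Parquet on HDFS.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-parquet-example").getOrCreate()

# Placeholder schema for the JSON payload carried in each Kafka message
event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("status", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the real-time feed from Kafka (broker and topic are placeholders)
raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "alarms")
    .load()
)

# Kafka values arrive as bytes; parse the JSON payload into typed columns
events = (
    raw_stream
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Persist the stream as Parquet files on HDFS with checkpointing (paths are placeholders)
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/alarms/parquet")
    .option("checkpointLocation", "hdfs:///checkpoints/alarms")
    .start()
)
query.awaitTermination()
```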

Environment: PySpark, Hive, Sqoop, Kafka, Python, Spark Streaming, DBFS, SQLContext, Spark RDD, REST API, Spark SQL, Hadoop, Parquet files, Oracle, SQL Server.

Confidential, Atlanta, GA

Data Engineer

Responsibilities:

  • Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Worked on developing PySpark scripts to protect raw data by applying hashing algorithms to client-specified columns (see the sketch after this list).
  • Responsible for the design, development, and testing of the database, and developed stored procedures and views.
  • Developed a Python-based API (RESTful web service) to track revenue and perform revenue analysis.
  • Compiled and validated data from all departments and presented it to the Director of Operations.
  • Built a KPI (Key Performance Indicator) calculator sheet and maintained it within SharePoint.
  • Created reports with complex calculations, designed dashboards for analysing POS data, developed visualizations, and worked on ad-hoc reporting using Tableau.
  • Created a data model that correlates all the metrics and yields valuable output.
  • Designed Spark-based real-time data ingestion and analytics; created a Kafka producer in Python to synthesize alarms; used Spark SQL to load JSON data, create SchemaRDDs, and load them into Hive tables; and handled structured data using Spark SQL.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
  • Developed data pipelines using Spark, Hive, Pig, and Python to ingest customer data.
  • Loaded and transformed large sets of structured, semi-structured, and unstructured data using Hadoop/Big Data concepts; developed Hive and MapReduce tools to design and manage HDFS data blocks and data distribution methods.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Responsible for building scalable distributed data solutions in an Amazon EMR cluster environment.
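
A brief illustrative sketch of hashing client-specified columns with PySpark, as mentioned above. The column names and storage paths are hypothetical, and SHA-256 via pyspark.sql.functions.sha2 is assumed as one possible hashing choice; the actual columns and algorithm were client-specific.

```python
# Assumed sketch: replace sensitive columns with SHA-256 digests before writing curated data.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("column-hashing-example").getOrCreate()

sensitive_columns = ["ssn", "email"]   # placeholder client-specified columns

# Placeholder input path
df = spark.read.parquet("s3a://example-bucket/raw/customers/")

# Replace each sensitive column with its SHA-256 digest
for col_name in sensitive_columns:
    df = df.withColumn(col_name, F.sha2(F.col(col_name).cast("string"), 256))

# Placeholder output path
df.write.mode("overwrite").parquet("s3a://example-bucket/masked/customers/")
```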

Environment: Spark SQL, PySpark, SQL, RESTful Web Service, Tableau, Kafka, JSON, Hive, Pig, Hadoop, HDFS, MapReduce, S3, Redshift, AWS Data Pipeline, Amazon EMR.

Confidential

Data Engineer

Responsibilities:

  • Defined data contracts and specifications, including REST APIs.
  • Worked on relational database modelling concepts in SQL and performed query performance tuning.
  • Worked on Hive Metastore backups and on partitioning and bucketing techniques in Hive to improve performance, and tuned Spark jobs.
  • Responsible for building and running resilient data pipelines in production; implemented ETL/ELT to load a multi-terabyte enterprise data warehouse.
  • Worked closely with the data science team to understand requirements clearly and created Hive tables on HDFS.
  • Developed Spark scripts using Python as per requirements.
  • Solved performance issues in Spark through an understanding of grouping, joins, and aggregation.
  • Scheduled Spark jobs in the Hadoop cluster and generated detailed design documentation for source-to-target transformations.
  • Experience using EMR clusters and various EC2 instance types based on requirements.
  • Responsible for loading data from UNIX file systems to HDFS; installed and configured Hive and wrote Hive UDFs.
  • Responsible for creating on-demand tables on S3 files using Lambda functions written in Python and PySpark (see the sketch after this list).
  • Designed and developed MapReduce programs on historical flight data to analyse and evaluate multiple solutions, considering business cost factors as well as operational impact.
  • Created an end-to-end ETL pipeline in PySpark to process data for business dashboards.
  • Developed Spark programs using Python APIs to compare the performance of Spark with Hive and SQL, and generated reports on a daily and monthly basis.
  • Developed dataflows and processes for data processing using SQL (Spark SQL and DataFrames).
  • Understood business requirements, prepared design documents, and handled coding, testing, and go-live in the production environment.
  • Implemented analytics applications using multiple database technologies, such as relational, multidimensional (OLAP), key-value, document, and graph stores.
  • Built cloud-native applications using supporting technologies and practices including AWS, Docker, CI/CD, and microservices.
  • Involved in the iteration planning process under the Agile Scrum methodology.
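
A hedged sketch of an on-demand table registration Lambda, as mentioned above. The resume does not specify the mechanism, so this example assumes Athena DDL issued through boto3 as one plausible approach; the database, table, bucket, and result locations are placeholders.

```python
# Assumed sketch: a Lambda handler that registers an external table over newly arrived S3 files
# by running a CREATE EXTERNAL TABLE statement through Athena.
import boto3

athena = boto3.client("athena")

# Placeholder DDL: table, columns, and S3 location are illustrative only
CREATE_TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.daily_events (
    event_id string,
    event_type string,
    event_time timestamp
)
STORED AS PARQUET
LOCATION 's3://example-bucket/daily-events/'
"""

def lambda_handler(event, context):
    # Triggered (for example) by an S3 put event; register the table on demand
    response = athena.start_query_execution(
        QueryString=CREATE_TABLE_DDL,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
    return {"query_execution_id": response["QueryExecutionId"]}
```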

Environment: Hive, PySpark, HDFS, Python, EMR, EC2, UNIX, S3 files, SQL, MapReduce, ETL/ELT, Docker, REST API, Agile Scrum, OLAP (Online Analytical Processing).
