
Data Engineer Resume


SUMMARY

  • IT professional with 7+ years of experience and expertise in the Big Data ecosystem - Data Acquisition, Ingestion, Modeling, Storage, Analysis, Integration, and Data Processing.
  • Hands-on experience with Azure services (PaaS & IaaS): Azure Synapse Analytics, SQL Azure, Azure Data Lake, Azure Data Factory, Azure Analysis Services, Azure Databricks, Azure Monitor, and Azure Key Vault.
  • Experience in controlling and granting database access and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Extensive experience extracting and loading data from relational databases such as Teradata, Oracle, and DB2 into Azure Data Lake Storage via Azure Data Factory.
  • Extensive knowledge of data analysis, T-SQL queries, ETL processes, Reporting Services (SSRS, Power BI), and Analysis Services using SQL Server 2017 SSIS, SSRS, SSAS, and SQL Server Agent.
  • Hands-on experience with Amazon Web Services (AWS) cloud services such as EC2, VPC, S3, IAM, EBS, RDS, ELB, Route 53, OpsWorks, DynamoDB, Auto Scaling, CloudFront, CloudTrail, CloudWatch, CloudFormation, Elastic Beanstalk, SNS, SQS, SES, SWF, and AWS Direct Connect.
  • Strong working experience with SQL and NoSQL databases (Cosmos DB, MongoDB, HBase, Cassandra), data modeling, tuning, disaster recovery, backup, and creating data pipelines.
  • Worked on distributed frameworks such as Apache Spark and Presto on Amazon EMR and Redshift, interacting with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
  • Experience building data pipelines in Python/PySpark/HiveQL/Presto/BigQuery and authoring Python DAGs in Apache Airflow (a minimal DAG sketch follows this summary).
  • Experience working with Python libraries such as NumPy, pandas, and Boto3.
  • Skilled in system analysis, E-R/dimensional data modeling, database design, and implementing RDBMS-specific features.
  • Experienced in developing production-ready Spark applications using Spark components such as Spark SQL, MLlib, GraphX, DataFrames, Datasets, Spark ML, and Spark Streaming.
  • Expertise in deploying Kubernetes clusters in cloud environments using CloudFormation templates and PowerShell scripting.
  • Experience installing, configuring, and maintaining Apache Hadoop clusters for application development, along with Hadoop tools such as Sqoop, Hive, Pig, HBase, Kafka, Hue, Oozie, and Spark with Scala and Python.
  • Experience in designing and developing applications in Spark using Python to compare the performance of Spark with Hive.
  • Good working experience with Hive and HBase/MapR-DB integration.
  • Experience converting Hive/SQL queries into Spark transformations using Spark DataFrames and Python.
  • Hands-on experience building reports and dashboards using Power BI and Tableau.
  • Experienced in developing shell scripts and Python scripts to automate Spark jobs and Hive scripts.
  • Excellent communication, interpersonal, and organizational skills, as well as the ability to manage multiple projects. Always eager to learn and adopt new technologies.
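
Illustrative of the Airflow work above, a minimal sketch of a Python DAG follows; the DAG id, schedule, and extract/load callables are hypothetical placeholders rather than code from a specific project.

    # Minimal Airflow 2.x DAG sketch; names and schedule are illustrative.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract_to_staging(**context):
        """Hypothetical extract step: pull the day's partition into staging."""
        print(f"extracting partition for {context['ds']}")


    def load_to_warehouse(**context):
        """Hypothetical load step: publish the staged data to the warehouse."""
        print(f"loading partition for {context['ds']}")


    with DAG(
        dag_id="daily_ingest_pipeline",        # placeholder name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        extract = PythonOperator(task_id="extract_to_staging",
                                 python_callable=extract_to_staging)
        load = PythonOperator(task_id="load_to_warehouse",
                              python_callable=load_to_warehouse)

        extract >> load                        # simple linear dependency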

TECHNICAL SKILLS

Hadoop/Big Data Technologies: Hadoop, MapReduce, Oozie, Hive, Sqoop, Spark, NiFi, ZooKeeper, Cloudera Manager, Airflow.

NoSQL Databases: HBase, DynamoDB.

Monitoring and Reporting: Power BI, Tableau, custom shell scripts.

Hadoop Distribution: Hortonworks, Cloudera.

Database Connectivity: JDBC.

Build Tools: Maven

Programming & Scripting: Python, Scala, SQL, Shell Scripting.

Databases: Oracle, MySQL, Teradata

Version Control: Git, Bitbucket

IDE Tools: Eclipse, Jupyter, PyCharm.

Operating Systems: Linux, Unix, Ubuntu, CentOS, Windows

Cloud: AWS, Azure.

Containerization: Docker

Development methods: Agile, Waterfall

PROFESSIONAL EXPERIENCE

Confidential

Data Engineer

Responsibilities:

  • Developed design patterns for feeding data into the data lake from a variety of sources and standardizing it to enable enterprise-level benchmarking and comparison.
  • Implemented data standards and maintained data quality, master data management, and knowledge of data sources.
  • Developed Databricks notebooks to extract data from source systems such as DB2 and Teradata and to perform data cleansing, wrangling, and ETL processing before loading to Azure SQL DB.
  • Carried out ETL operations in Azure Data Factory by connecting to various relational database source systems via JDBC connectors.
  • Configured data pipelines using Azure Data Factory and built a custom alerts platform for monitoring. Ingested data in mini-batches and performed RDD transformations using Spark Streaming analytics in Azure Databricks (streaming sketch after this list).
  • Created custom alert queries in Log Analytics and automated the alerts using webhook actions.
  • Developed Spark applications in Python using PySpark.
  • Used Spark SQL for data extraction, transformation, and aggregation across multiple file formats.
  • Validated Databricks output by developing Python scripts and automated the process using ADF.
  • Analyzed existing SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Used PySpark to read and write data formats such as JSON, Delta, and Parquet from different sources (format I/O sketch after this list).
  • Developed Spark applications in Python (PySpark) on a distributed platform to load large CSV files with varying schemas into PySpark DataFrames and process them for reloading into Azure SQL DB tables.
  • Monitored SQL scripts with PySpark SQL and modified them for improved performance.
  • Developed Spark code in Python 3 (PySpark/Spark SQL) for faster testing and processing of data.
  • Managed secrets through Azure Key Vault and configured applications to retrieve them through an authenticated API (Key Vault sketch after this list).
  • Built Power BI reports and developed SQL queries with stored procedures, common table expressions (CTEs), and temporary tables.
  • Deployed Azure IaaS (Infrastructure as a Service) virtual machines (VMs) and cloud services (PaaS role instances) into secure VNets and subnets.
  • Used SSIS (SQL Server Integration Services) to create multidimensional cubes.
  • Managed development of ETL processes independently.
  • Built ETL data pipelines to move data from Blob Storage to Azure Data Lake Storage Gen2 using Azure Data Factory (ADF).
  • Designed and developed user interfaces and customized reports using Tableau.
  • Designed cubes for data visualization, mobile/web presentation with parameterization and cascading.
  • Involved in the creation of CI/CD pipelines.
  • Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Built mappings and sessions based on business user requirements and business rules to load data from source flat files and RDBMS tables into target tables.
  • Processed structured and semi-structured data into clusters using Spark SQL and the DataFrame API.
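
The streaming sketch referenced in the list above: minimal micro-batch ingestion with Spark Structured Streaming (shown here in place of the legacy DStream/RDD API); the paths, schema, and trigger interval are assumed placeholders.

    # Micro-batch ingestion sketch; source/sink paths and schema are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField, StringType,
                                   DoubleType, TimestampType)

    spark = SparkSession.builder.appName("mini-batch-ingest").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Read newly arriving JSON files from a landing folder as a stream.
    events = spark.readStream.schema(schema).json("/mnt/landing/events/")

    # Light cleanup per micro-batch before appending to a Delta sink
    # (Delta assumes a Databricks cluster or the delta-spark package).
    cleaned = (events.dropna(subset=["event_id"])
                     .withColumnRenamed("amount", "amount_usd"))

    query = (cleaned.writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/events")  # placeholder
             .trigger(processingTime="1 minute")                       # mini-batch cadence
             .start("/mnt/curated/events"))                            # placeholder sink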
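
The format I/O sketch referenced above: reading and writing JSON, Parquet, and Delta with PySpark. All paths are placeholders; Delta support assumes the Delta Lake libraries are available on the cluster, and unionByName with allowMissingColumns requires Spark 3.1+.

    # Read JSON and Parquet, combine, and write Parquet and Delta; paths are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-io").getOrCreate()

    json_df = spark.read.json("/mnt/raw/orders_json/")
    parquet_df = spark.read.parquet("/mnt/raw/orders_parquet/")

    # Align columns by name, tolerating schema drift between the two sources.
    combined = json_df.unionByName(parquet_df, allowMissingColumns=True)

    combined.write.mode("overwrite").parquet("/mnt/stage/orders_parquet/")
    combined.write.format("delta").mode("overwrite").save("/mnt/curated/orders_delta/")

    delta_df = spark.read.format("delta").load("/mnt/curated/orders_delta/")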
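
The Key Vault sketch referenced above: retrieving a secret with the Azure SDK for Python (azure-identity and azure-keyvault-secrets). The vault URL and secret name are hypothetical.

    # Fetch a secret from Azure Key Vault using managed identity / environment credentials.
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    vault_url = "https://my-vault.vault.azure.net/"   # placeholder vault URL
    credential = DefaultAzureCredential()
    client = SecretClient(vault_url=vault_url, credential=credential)

    jdbc_password = client.get_secret("sql-db-password").value   # placeholder secret name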

Confidential

Data Engineer

Responsibilities:

  • Extensively worked on AWS S3 data transfer and used AWS Redshift for cloud data storage.
  • Handled data extraction and data ingestion from different data sources into S3 by creating ETL pipelines using Spark.
  • Extensively worked with PySpark/Spark SQL for data cleansing and for generating DataFrames and RDDs.
  • Worked on AWS EMR clusters for processing Big Data across a Hadoop cluster of virtual servers.
  • Developed Spark Programs for Batch Processing.
  • Experienced in working with Spark SQL on different file formats like XML, JSON and Parquet.
  • Involved in design and analysis of the issues and providing solutions and workarounds to the users and end-clients.
  • Handled formats such as SequenceFiles, XML files, and MapFiles using MapReduce programs.
  • Transformed batch data from different sources using the PySpark APIs.
  • Used Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2), with Amazon Simple Storage Service (S3) and AWS Redshift for storage.
  • Designed and built data processing applications using Spark on an AWS EMR cluster that consume data from AWS S3 buckets, apply the necessary transformations, and store the curated, business-ready datasets in the Snowflake analytical environment (EMR job sketch after this list).
  • Experienced in maintaining a Hadoop cluster on AWS EMR.
  • Used Spark to build tables that require multiple computations and non-equi joins.
  • Modeled Hive partitions extensively for faster data processing.
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Implemented various UDFs in Python as required (UDF sketch after this list).
  • Used Bitbucket to collaborate with other team members.
  • Proficient with container systems such as Docker, container orchestration with Amazon ECS (EC2 Container Service), and infrastructure provisioning with Terraform.
  • Created a data pipeline using processor groups and numerous processors in Apache NiFi for flat-file and RDBMS sources as part of a proof of concept (POC) on Amazon EC2.
  • Extensively used AWS Athena to query structured data in S3 and load it into other systems such as Redshift or to generate reports.
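
The EMR job sketch referenced above: the general shape of a Spark job that reads raw data from S3, applies transformations, and writes a curated dataset back out. Bucket names and columns are placeholders, and the final hand-off to Snowflake (via the Snowflake Spark connector) is omitted.

    # Read raw Parquet from S3, curate, and write a partitioned dataset back to S3.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("s3-curation-job").getOrCreate()

    raw = spark.read.parquet("s3://example-raw-bucket/sales/")        # placeholder bucket

    curated = (raw.filter(F.col("order_status") == "COMPLETED")
                  .withColumn("order_date", F.to_date("order_ts"))
                  .groupBy("order_date", "region")
                  .agg(F.sum("amount").alias("daily_revenue")))

    (curated.write
            .mode("overwrite")
            .partitionBy("order_date")
            .parquet("s3://example-curated-bucket/sales_daily/"))     # placeholder bucket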
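
The UDF sketch referenced above: a Python UDF defined and applied in PySpark; the column and normalization rule are illustrative only.

    # Define a Python UDF and apply it to a DataFrame column.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("udf-example").getOrCreate()

    @F.udf(returnType=StringType())
    def normalize_phone(raw):
        """Keep digits only and pad to a 10-digit string (hypothetical rule)."""
        if raw is None:
            return None
        digits = "".join(ch for ch in raw if ch.isdigit())
        return digits[-10:].zfill(10)

    df = spark.createDataFrame([("(555) 123-4567",), ("555.765.4321",)], ["phone"])
    df.withColumn("phone_normalized", normalize_phone(F.col("phone"))).show()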

Confidential

Data Engineer

Responsibilities:

  • Involved in the entire implementation lifecycle, with a focus on custom MapReduce and Hive code.
  • Performed data transformations in Hive, using partitions and buckets to improve performance.
  • Extensively used Hive queries to search for strings in Hive tables stored in HDFS.
  • Experienced in handling HDFS, Job Tracker, Task Tracker, Name Node, Data Node, YARN, Spark and Map Reduce programming.
  • Configured and monitored resource utilization across the cluster using Cloudera Manager, Search, and Navigator.
  • Created external Hive tables for consumption and stored data in HDFS using the ORC, Parquet, and Avro file formats (Hive DDL sketch after this list).
  • Created ETL pipelines using PySpark with the Spark SQL and DataFrame APIs.
  • Analyzed Hadoop clusters and various Big Data analytic tools such as Pig, Hive, HBase, Spark, and Sqoop.
  • Used Sqoop to load data into the cluster from dynamically generated files and relational database management systems.
  • Implemented partitioning, dynamic partitions, and bucketing in Hive.
  • Developed HQL queries, Mappings, tables, and external tables in Hive for analysis across multiple banners, as well as worked on partitioning, optimization, compilation, and execution.
  • Used Cloudera Manager to continuously monitor and manage the Hadoop cluster.
  • Built mappings with reusable components such as worklets and mapplets, as well as other transformations.
  • Automated data movement between different components by using Apache NiFi.
  • Loaded data from multiple data sources (SQL, DB2, and Oracle) into HDFS using Sqoop and stored it in Hive tables.
  • Extracted data from Teradata into HDFS using Sqoop.
  • Used Sqoop to export the analyzed results back to Teradata.
  • Developed SSIS packages to ETL data into a heterogeneous data warehouse.
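
The Hive DDL sketch referenced above: creating a partitioned external ORC table and loading it with dynamic partitions through Spark SQL with Hive support. The database, table, columns, and paths are placeholders.

    # Create and load a partitioned external Hive table via Spark SQL.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-partitioning")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.transactions (
            txn_id STRING,
            account_id STRING,
            amount DOUBLE
        )
        PARTITIONED BY (txn_date STRING)
        STORED AS ORC
        LOCATION '/data/warehouse/transactions'
    """)

    # Allow dynamic partitioning, then insert from a (placeholder) staging table.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT OVERWRITE TABLE analytics.transactions PARTITION (txn_date)
        SELECT txn_id, account_id, amount, txn_date
        FROM analytics.transactions_staging
    """)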

Confidential

Data Analyst

Responsibilities:

  • Understood requirements and analyzed systems and source databases.
  • Created basic calculations including string manipulation, arithmetic, custom aggregations and ratios, date math, logic statements, and quick table calculations.
  • Represented data using cross tabs, scatter plots, geographic maps, pie charts, bar charts, heat maps, and tree maps.
  • Built advanced chart types and visualizations such as bar-in-bar charts, bullet graphs, box-and-whisker plots, donut charts, waterfall charts, and combo charts.
  • Used groups, bins, hierarchies, sorts, sets, actions, and filters to create focused and effective visualizations.
  • Performed unit testing of reports.
  • Built visualizations using the data engine and extracts, and managed their data connections.
  • Extensively worked on creating views and tables with MS SQL.
  • Organized and validated data before sending it to the client.

Confidential

Software Engineer

Responsibilities:

  • Analyzed, implemented, and solved research problems using various numerical and stochastic methods in C++ and Python modules for computer experiment design and analysis.
  • Designed and developed user interfaces using HTML, AJAX, CSS, and JavaScript.
  • Designed and developed a data management system using MySQL.
  • Created class diagrams/sequence diagrams using UML and Rational Rose.
  • Modified and executed Python/Django modules to change data formats.
  • Hands-on experience accessing database objects through the Django database API.
  • Wrote Python scripts to parse XML documents and load the data into the database (loader sketch after this list).
  • Handled all the client-side validation using JavaScript.
  • Expertise in writing constraints, indexes, views, stored procedures, cursors, triggers, and user-defined functions.
  • Worked with SQL and stored procedure development on MySQL and SQLite.
  • Participated in the creation of SOAP Web Services for transmitting and receiving data in XML format from an external interface.
  • Participated in the optimization of object-oriented Python code for monitoring, quality, debugging, and logging.
  • Tested frontend and backend modules using the Django web framework.
  • Used NumPy for numerical analysis and Matplotlib from the SciPy stack for data analysis and plotting.
  • Constructed the pipelines, ran the tests in Jenkins, and deployed the application to AWS.
  • Used JIRA to track the Agile/Scrum process and development status.
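
The loader sketch referenced above: parsing an XML document and inserting the records into a database. The XML layout and table are hypothetical, and sqlite3 stands in here for the project's MySQL connection.

    # Parse an XML file and bulk-load its records into a local database.
    import sqlite3
    import xml.etree.ElementTree as ET

    conn = sqlite3.connect("app.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (id TEXT PRIMARY KEY, name TEXT, email TEXT)"
    )

    tree = ET.parse("customers.xml")        # placeholder input file
    rows = [
        (node.get("id"), node.findtext("name"), node.findtext("email"))
        for node in tree.getroot().findall("customer")
    ]

    conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()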
