
Data Engineer Resume


SUMMARY

  • Data Engineer with a master's degree in Business Analytics and 5+ years of experience in predictive analytics, with strengths in data analysis, data visualization, statistical modeling, and regression analysis.
  • Functioned as a Data Engineer responsible for data modeling, data migration, design, and preparing ETL pipelines for both the cloud and Exadata.
  • Good knowledge of Apache Spark components, including Spark Core, Spark SQL, Spark Streaming, and Spark MLlib.
  • Developed ETL scripts for data acquisition and transformation using Informatica and Talend.
  • Deployed and tested developed code through CI/CD using Visual Studio Team Services (VSTS).
  • Validated data fields from downstream sources to ensure uniformity of data.
  • Wrote Spark applications for data validation, cleansing, transformation, and aggregation (see the sketch after this list).
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including a write-back tool, in both directions.
  • Strong experience using Spark Streaming, Spark SQL, and other Spark features such as accumulators, broadcast variables, different levels of caching, and optimization techniques for Spark jobs.
  • Good experience working with the AWS big data/Hadoop ecosystem in the implementation of data lakes.
  • Strong hands-on experience with AWS services including EMR, S3, EC2, Route 53, RDS, ELB, DynamoDB, Glue, SNS, SQS, and CloudFormation, as well as hands-on experience with Redshift Spectrum and the Amazon Athena query service for reading data from S3.
  • Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to and from different source systems, including flat files.
  • Worked on application/platform consolidation and re-hosting, legacy conversion/retirement, and ETL data pipeline development for operational data stores and the analytical warehouse.
  • Responsible for writing unit tests and deploying production-level code with Git version control.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the Cosmos activity.
  • Worked on Azure Data Factory and Azure Databricks as part of the EDS transformation.
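
As a minimal illustration of the Spark validation/cleansing/aggregation work referenced above, the PySpark sketch below shows one way such a flow could look. The bucket paths and column names (order_id, order_amount, region, order_ts) are hypothetical placeholders, not taken from any actual project.

```python
# Minimal PySpark sketch of a validate -> cleanse -> transform -> aggregate flow.
# All paths and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("order-cleansing").getOrCreate()

# Read raw input (path is illustrative only).
raw = spark.read.option("header", True).csv("s3://example-bucket/landing/orders_raw/")

# Validation: keep rows with a non-null key and a parsable amount.
validated = (
    raw.filter(F.col("order_id").isNotNull())
       .withColumn("order_amount", F.col("order_amount").cast("double"))
       .filter(F.col("order_amount").isNotNull())
)

# Cleansing/standardization: trim strings and normalize region codes.
cleansed = validated.withColumn("region", F.upper(F.trim(F.col("region"))))

# Aggregation: daily totals per region for downstream reporting.
daily_totals = (
    cleansed.groupBy("region", F.to_date("order_ts").alias("order_date"))
            .agg(F.sum("order_amount").alias("total_amount"),
                 F.count("*").alias("order_count"))
)

daily_totals.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_totals/")
```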

TECHNICAL SKILLS

Programming: Python, R, SQL

Tools: Python environment, RStudio, Excel, SAS, Weka, Rattle, IBM SPSS

Data Visualization Tools: Tableau, Power BI, Clickstream

Database: Relational (MySQL), MongoDB, Cassandra

Python Libraries: pandas, NumPy, scikit-learn, SciPy, Matplotlib, Seaborn

Project & Version Control Tools: Jira, Git

Big Data: Hadoop, Spark

Cloud: Amazon EC2, Amazon EMR, AWS Lambda, AWS Glue, Amazon S3, Amazon Athena, Azure Data Lake, Azure Data Factory, Azure Databricks, Azure SQL Database, Azure SQL Data Warehouse

Machine Learning Algorithms: Regression Methods, Supervised and Unsupervised machine learning, Decision Tree, SVM, K-Nearest Neighbors

Databases: Snowflake (cloud), Teradata, IBM DB2, Oracle, SQL Server, MySQL; NoSQL: HBase, Cassandra, MongoDB, DynamoDB; archival storage: Amazon S3 Glacier

Statistical Analysis: Hypothesis Testing, Statistical Modeling, Quantitative and Qualitative Analysis, Statistical Computing Methods, Regression Analysis

PROFESSIONAL EXPERIENCE

Data Engineer

Confidential

Responsibilities:

  • Developed Spark RDD transformations, actions, DataFrames, case classes, and Datasets for the required input data and performed the data transformations using Spark Core.
  • Created data pipelines for the Kafka cluster and processed the data using Spark Streaming; consumed data from Kafka topics and loaded it into the landing area for near-real-time reporting.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala, and used Sqoop to import and export data between RDBMS and HDFS.
  • Documented logical data integration (ETL) strategies for data flows between disparate source/target systems, bringing structured and unstructured data into a common data lake and the enterprise information repositories.
  • Migrated an in-house database to the AWS cloud and designed, built, and deployed a multitude of applications utilizing the AWS stack (including S3, EC2, RDS, Redshift, and Athena), focusing on high availability and auto-scaling.
  • Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for analysis, and used Kafka Streams with Spark Streaming to pull information and store it in HDFS.
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster, as well as managing clusters on the Databricks platform.
  • Involved in designing data warehouses and data lakes on relational (Oracle, SQL Server), high-performance (Netezza, Teradata), and big data (Hadoop: MongoDB, Hive, Cassandra, and HBase) databases.
  • Responsible for creating on-demand tables on S3 files using Lambda Functions and AWS Glue using Python and PySpark.
  • Parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark, and created Hive DDL on Parquet and Avro data files residing in both HDFS and S3 buckets (see the sketch after this list).
  • Developed workflows in Oozie to automate loading data into HDFS and pre-processing, analyzing, and training the classifier using MapReduce, Pig, and Hive jobs.
  • Worked on Spark Streaming jobs that collect data from Kafka in near real time, perform the necessary transformations and aggregations on the fly to build the common learner data model, and persist the data in Cassandra.
  • Created an AWS Glue job for archiving data from Redshift tables to S3 (online to cold storage) per data retention requirements.
  • Involved in data modeling using ER Studio: identified objects, relationships, and how they fit together as logical entities, then translated them into a physical design using ER Studio's forward-engineering capability.
  • Updated Python scripts to match training data with our database stored in Amazon CloudSearch so that each document could be assigned a response label for further classification.
  • Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch, and used AWS Glue for data transformation, validation, and cleansing.
  • Worked with cloud-based technologies such as Redshift, S3, and EC2; extracted data from Oracle Financials and the Redshift database; and created Glue jobs in AWS to load incremental data into the S3 staging and persistence areas.
  • Handled importing data from various data sources, performed transformations using Hive and MapReduce, and loaded data into HDFS.
  • Deployed applications using Jenkins, integrating Git version control with it.
  • Used the Agile Scrum methodology across the phases of the software development life cycle.
  • Improved the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and YARN.
  • Scheduled Airflow DAGs to run multiple Hive and Pig jobs that run independently based on time and data availability, and performed exploratory data analysis and data visualization using Python and Tableau.
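
A minimal PySpark sketch of the JSON-to-Parquet step referenced above: parse semi-structured JSON, write partitioned Parquet, and register an external Hive table over the output. The paths, database name (analytics), and columns are hypothetical placeholders.

```python
# Minimal sketch: semi-structured JSON -> partitioned Parquet -> external Hive table.
# All names below are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("json-to-parquet")
         .enableHiveSupport()
         .getOrCreate())

# Spark infers the schema of the semi-structured JSON; multiLine handles pretty-printed records.
events = spark.read.option("multiLine", True).json("s3://example-bucket/raw/events/")

# Flatten a nested struct and derive a partition column.
flattened = (events
             .withColumn("user_id", F.col("payload.user_id"))
             .withColumn("event_date", F.to_date("event_ts"))
             .drop("payload"))

# Write partitioned Parquet back to S3 (an HDFS path works the same way).
output_path = "s3://example-bucket/curated/events_parquet/"
flattened.write.mode("overwrite").partitionBy("event_date").parquet(output_path)

# Hive DDL for an external table over the Parquet files.
spark.sql(f"""
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events_parquet (
    event_id STRING,
    user_id  STRING,
    event_ts TIMESTAMP
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION '{output_path}'
""")
spark.sql("MSCK REPAIR TABLE analytics.events_parquet")
```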

Environment: Hadoop/Big Data ecosystem (Spark, Kafka, Hive, HDFS, Sqoop, Oozie, Cassandra, MongoDB), AWS (S3, AWS Glue, Redshift, RDS, Lambda, Athena, SNS, SQS, CloudFormation), Oracle, Jenkins, Docker, Git, SQL Server, SQL, Java, PostgreSQL, Python, PySpark, Teradata, Tableau, QuickSight, ER Studio, data warehousing, Databricks

Data Engineer

Confidential

Responsibilities:

  • Involved in data collection and performed various data assessment operations such as data cleansing, scrubbing, standardization, and profiling, and maintained data governance.
  • Coordinated efforts between data owners, business analysts, and other relevant parties to facilitate quick resolution of misunderstandings and clarifications related to data in the source systems.
  • Used various DML and DDL commands (SELECT, INSERT, UPDATE, subqueries, inner joins, outer joins, UNION, and other advanced SQL) for data retrieval and manipulation.
  • Optimized data assets to drive customer insight and business recommendations
  • Built cloud data systems and pipelines
  • Used Informatica PowerCenter 9.6.1 to extract, transform, and load data into the Netezza data warehouse from various sources such as Oracle and flat files.
  • Involved in migrating mappings from IDQ to PowerCenter.
  • Wrote, tested, and implemented Teradata FastLoad, MultiLoad, and BTEQ scripts, along with DML and DDL.
  • Involved in logical and physical database design and development, normalization, and data modeling using Erwin and SQL Server Enterprise Manager.
  • Designed, developed, and implemented ETL pipelines using the Python API of Apache Spark (PySpark) on AWS EMR.
  • Migrated an existing on-premises application to AWS; used AWS services such as EC2 and S3 for dataset processing and storage, and configured and maintained clusters on AWS EMR.
  • Built an on-demand, secure EMR launcher with custom spark-submit steps using S3 events, SNS, KMS, and a Lambda function (see the sketch after this list).
  • Loaded data into Spark RDDs and used in-memory computation to generate the output response.
  • Developed Impala queries to pre-process the data required for running the business process.
  • Actively involved in design analysis, coding and strategy development.
  • Developed Hive scripts for implementing dynamic partitions and buckets for history data.
  • Developed Spark scripts in Scala, per requirements, to read and write JSON files.
  • Involved in converting SQL queries into Spark transformations using Spark RDDs and Scala.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Worked on creating data ingestion pipelines to ingest large volumes of streaming and customer application data into Hadoop in various file formats such as raw text, CSV, and ORC.
  • Worked extensively on integrating Kafka (data ingestion) with Spark Streaming to achieve a high-performance real-time processing system.
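
A minimal sketch of the on-demand EMR launcher referenced above, assuming a Lambda handler triggered by an S3 event; the cluster sizing, IAM roles, and script path (s3://example-bucket/scripts/etl_job.py) are hypothetical placeholders.

```python
# Minimal sketch: Lambda handler that reacts to an S3 event and starts a transient
# EMR cluster with a custom spark-submit step. Names and sizes are assumptions.
import boto3

emr = boto3.client("emr")

def handler(event, context):
    # The S3 event tells us which object landed; pass it to the Spark job.
    record = event["Records"][0]["s3"]
    input_path = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    response = emr.run_job_flow(
        Name="on-demand-etl",
        ReleaseLabel="emr-6.10.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # transient: terminate after steps finish
        },
        Steps=[{
            "Name": "spark-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://example-bucket/scripts/etl_job.py", input_path],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        VisibleToAllUsers=True,
    )
    return {"ClusterId": response["JobFlowId"]}
```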

Data Analyst

Confidential

Responsibilities:

  • Uncovered categories of people, their significant attributes, and the optimal K value through clustering (see the sketch after this list).
  • Identified the important attributes for people across different age groups using classification models (logistic regression, support vector machines, and decision trees) to assign weights to attributes, which were then converted into reward points.
  • Parsed complex files through Informatica data transformations and loaded them into the database.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer.
  • Created dashboards and generated insights using Tableau and Python, which helped identify unengaged members and members' spending patterns; this enabled the credit union to target members based on their needs and age, increasing member engagement by 2%.
  • Assigned weights to attributes based on their level of importance, which in turn were used to award reward points to members of the union.
  • Identified members' spending patterns, which helped the bank understand where most of the money is spent; this information was used to provide offers and services that increased member engagement with the credit union and helped attract new members.
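
A minimal scikit-learn sketch of the clustering step referenced above, assuming the optimal K is chosen by silhouette score; the input file (members.csv) and feature columns are hypothetical placeholders.

```python
# Minimal sketch: choose an optimal K for K-means via silhouette scores,
# then profile the resulting segments. Feature names are assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

members = pd.read_csv("members.csv")            # illustrative input file
features = members[["age", "monthly_spend", "num_transactions"]]
X = StandardScaler().fit_transform(features)    # K-means is distance-based, so scale first

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(f"Best K by silhouette score: {best_k}")

# Fit the final model and attach cluster labels for profiling the segments.
members["segment"] = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X)
print(members.groupby("segment")[["age", "monthly_spend"]].mean())
```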

Data Analyst

Confidential

Responsibilities:

  • Prepared files and reports by performing ETL, data extraction, and data validation; managed metadata and prepared the data dictionary per project requirements.
  • Performed requirement analysis to prepare test plans and test cases, and executed test cases per requirements.
  • Worked on tokenization and fraud detection projects for Visa and Mastercard.
  • Built a risk analytics web application to analyze credit and investment risk using R (shiny, shinydashboard) and JavaScript.
  • Assisted in development of data formatting, cleaning, analyzing, and documentation from various data sources.
  • Involved in relational and dimensional data modeling to create the logical and physical database design and ER diagrams, with all related entities and their relationships based on the rules provided by the business manager, using ER Studio.
  • Created a data lake by extracting customer data from various sources, including Teradata, mainframes, RDBMS, CSV, and Excel.
  • Expert with Amazon Simple Storage Service (S3) and data storage buckets in AWS.
  • Good knowledge of AWS services such as EC2, S3, CloudFront, RDS, DynamoDB, and Elasticsearch.
  • Used Spark DataFrame APIs to ingest data from HDFS into AWS S3 (see the sketch after this list).
  • Involved in the design and development of data transformation framework components to support the ETL process, which produces a single, complete, actionable view of the customer.
  • Developed an ingestion module to ingest data into HDFS from heterogeneous data sources.
  • Built distributed in-memory applications using Spark and Spark SQL to do analytics efficiently on huge data sets.
  • Extensive experience in PL/SQL programming: stored procedures, functions, packages, and triggers.
  • Worked on SDLC process for gathering requirements and managing the entire project lifecycle
  • Worked on the development of dashboard reports in Tableau covering key performance indicators (KPIs) for top management.
  • Developed SQL queries to perform DDL, DML, and DCL operations.
  • Rebuilt existing reports as Tableau visualizations with user and action filters.
  • Ensured data accuracy and reliability of reports presented to the stakeholders
  • Translated business needs into reporting and analytics requirements.
  • Implemented Data Mining techniques to derive new insights from the data
  • Generated database reports, presentation, and records of analytical methods.
  • Prepared analysis reports, test reports, and root cause analyses for defects raised, and implemented the interface and presentations to communicate results to stakeholders.
  • Created and loaded mock data and tested Jira issues in the QA and UAT environments.
  • Worked in production support, resolving critical financial issues.
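
A minimal PySpark sketch of the HDFS-to-S3 ingestion referenced above; the source and target paths and the partition column are hypothetical placeholders.

```python
# Minimal sketch: read from HDFS with the DataFrame API and land the data in S3.
# Paths and the partition column are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hdfs-to-s3-ingest").getOrCreate()

# Read source files from HDFS (ORC here; CSV/text follow the same pattern).
source = spark.read.orc("hdfs:///data/customer/transactions/")

# Light standardization before landing: derive a load date for partitioning.
staged = source.withColumn("load_date", F.current_date())

# Write to the S3 data lake zone, partitioned for downstream queries.
(staged.write
       .mode("append")
       .partitionBy("load_date")
       .parquet("s3a://example-datalake/raw/transactions/"))
```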
