Data Engineer Resume

PROFESSIONAL SUMMARY:

  • Over 8 years of experience in Data Engineering, Data Pipeline Design, Development and Implementation as a Sr. Data Engineer/Data Developer and Data Modeler.
  • Strong experience with the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
  • Strong experience in writing scripts with the Python, PySpark, and Spark APIs for data analysis.
  • Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
  • Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig in Python.
  • Expertise in System and Server administration of Windows and Linux Environments.
  • Experience in installing, upgrading, patching, configuring, and administering Linux OS with proper release management.
  • Experience with Azure PaaS and IaaS services; worked on storage such as Blob storage (page and block blobs) and SQL Azure. Well experienced in deployment, configuration management, and virtualization.
  • Experience in performance monitoring, security, troubleshooting, backup, disaster recovery, maintenance, and support of Linux systems.
  • Expertise in Azure infrastructure management (Azure Web Roles, Worker Roles, SQL Azure, Azure Storage, Azure AD Licenses, Office365).
  • Hands-on experience designing and developing software applications with Microsoft .NET using C# and Classic ASP.
  • Experience in managing Microsoft Windows server infrastructure and data-center operations by effectively planning, installing, configuring, and optimizing the IT infrastructure to achieve high availability and performance.
  • Expertise in setting up servers from scratch in clustered environments with load balancing.
  • Planning and implementing Disaster Recovery solutions, capacity planning, data archiving, backup/recovery strategies, Performance Analysis and optimization.
  • Developed various UI (user interface) components using Struts (MVC), JSP, HTML, JavaScript, and AJAX.
  • Worked with various text analytics libraries such as Word2Vec, GloVe, and LDA; experienced with hyperparameter tuning techniques such as Grid Search and Random Search, and with model performance tuning using ensembles and deep learning.
  • Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
  • Knowledge of working with Proofs of Concept (PoCs) and gap analysis; gathered the necessary data for analysis from different sources and prepared it for exploration using data munging and Teradata.
  • Well experienced in normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality (a brief sketch follows this summary).
  • Expertise working with AWS cloud services such as EMR, S3, Redshift, and CloudWatch for big data development.
  • Good working knowledge of the Amazon Web Services (AWS) cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
  • Expertise in designing complex mappings, with expertise in performance tuning and in Slowly Changing Dimension and Fact tables.
  • Ability to work effectively in cross-functional team environments, excellent communication, and interpersonal skills.
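For illustration, here is a minimal sketch of the Python-UDF-for-Hive pattern referenced in this summary, using Hive's streaming TRANSFORM mechanism; the script name, table, and column names are hypothetical.

```python
#!/usr/bin/env python
# clean_name_udf.py -- a streaming script used as a Hive "UDF" via TRANSFORM.
# Hypothetical Hive-side usage:
#   ADD FILE clean_name_udf.py;
#   SELECT TRANSFORM (user_id, raw_name)
#          USING 'python clean_name_udf.py' AS (user_id, clean_name)
#   FROM users;
# Hive pipes each row to stdin as tab-separated fields and reads the
# transformed row back from stdout.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue  # skip malformed rows
    user_id, raw_name = fields[0], fields[1]
    print("\t".join([user_id, raw_name.strip().upper()]))
```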

TECHNICAL SKILLS:

Hadoop Ecosystem: MapReduce, Spark 2.3, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0

BI Tools: SSIS, SSRS, SSAS

Data Modeling Tools: Erwin Data Modeler, ER Studio v17

Programming Languages: SQL, PL/SQL, Python, Scala, and UNIX shell scripting.

Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile

Cloud Platform: AWS, Azure, Google Cloud.

Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena

Databases: Oracle 12c/11g, Teradata R15/R14 (8 years)

OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9

ETL/Data Warehouse Tools: Informatica 9.6/9.1 and Tableau.

Operating System: Windows, Unix, Sun Solaris

PROFESSIONAL EXPERIENCE:

Data Engineer

Confidential

Responsibilities:

  • Responsible for analyzing business requirements, estimating tasks, and preparing mapping design documents for Confidential Point of Sale (POS) and Direct Sales (digital sales) across all GOEs.
  • Analyzed large and critical datasets using Cloudera, HDFS, MapReduce, Hive, Hive UDF, Pig, Sqoop and Spark.
  • Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Worked with Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, Pair RDDs, and Spark on YARN.
  • Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that gets data from Kafka in near real time and persists it to Cassandra (see the streaming sketch after this list).
  • Consumed XML messages from Kafka and processed the XML files using Spark Streaming to capture UI updates.
  • Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files (see the flattening sketch after this list).
  • Developed Dashboard reports on Tableau.
  • Experienced in writing live real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
  • Experience in using Microsoft Azure SQL database, Data Lake, Azure ML, Azure data factory, Functions, Databricks and HDInsight.
  • Working experience in big data on the cloud using AWS EC2 and Microsoft Azure, handling Redshift and DynamoDB databases with data volumes of about 300 TB.
  • Extensive experience in migrating on premise Hadoop platforms to cloud solutions using AWS and Azure.
  • Gathered business requirements from business partners and subject matter experts.
  • Ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirements.
  • Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed file formats.
  • Developed automated processes for flattening the upstream data from Cassandra, which is in JSON format, using Hive UDFs.
  • Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms
  • Developed Pig UDFs for manipulating data according to business requirements, worked on developing custom Pig loaders, and implemented various requirements using Pig scripts.
  • Developed a Spark Streaming module for consumption of Avro messages from Kafka.
  • Experienced in querying data using Spark SQL on top of the Spark engine and implementing Spark RDDs in Scala.
  • Expertise in writing Scala code using higher-order functions for iterative algorithms in Spark for performance considerations.
  • Expertise in Snowflake for creating and maintaining tables and views.
  • Created Impala tables and SFTP scripts and Shell scripts to import data into Hadoop.
  • Created Hive tables, was involved in data loading, and wrote Hive UDFs; developed Hive UDFs for rating aggregation.
  • Performed various performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
  • Experience in data extraction, transformation, and loading of data from multiple data sources into target databases using Azure Databricks, Azure SQL, PostgreSQL, SQL Server, and Oracle.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
  • Worked on Azure Databricks to use custom DNS and configure network security group (NSG) rules to specify egress traffic restrictions.
  • Extensive experience on Azure HDInsight, Azure Cosmos DB, Azure Databricks, Azure Stream Analytics.
  • Loaded data from different sources (databases and files) into Hive using the Talend tool.
  • Implemented Spark using Python/Scala, utilizing Spark Core, Spark Streaming, and Spark SQL for faster data processing than MapReduce in Java.
  • Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
  • Worked on custom Pig Loaders and Storage classes to work with a variety of data formats such as JSON, Compressed CSV etc.
  • Involved in running Hadoop Streaming jobs to process Terabytes of data
  • Used Rally for bug tracking and CVS for version control.
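Below is a minimal PySpark sketch of the Kafka-to-Cassandra streaming flow described in this list. The topic, schema, keyspace, and table names are hypothetical, and the spark-sql-kafka and spark-cassandra-connector packages are assumed to be available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("learner-events-stream").getOrCreate()

# Hypothetical event schema for the common learner data model.
schema = StructType([
    StructField("learner_id", StringType()),
    StructField("course_id", StringType()),
    StructField("event_ts", LongType()),
])

# Read JSON events from Kafka in near real time.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "learner-events")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_batch(batch_df, batch_id):
    # foreachBatch lets each micro-batch reuse the batch Cassandra writer.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .mode("append")
     .options(table="learner_events", keyspace="learning")
     .save())

query = events.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```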
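And a minimal sketch of the JSON-flattening preprocessing job mentioned above, written with Spark DataFrames; the paths and field names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("flatten-json").getOrCreate()

# Read nested JSON documents (hypothetical layout: order header, customer
# struct, and an array of line items).
raw = spark.read.json("/data/raw/orders/*.json")

flat = (raw
        .withColumn("item", explode(col("items")))    # one row per array element
        .select(
            col("order_id"),
            col("customer.id").alias("customer_id"),  # lift nested struct fields
            col("customer.region").alias("region"),
            col("item.sku").alias("sku"),
            col("item.qty").alias("qty"),
        ))

# Write the flattened result out as a delimited flat file.
(flat.coalesce(1)
 .write.mode("overwrite")
 .option("header", "true")
 .csv("/data/flat/orders"))
```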

Environment: Hadoop, MapReduce, Hive, HDFS, Pig, Sqoop, Hortonworks, Flume, HBase, Oracle, Snowflake, Teradata, Tableau, Unix/Linux, Oracle/SQL & DB2, Rally, Azure

Sr. Data Engineer

Confidential, Charlotte, NC

Responsibilities:

  • Performed data analysis and developed analytic solutions; investigated data to discover correlations and trends and to explain them.
  • Worked with data engineers and data architects to define back-end requirements for data products (aggregations, materialized views, tables, visualization).
  • Developed frameworks and processes to analyze unstructured information. Assisted in Azure Power BI architecture design
  • Experienced with machine learning algorithms such as logistic regression, random forest, XGBoost, KNN, SVM, neural networks, linear regression, lasso regression, and k-means.
  • Very good experience working with Azure Cloud, Azure DevOps, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analytical Services, Azure Cosmos DB (NoSQL), Azure HDInsight big data technologies (Hadoop and Apache Spark), and Databricks.
  • Experience in designing Azure Cloud Architecture and Implementation plans for hosting complex application workloads on MS Azure.
  • Implemented statistical and deep learning models (Logistic Regression, XGBoost, Random Forest, SVM, RNN, CNN).
  • Designing and Developing Oracle PL/SQL and Shell Scripts, Data Import/Export, Data Conversions and Data Cleansing.
  • Responsible for importing data from PostgreSQL into HDFS and Hive using the Sqoop tool.
  • Maintained consistency of SAS variable formats, sorted various datasets using PROC SORT, and merged them using MERGE.
  • Created branches from within JIRA and JIRA Agile by integrating Bitbucket with JIRA.
  • Designed and implemented Sqoop for the incremental job to read data from DB2 and load to Hive tables and connected to Tableau for generating interactive reports using Hive server2.
  • Architected complete scalable data pipelines and a data warehouse for optimized data ingestion.
  • Used a CI/CD pipeline to deploy the code to production.
  • ETL: employed tools such as Azure SQL Server, Azure Data Factory, and Databricks to create end-to-end data pipelines for collecting, cleansing, and processing client data.
  • Deployed and managed an Azure Databricks instance for the data science team for analysis of promotions and new products.
  • Actively participated in all the Agile meetings, Scrum meetings and involved in all the Retrospective meetings.
  • Experience in developing pipelines in Azure Data Factory using SQL Azure.
  • Migrating current data center environment to Azure Cloud using tools like Azure Site Recovery (ASR).
  • Implementing changes to firewalls and VNets to comply with customer security policies (i.e., Network Security Groups, etc.).
  • Enhancing customer’s capabilities around logging (centralizing log files and potentially using OMS or App Insights to look for problems and take action).
  • Working in base lining, capacity planning, Data Center Infrastructure & Network Designing, Windows Server Migrations.
  • Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Built a Data Sync job on Windows Azure to synchronize school data from SQL Server 2012, 2012 R2, and 2016 databases to SQL Azure.
  • Worked with Azure PaaS solutions such as Azure Web Apps, Web Roles, Worker Roles, SQL Azure, and Azure Storage.
  • Developed standardized web reports using the SAS BI suite of tools.
  • Experience in converting existing AWS infrastructure to a serverless architecture (AWS Lambda, Kinesis), deployed via Terraform and AWS CloudFormation templates.
  • Used Chef for configuration management of hosted Instances within GCP.
  • Built and maintained Docker/Kubernetes container clusters managed by Kubernetes on GCP, using Linux, Bash, Git, and Docker.
  • Architected and designed serverless application CI/CD using the AWS Serverless Application Model (Lambda).
  • Developed stored procedures and views in Snowflake and used them in Talend for loading dimensions and facts.
  • Developed merge scripts to UPSERT data into Snowflake from an ETL source (a sketch follows this list).
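A minimal sketch of the Snowflake UPSERT pattern mentioned in the last item, issued through the Python Snowflake connector; the connection parameters, table, and column names are hypothetical.

```python
import snowflake.connector

# Hypothetical connection details; in practice these come from a secrets store.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="***",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

# MERGE (upsert) rows from a staging table into a target dimension.
merge_sql = """
MERGE INTO dim_customer AS tgt
USING stg_customer AS src
   ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
     tgt.name       = src.name,
     tgt.segment    = src.segment,
     tgt.updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
     INSERT (customer_id, name, segment, updated_at)
     VALUES (src.customer_id, src.name, src.segment, CURRENT_TIMESTAMP());
"""

try:
    conn.cursor().execute(merge_sql)
finally:
    conn.close()
```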

Environment: Hadoop, Java, MapReduce, HDFS, Hive, Sqoop, Spring Boot, Cassandra, Swamp, Data Lake, Oozie, Kafka, Spark, Scala, AWS, GitHub, Docker, Talend Big Data Integration, Solr, Impala, Oracle, SQL Server, MySQL, NoSQL, MongoDB, HBase, Unix, Shell Scripting

Big Data Engineer

Confidential

Responsibilities:

  • Integrated Azure Active Directory authentication into every Cosmos DB request sent and demoed the feature to stakeholders.
  • Created numerous pipelines in Azure using Azure Data Factory v2 to get data from disparate source systems, using different Azure activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.
  • Experienced with Azure Synapse and with version control tools like Git, CVS, Bitbucket, and SVN; in-depth knowledge of source control concepts like branches, tags, and merges.
  • Performed data extraction, transformation, loading, and integration in data warehouse, operational data stores and master data management
  • Strong understanding of AWS components such as EC2 and S3
  • Created YAML files for each data source, including Glue table stack creation.
  • Planning to move from VCLOUD to GCP.
  • Worked on a Python script to extract data from Netezza databases and transfer it to AWS S3.
  • Developed Lambda functions and assigned IAM roles to run Python scripts with various triggers (SQS, EventBridge, SNS).
  • Created a Lambda deployment function and configured it to receive events from S3 buckets (see the sketch after this list).
  • Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using crontab.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer
  • Developed mappings using transformations such as Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean and consistent data.
  • Managed user accounts, groups, and workspace creation for different users in PowerCenter.
  • Wrote complex UNIX/Windows scripts for file transfers and emailing tasks over FTP/SFTP.
  • Worked with PL/SQL procedures and used them in Stored Procedure Transformations.
  • Extensively worked on Oracle and SQL Server; wrote complex SQL queries against the ERP system for data analysis purposes.
  • Worked on the most critical finance projects and was the go-to person for team members on any data-related issues.
  • Migrated ETL code from Talend to Informatica. Involved in development, testing and post production for the entire migration project.
  • Tuned ETL jobs in the new environment after fully understanding the existing code.
  • Maintained Talend admin console and provided quick assistance on production jobs.
  • Involved in designing Business Objects universes and creating reports.
  • Built ad hoc reports using stand-alone tables.
  • Involved in creating and modifying new and existing Web Intelligence reports.
  • Created publications that split into various reports based on the specific vendor.
  • Wrote Custom SQL for some complex reports.
  • Worked with business partners internal and external during requirement gathering.
  • Worked closely with Business Analyst and report developers in writing the source to target specifications for Data warehouse tables based on the business requirement needs.
  • Exported data into excel for business meetings which made the discussions easier while looking at the data.
  • Performed analysis after requirements gathering and walked team through major impacts.
  • Provided and debugged crucial reports for finance teams during the month-end period.
  • Analyzed the system for new enhancements/functionalities and performed impact analysis of the application for implementing ETL changes.
  • Implemented a continuous delivery pipeline with Docker, GitHub, and AWS.
  • Built performant, scalable ETL processes to load, cleanse and validate data
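A minimal sketch of a Python Lambda handler wired to S3 events, as referenced above. The handler only logs each object's size; the bucket contents and downstream processing are illustrative.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    """Triggered by S3 event notifications (one record per object created)."""
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys are URL-encoded, so decode before use.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        head = s3.head_object(Bucket=bucket, Key=key)
        print(json.dumps({"bucket": bucket, "key": key,
                          "size_bytes": head["ContentLength"]}))
    return {"status": "ok"}
```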

Environment: AWS, GCP, Java, BigQuery, GCS buckets, G-Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil, bq command-line utilities, Dataproc, Cloud SQL, MySQL, Postgres, SQL Server, Python, Scala, Spark, Hive, Spark SQL

Data Engineer

Confidential

Responsibilities:

  • Created sophisticated visualizations, calculated columns, and custom expressions, and developed map charts, cross tables, bar charts, tree maps, and complex reports involving property controls and custom expressions.
  • Coordinating with source system owners, day-to-day ETL progress monitoring, Data warehouse target schema Design (Star Schema) and maintenance.
  • Investigated market sizing, competitive analysis and positioning for product feasibility. Worked on Business forecasting, segmentation analysis and Data mining.
  • Automated diagnosis of blood loss during emergencies and developed a machine learning algorithm to diagnose blood loss.
  • Extensively used Agile methodology as the organization standard to implement the data models. Used a microservice architecture with Spring Boot based services interacting through a combination of REST and Apache Kafka message brokers.
  • Created Application Interface Document for the downstream to create new interface to transfer and receive the files through Azure Data Share.
  • Created several types of data visualizations using Python and Tableau. Extracted Mega Data from AWS using SQL Queries to create reports.
  • Designed and modeled datasets with Power BI Desktop based on the measures and dimensions requested by the customer and on dashboard needs.
  • Performed regression testing for golden test cases from the State (end-to-end test cases) and automated the process using Python scripts.
  • Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying
  • Researched reinforcement learning and control (TensorFlow, Torch) and machine learning models (scikit-learn).
  • Designed and developed the logical and physical data models to support the data marts and the data warehouse.
  • Hands-on experience implementing Naive Bayes; skilled in Random Forests, Decision Trees, Linear and Logistic Regression, SVM, Clustering, and Principal Component Analysis (see the sketch after this list).
  • Utilized Waterfall methodology for team and project management.
  • Used Git for version control with Data Engineer team and Data Scientists colleagues.
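A minimal scikit-learn sketch of the random forest / grid search workflow referenced above; a public dataset stands in for the project data, and the hyperparameter grid is illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Public dataset used purely as a stand-in for project data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Small grid search mirroring the hyperparameter-tuning approach described above.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print(classification_report(y_test, search.best_estimator_.predict(X_test)))
```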

Environment: Spark, YARN, Hive, Pig, Scala, Mahout, NiFi, TDD, Python, Spring Boot, Hadoop, Azure, DynamoDB, Kibana, NoSQL, Sqoop, MySQL.

Data Analyst / Hadoop Developer

Confidential

Responsibilities:

  • Imported Legacy data from SQL Server and Teradata into Amazon S3.
  • Created consumption views on top of metrics to reduce the running time for complex queries.
  • Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
  • As part of data migration, wrote many SQL scripts to handle data mismatches and worked on loading the history data from Teradata SQL to Snowflake.
  • Developed SQL scripts to upload, retrieve, manipulate, and handle sensitive data (National Provider Identifier data, i.e., name, address, SSN, phone number) in Teradata, SQL Server Management Studio, and Snowflake databases for the project.
  • Worked on retrieving data from FS to S3 using Spark commands (see the sketch after this list).
  • Built S3 buckets, managed policies for S3 buckets, and used S3 and Glacier for storage and backup on AWS.
  • Created performance dashboards in Tableau, Excel, and PowerPoint for the key stakeholders.
  • Incorporated predictive modeling (rule engine) to evaluate the Customer/Seller health score using python scripts, performed computations and integrated with the Tableau viz.
  • Worked with stakeholders to communicate campaign results, strategy, issues or needs.
  • Analyzed marketing campaigns from various perspectives including CTR, conversion rates, seasonal/geographical trends, search queries, landing page, conversion funnel, quality score, competitors, distribution channel, etc. to achieve maximum ROI for clients.
  • Understood Business requirements to the core and came up with Test Strategy based on Business rules
  • Tested Hadoop MapReduce jobs developed in Python, Pig, and Hive.
  • Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
  • Generated custom SQL to verify dependencies for the daily, weekly, and monthly jobs.
  • Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
  • Experienced in working with the Spark ecosystem, using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Developed Spark and Spark SQL/Streaming code for faster testing and processing of data.
  • Closely involved in scheduling Daily, Monthly jobs with Precondition/Post condition based on the requirement.
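A minimal PySpark sketch of moving file-system data into S3, as referenced above; the paths, bucket name, and column names are hypothetical, and S3 credentials are assumed to be configured at the cluster level (for example on EMR).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fs-to-s3").getOrCreate()

# Read delimited files landed on HDFS / the local file system.
legacy = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("hdfs:///landing/legacy_sales/*.csv"))

# Basic sanity filter before publishing.
clean = legacy.filter("sale_amount is not null")

# Write the curated data back out to S3 as partitioned Parquet.
(clean.write
 .mode("overwrite")
 .partitionBy("sale_date")
 .parquet("s3://my-data-lake/curated/legacy_sales/"))
```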

Environment: Snowflake, Hadoop, Map Reduce, Spark SQL, Python, Pig, AWS, GitHub, EMR, Nebula Metadata, Teradata, SQL Server, Apache Spark, Sqoop
