Sr. Data Engineer Resume
Bentonville, AR
SUMMARY
- Around 8 years of experience in IT with exceptional expertise in Big Data/Hadoop ecosystem and Data Analytics techniques.
- Hands-on experience working with the Big Data/Hadoop ecosystem including Apache Spark, MapReduce, Spark Streaming, PySpark, Hive, HDFS, Kafka, Redis, Sqoop, and Oozie.
- Proficient in Python scripting, using NumPy for statistical functions, Matplotlib for visualization, and Pandas for organizing data.
- Experience with different Hadoop distributions such as Cloudera and Hortonworks Data Platform (HDP).
- In-depth understanding of Hadoop architecture including YARN and components such as HDFS, Resource Manager, Node Manager, Name Node, and Data Node.
- Hands-on experience importing and exporting data from RDBMS into HDFS and vice versa using Sqoop.
- Experience working with the Hive data warehouse tool: creating tables, distributing data through static and dynamic partitioning and bucketing, and applying Hive optimization techniques.
- Experience working with NoSQL databases including Cassandra, MongoDB, and HBase.
- Experience in tuning and debugging Spark applications and using Spark optimization techniques.
- Experience in building PySpark and Spark-Scala applications for interactive analysis, batch processing and stream processing.
- Hands-on experience creating real-time data streaming solutions using Apache Spark Core, Spark SQL, and DataFrames.
- Extensive knowledge of implementing, configuring, and maintaining Amazon Web Services (AWS) offerings such as EC2, S3, Redshift, Glue, and Athena.
- Experience in working with Azure cloud platform (HDInsight, DataLake, Databricks, Blob Storage, Data Factory, ML Studio, Synapse, SQL, SQL DB, DWH and Data Storage Explorer).
- Experienced in data manipulation using Python and Python libraries such as Pandas, NumPy, SciPy, and Scikit-Learn for data analysis, numerical computations, and machine learning.
- Experience in writing SQL queries, data integration, and performance tuning.
- Developed various shell scripts and Python scripts to automate Spark jobs and Hive scripts.
- Actively involved in all phases of data science project life cycle including Data collection, Data Pre-Processing, Exploratory Data Analysis, Feature Engineering, Feature selection and building Machine learning Model pipeline.
- Hands-on experience with visualization tools such as Tableau and Power BI.
- Experience working with Git and Bitbucket version control systems.
- Extensive experience working in Test-Driven Development and Agile-Scrum environments.
- Involved in daily SCRUM meetings to discuss the development/progress and was active in making scrum meetings more productive.
- Excellent communication, interpersonal, and problem-solving skills and a team player. Able to adapt quickly to new environments and technologies.
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, Bentonville, AR
Responsibilities:
- Created pipelines in Azure ML Workspace and ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool, in both directions.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Tuned the performance of Spark applications by setting the right batch interval, the correct level of parallelism, and memory settings.
- Created Azure ML pipelines with Python modules for production and user consumption.
- Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL Activity.
- Orchestrated all Data pipelines using Azure Data Factory and built a custom alerts platform for monitoring.
- Used Azure Data Lake Storage Gen2 to store Excel and Parquet files and retrieved the data using the Blob API.
- Created Logic Apps with different triggers and connectors to integrate data from Workday into different destinations.
- Responsible for testing and fixing bugs in a monitoring application in which a user can create, start, stop, or delete a Spark cluster in Azure Databricks or a compute instance or compute cluster in Azure ML workspace.
- Developed RESTful endpoints to cache application-specific data in in-memory data clusters such as Redis and to expose that data.
- Created NiFi workflows to ingest data into the Hadoop Data Lake from MySQL and Postgres.
- Developed various solution-driven views and dashboards in Power BI using different chart types including Pie Charts, Bar Charts, Tree Maps, Circle Views, Line Charts, Area Charts, and Scatter Plots.
Environment: Azure Databricks, DataLake, MySQL, Azure ML, Azure SQL, PowerBI, Blob Storage, Data Factory, Data Storage Explorer, Hadoop 2.x (HDFS, MapReduce, Yarn), Spark v2.0.2, PySpark, Hive, BitBucket, Postgres, Presto DB, Redis, RabbitMQ.
Data Engineer
Confidential, New York, NY
Responsibilities:
- Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Tuned Spark applications by setting the batch interval, the correct level of parallelism, and memory settings.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model, which gets data from Kafka in real time and persists it to Cassandra.
- Scheduled Spark/Scala jobs using Oozie workflows in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
- Developed a Kafka consumer API in Python for consuming data from Kafka topics.
- Used Kafka to consume XML messages and Spark Streaming to process the XML files to capture UI updates.
- Gained practical experience implementing cloud-specific technologies including IAM and Amazon cloud services such as Elastic Compute Cloud (EC2), ElastiCache, Simple Storage Service (S3), CloudFormation, Virtual Private Cloud (VPC), Route 53, Lambda, Glue, and EMR.
- Migrated an existing on-premises application to AWS and used AWS services like EC2 and S3 for small data sets processing and storage.
- Loaded data into S3 buckets using AWS Lambda Functions, AWS Glue and PySpark and filtered data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables. Maintained and operated Hadoop cluster on AWS EMR.
- Used an AWS EMR Spark cluster and Cloud Dataflow on GCP to compare the efficiency of a POC on a developed pipeline.
- Configured Snowpipe to pull data from S3 buckets into Snowflake tables and stored incoming data in the Snowflake staging area.
- Created real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Worked on Amazon Redshift to consolidate all data warehouses into a single data warehouse.
- Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and then exported the transformed data to Cassandra as per the business requirements.
- Designed, developed, deployed, and maintained MongoDB.
- Worked extensively on Hadoop Components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, YARN, Spark and Map Reduce programming.
- Worked extensively with Sqoop for importing and exporting data between HDFS and relational database systems (RDBMS).
- Wrote several MapReduce jobs using PySpark and NumPy and used Jenkins for continuous integration.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Optimized the Hive tables using optimization techniques like partitions and bucketing to provide better performance with HiveQL queries.
- Worked on cloud deployments using Maven, Docker, and Jenkins.
- Worked with Avro, Parquet, RCFile, and JSON file formats and developed UDFs in Hive.
- Generated various kinds of reports using Power BI and Tableau based on Client specification.
Environment: Spark, Spark-Streaming, Spark SQL, AWS EMR, S3, EC2, MapR, HDFS, Hive, Apache Kafka, Sqoop, Python, Scala, PySpark, Shell scripting, Linux, MySQL, NoSQL, SOLR, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, PowerBI, SOAP, Cassandra, and Agile Methodologies.
Big Data Engineer
Confidential, Dallas, TX
Responsibilities:
- Analyzed large datasets to determine the optimal way to aggregate and report on them.
- Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Used Spark Streaming with Scala to receive real-time data from Kafka and store the stream data in HDFS and in NoSQL databases such as HBase and Cassandra.
- Used Kafka for live streaming data and performed analytics on it. Worked with Sqoop to transfer data between relational databases and Hadoop.
- Loaded data from Web servers and Teradata using Sqoop, Flume and Spark Streaming API.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling in AWS CloudFormation.
- Wrote AWS Lambda code in Python to convert, compare, and sort nested JSON files.
- Created AWS data pipelines using various AWS resources, including AWS API Gateway to receive responses from AWS Lambda, with a Lambda function retrieving data from Snowflake and converting the response into JSON format; the pipelines used Snowflake, DynamoDB, AWS Lambda, and AWS S3.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for processing and storage of small data sets, and maintained the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed file formats.
- Used Pandas, NumPy, Seaborn, SciPy, Matplotlib, Scikit-learn, and NLTK in Python to develop various machine learning algorithms, and applied algorithms such as linear regression, multivariate regression, naive Bayes, Random Forests, K-means, and KNN for data analysis.
- Developed Python code for the tasks, dependencies, and time sensors of each job for workflow management and automation using Airflow.
- Worked on cloud deployments using Maven, Docker and Jenkins.
- Created Glue jobs to process data from the S3 staging area to the S3 persistence area.
- Scheduled Spark/Scala jobs using Oozie workflows in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
- Used data to build interactive Power BI dashboards and reports based on business requirements.
Environment: AWS EMR, S3, EC2, Lambda, MapR, Apache Spark, Spark-Streaming, Spark SQL, HDFS, Hive, Apache Kafka, Sqoop, Flume, Python, Scala, Shell scripting, Linux, MySQL, HBase, NoSQL, DynamoDB, Cassandra, Machine Learning, Snowflake, Maven, Docker, AWS Glue, Jenkins, Eclipse, Oracle, Git, Oozie, Tableau, Power BI.
Data Engineer
Confidential
Responsibilities:
- Extensively worked with Azure cloud platform (HDInsight, DataLake, Databricks, Blob Storage, Data Factory, Synapse, SQL, SQL DB, DWH and Data Storage Explorer).
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Created pipelines in Azure Data Factory (ADF) using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, Azure SQL DW, and the write-back tool, in both directions.
- Created Application Interface Document for the downstream to create new interface to transfer and receive the files through Azure Data Share.
- Designed and configured Azure Cloud relational servers and databases, analyzing current and future business requirements.
- Wrote PowerShell scripts to automate the creation of Azure cloud resources: Resource Groups, Web Applications, Azure Storage Blobs and Tables, and firewall rules.
- Worked on migration of data from On-prem SQL server to Cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB).
- Configured Input & Output bindings of Azure Function with Azure Cosmos DB collection to read and write data from the container whenever the function executes.
- Designed and deployed data pipelines using DataLake, Databricks, and Apache Airflow.
- Developed Elastic pool databases and scheduled Elastic jobs to execute T-SQL procedures.
- Developed Spark applications using PySpark and Spark-SQL for data extraction, transformation, and aggregation from multiple file formats for analyzing & transforming the data to uncover insights into the customer usage patterns.
- Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to perform streaming analytics in Databricks.
- Created and provisioned different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.
- Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
- Created data pipelines to load different events from Azure Blob Storage into Hive external tables. Used various Hive optimization techniques such as partitioning, bucketing, and map joins.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
- Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest data from Snowflake, MS SQL, and MongoDB into HDFS and analyze it.
- Developed automated job flows and ran them through Oozie, which runs MapReduce jobs internally, daily and on demand.
- Extracted tables and exported data from Teradata through Sqoop and loaded it into Cassandra.
Environment: Azure HDInsight, Databricks, DataLake, CosmosDB, MySQL, Azure SQL, Snowflake, MongoDB, Cassandra, Teradata, Ambari, Flume, Tableau, PowerBI, Azure AD, Git, Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, Yarn), Spark v2.0.2, PySpark, Airflow, Hive, Sqoop, HBase, Oozie.
Hadoop Developer
Confidential
Responsibilities:
- Installed and configured Hadoop ecosystem components such as Hive, Oozie, and Sqoop on a Cloudera Hadoop cluster and helped with performance tuning and monitoring.
- Gathered business requirements and helped prepare design documents according to client requirements.
- Analyzed Teradata procedures and imported all the data from Teradata into a MySQL database to support HiveQL queries; developed Hive queries that use UDFs where Hive lacks the required built-in functions.
- Converted complex Oracle code into HQL and developed Hive UDFs to replicate keywords such as PIVOT and UNPIVOT.
- Implemented dynamic partitioning and bucketing in Hive as part of performance tuning, and created workflow and coordinator files in the Oozie framework to automate tasks.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data using Pig; used Sqoop on a regular basis to import and export data between MySQL, HDFS, and NoSQL databases; and designed and developed Pig scripts to process data in batches for trend analysis.
- Designed and developed POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Developed data pipelines using Sqoop, Pig and Hive to ingest customer member data, clinical, biometrics, lab and claims data into HDFS to perform data analytics.
- Developed Sqoop scripts to handle change data capture, processing incremental records between newly arrived and existing data in RDBMS tables.
- Loaded the aggregated data onto Oracle from Hadoop environment using Sqoop for reporting on the dashboard.
- Created Hive-based scripts for analyzing requirements and processing data, and designed the cluster to handle large volumes of data for cross-examining data loaded by Hive and MapReduce jobs.
- Worked closely with the DevOps team to understand, design, and develop end-to-end flow requirements, using Oozie workflows to run Hadoop jobs.
- Assisted with data capacity planning and node forecasting and collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.
Environment: Oozie, Cloudera Distribution with Hadoop (CDH4), MySQL, CentOS, Apache HBase, HDFS, MapReduce, Hue, Hive, PIG, Sqoop, SQL, Windows, Linux
SQL Developer
Confidential
Responsibilities:
- Gathered business requirements and converted them into new T-SQL stored procedures in Visual Studio for a database project.
- Performed unit tests on all code and packages.
- Analyzed requirements and impact by participating in online Joint Application Development sessions with the business client.
- Performed and automated SQL Server version upgrades, patch installs and maintained relational databases.
- Performed front-line code reviews for other development teams.
- Modified and maintained SQL Server stored procedures, views, ad-hoc queries, and SSIS packages used in the search engine optimization process.
- Updated existing and created new reports using Microsoft SQL Server Reporting Services. Team consisted of 2 developers.
- Created files, views, tables, and data sets to support the Sales Operations and Analytics teams.
- Monitored and tuned database resources and activities for SQL Server databases.