Sr Data Engineer Resume
Jackson, MI
SUMMARY
- Data Engineer with 9+ years of combined experience building data solutions using Azure services such as Azure SQL Database, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, and Azure Databricks, and Big Data implementations such as Spark, Hive, Kafka, and HDFS, using programming languages including Python, Scala, and Java.
- Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm, Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
- High exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.
- Expertise in writing end-to-end data processing jobs to analyze data using MapReduce, Spark, and Hive.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience working in the Azure cloud with components such as Azure Data Factory, Azure Databricks, Azure Data Lake Store, Azure storage accounts (Gen1 & Gen2), Logic Apps, and Azure Key Vault.
- Good understanding of Big Data Hadoop and YARN architecture, along with the various Hadoop daemons such as JobTracker, TaskTracker, NameNode, and DataNode.
- Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, plus working knowledge of Spark MLlib.
- Hands-on experience with the Google Cloud Platform (GCP) big data products: BigQuery, Cloud Dataproc, Google Cloud Storage, and Cloud Composer (Airflow as a service).
- Strong experience in Business and Data Analysis, Data Profiling, Data Migration, Data Integration, Data governance and Metadata Management, Master Data Management and Configuration Management.
- Experience implementing Big Data engineering, cloud data engineering, data warehouse, data mart, data visualization, reporting, data quality, and data virtualization solutions.
- Experience analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for data mining, data cleansing, and machine learning.
- Hands-on experience in Azure cloud services (PaaS & IaaS): Azure Synapse Analytics, SQL Azure, Data Factory, Azure Analysis Services, Application Insights, Azure Monitoring, Key Vault, and Azure Data Lake.
- Experience building and architecting multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, coordinating tasks across the team.
- Skilled in Azure cloud technologies such as Azure Data Factory, Azure Databricks, Azure Data Lake Storage (ADLS), Azure Synapse Analytics, Azure SQL Database, Azure Analysis Services, and Apache Spark.
- Experience with GCP Dataproc, GCS, Cloud Functions, and BigQuery, as well as Azure Data Factory and Databricks.
- Experience in Big Data technologies: Hadoop, Azure Data Factory, Databricks with Python, Azure SQL, Data Lake Analytics, and Data Lake Store.
- Experienced in building, deploying, and managing SSIS packages with SQL Server Management Studio: creating SQL Server Agent jobs, configuring jobs and data sources, and scheduling packages through SQL Server Agent.
- Extensive experience in developing complex Stored Procedures, Functions, Triggers.
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Built data pipelines and applied data transformations for batch and real-time messaging systems.
- Assessing the effectiveness and accuracy of new data sources and data gathering techniques.
- Ability to work on various aspects of Data transformations and Data Modelling.
- Enhancing data collection procedures to include information that is relevant for building analytic systems.
- Experience in data cleansing and data mining.
- Extensive experience creating pipeline jobs and scheduling triggers using Azure Data Factory.
- Expertise in writing MapReduce jobs in Python to process large structured, semi-structured, and unstructured data sets and store them in HDFS (a minimal sketch follows this list).
- Experience with Google Cloud components, Google Container Builder, GCP client libraries, and the Cloud SDK.
- Experience in data transformations using MapReduce and Hive for different file formats.
- Strong experience in working with UNIX/LINUX environments, writing shell scripts.
- Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, XML Files.
- Experienced in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and tuning memory.
- Experienced in working in SDLC, Agile, and Waterfall methodologies.
- Strong analytical, presentation, communication, and problem-solving skills, with the ability to work independently or in a team and to follow the best practices and principles defined for the team.
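As a minimal illustration of the MapReduce-style Python jobs referenced above, the sketch below aggregates Apache log records with PySpark and stores the result in HDFS. The paths and field layout are hypothetical placeholders, not taken from any specific engagement.

```python
# Minimal PySpark sketch of a MapReduce-style job: parse semi-structured
# log lines, aggregate counts, and store the result in HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-aggregation").getOrCreate()

# Map phase: split each raw line and emit (status_code, 1) pairs.
lines = spark.sparkContext.textFile("hdfs:///data/raw/apache_logs/")
pairs = lines.map(lambda line: (line.split(" ")[8], 1))  # assumes Apache combined log format

# Reduce phase: sum counts per status code.
counts = pairs.reduceByKey(lambda a, b: a + b)

# Persist the aggregated output back to HDFS.
counts.saveAsTextFile("hdfs:///data/curated/status_code_counts/")
spark.stop()
```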
TECHNICAL SKILLS
Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Sqoop, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.
Azure Services: Azure SQL Db, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Databricks
Hadoop Distribution: Cloudera and Hortonworks
Programming Languages: Scala, Java
Script Languages: Python, Shell scripting (bash, sh)
Databases: Oracle, MySQL, SQL Server, PostgreSQL, HBase, Snowflake, Cassandra, MongoDB
Operating Systems: Linux, Windows, Ubuntu, Unix
Web/Application server: Apache Tomcat
IDE: IntelliJ IDEA, Eclipse, and NetBeans
Version controls and Tools: Git, Maven, SBT, CBT
PROFESSIONAL EXPERIENCE
Confidential, JACKSON MI
SR DATA ENGINEER
Responsibilities:
- Implemented solutions to run effectively in the cloud, improving the performance of big data processing and the system's handling of high data volumes to provide better customer support.
- Worked with business process managers as a subject matter expert for transforming vast amounts of data and creating business intelligence reports using state-of-the-art big data technologies (Hive, Spark, Sqoop, and NiFi for ingestion; Python/Bash scripting and Apache Airflow for scheduling jobs in AWS cloud-based environments).
- Created functions and assigned roles in AWS Lambda to run Python scripts, and built AWS Lambda functions in Java for event-driven processing (see the handler sketch after this list).
- Assisted with the creation of Delta Lake tables and the execution of merge scripts to handle upserts (see the merge sketch after this list).
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
- Developed Spark applications in Scala using DataFrames and the Spark SQL API for faster data processing.
- Developed Spark scripts using Scala shell commands as required.
- Used IAM to create new accounts, roles, groups, and policies, and developed critical modules such as generating AWS resource identifiers and integration points with S3, DynamoDB, RDS, Lambda, and SQS queues.
- Developed ETL parsing and analytics using Python/Spark to build a structured data model in Elasticsearch for consumption by the API and UI.
- Wrote complex SnowSQL scripts in the Snowflake cloud data warehouse for business analysis and reporting.
- Worked extensively on migrating on-premises workloads to the AWS cloud.
- Hands-on experience with Looker Explores to query data, validate data accuracy, identify error sources, and control content access for security.
- Involved in creating data ingestion pipelines to collect healthcare and provider data from external sources such as FTP servers and S3 buckets.
- Transformed and analyzed the data using PySpark and Hive based on ETL mappings.
- Involved in migrating the existing Teradata data warehouse to AWS S3-based data lakes.
- Worked on Azure Databricks, Azure Data Factory, Azure Synapse Analytics, HDInsight, and other Azure services.
- In-depth knowledge of the Databricks architecture.
- Integrated Azure Databricks with ADLS Gen1 and Gen2, Cosmos DB, Event Hubs, and DevOps to analyze and transform data.
- Deployed Databricks clusters using the Postman tool.
- Set up end-to-end environments based on the requirements.
- Assisted other team members with resolving issues in Azure.
- Invested time to understand the Databricks product from the ground up, collaborating with our Databricks partner and engineering team.
- Created Azure Data Factory pipelines for copying data from Azure Blob Storage to SQL Server.
- Developed ADF pipelines using Data Factory activities such as Copy Data and Data Lake Analytics.
- Created Azure Data Factory instances, setting up the integration runtime to connect the customer's on-premises environment to the Azure cloud.
- Worked on creating Azure Data Factory pipelines to process files from on-premises sources.
- Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
- Designed and architected scalable data processing and analytics solutions, including technical feasibility, integration, and development for big data storage, processing, and consumption across Azure data and analytics services: big data (Hadoop, Spark), business intelligence (Reporting Services, Power BI), NoSQL, HDInsight, Stream Analytics, Data Factory, Event Hubs, and Notification Hubs.
- Participated in daily stand-up meetings to update the project status with the internal Dev team.
- Wrote Python scripts to load data from web APIs into the staging database (see the loader sketch after this list).
- Reverse-engineered existing data models to incorporate new changes utilizing Erwin.
- Developed artifacts consumed by the data engineering team, such as source-to-target mappings, data quality rules, data transformation rules, and join logic.
- Designed the Hadoop platform for high performance and low cost when compared with existing in-house data warehousing systems.
- Created Azure Data Factory pipelines to move data across different sources and sinks as required, including Azure Cosmos DB, Azure storage accounts, ADLS Gen1, and ADLS Gen2.
- Debugged and resolved Azure Data Factory issues: pipeline failures, stuck pipelines, pipelines producing no output, performance problems, and deployment issues.
- Created dynamic SQL scripts to transfer data between databases across a multitude of tables.
- Worked with database objects like stored procedures, user-defined functions, triggers and indexes using T-SQL to create complex scripts and batches.
- Built high-quality, reliable, consistent, and sound systems that are aligned with and scale to our data business needs.
- Created and executed complex T-SQL queries utilizing SQL Server Management Studio for back-end data validation.
- Utilized T-SQL queries and views based on business reporting requirements, with performance tuning and complex SQL query optimization.
- Resolved connectivity issues with IoT Hub, Event Hub, and Blob Storage.
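A minimal sketch of the event-driven Lambda handler mentioned above, written in Python for consistency with the other examples (the original work also used Java): it reacts to an S3 put event and forwards a summary message to SQS. The bucket layout, queue URL, and payload fields are hypothetical.

```python
# Minimal event-driven AWS Lambda handler: on S3 put, inspect the new
# object and forward a summary to an SQS queue for downstream processing.
import json
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-events"  # placeholder

def handler(event, context):
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        obj = s3.head_object(Bucket=bucket, Key=key)  # metadata only, no download
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps(
                {"bucket": bucket, "key": key, "bytes": obj["ContentLength"]}
            ),
        )
    return {"processed": len(records)}
```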
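A minimal sketch of a Delta Lake upsert via MERGE, as referenced above; the table path, key column, and storage locations are hypothetical placeholders.

```python
# Minimal Delta Lake upsert: update matching rows by key, insert the rest.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("delta-upsert")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

updates = spark.read.parquet("s3://bucket/staging/members/")          # incoming batch
target = DeltaTable.forPath(spark, "s3://bucket/curated/members_delta/")

(target.alias("t")
    .merge(updates.alias("s"), "t.member_id = s.member_id")  # hypothetical key
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```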
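A minimal sketch of the web-API-to-staging-DB loaders described above; the endpoint, connection string, and table schema are hypothetical.

```python
# Minimal loader: pull JSON records from a web API and bulk-insert them
# into a SQL Server staging table.
import requests
import pyodbc

API_URL = "https://api.example.com/v1/claims"  # hypothetical endpoint
CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=staging;DATABASE=stage;Trusted_Connection=yes")

resp = requests.get(API_URL, timeout=30)
resp.raise_for_status()
records = resp.json()  # assumes a JSON array of objects

with pyodbc.connect(CONN_STR) as conn:
    cur = conn.cursor()
    cur.fast_executemany = True  # batch the inserts
    cur.executemany(
        "INSERT INTO stg.claims (claim_id, amount, status) VALUES (?, ?, ?)",
        [(r["claim_id"], r["amount"], r["status"]) for r in records],
    )
    conn.commit()
```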
Environment: Azure SQL, Azure Data Factory v2, Azure Data Lake Storage Gen2, Azure Databricks, HDInsight, Python, Spark, Azure Storage, Azure IoT Hub, Azure Event Hub
Confidential, COLUMBUS, GA
SR. DATA ENGINEER
Responsibilities:
- Involved in converting Hive/SQL queries into transformations using Python.
- Performed complex joins on Hive tables with various optimization techniques.
- Created Hive tables per requirements, defining internal or external tables with appropriate static and dynamic partitions for efficiency (see the DDL sketch after this list).
- Developed Python application for Google Analytics aggregation and reporting and used Django configuration to manage URLs and application parameters.
- Initiated the development and implementation of website user clickstream data analytics in Hadoop/Hive.
- Worked extensively with Hive DDLs and the Hive Query Language (HQL).
- Involved in loading data from the edge node to HDFS using shell scripting.
- Understood and managed Hadoop log files.
- Developed and deployed solutions using Spark and Scala code on a Hadoop cluster running on GCP.
- Worked with AWS services including Virtual Private Cloud (VPC), CloudFormation, Lambda, CloudFront, CloudWatch, IAM, EBS, Security Groups, Auto Scaling, DynamoDB, Route 53, and CloudTrail.
- Worked with Impala for massively parallel processing of ad-hoc queries; designed and developed complex queries using Hive and Impala for a logistics application.
- Worked on big data technologies such as Hive, Impala, HDFS, and Oozie workflows for ingesting data from different sources into the audit layer and then into the harmonized layer.
- Created Bash scripts to add dynamic partitions to Hive staging tables; responsible for bulk-loading data into HBase using MapReduce jobs.
- Involved in various transformation and data cleansing activities using control flow and data flow tasks in SSIS packages during data migration.
- Built data pipelines in Airflow on GCP for ETL jobs using a range of Airflow operators, both legacy and newer ones (see the DAG sketch after this list).
- Worked with business process managers as a subject matter expert for transforming vast amounts of data and creating business intelligence reports using state-of-the-art big data technologies (Hive, Spark, Sqoop, and NiFi for ingestion; Python/Bash scripting and Apache Airflow for scheduling jobs in GCP cloud-based environments).
- Managed Hadoop infrastructure with Cloudera Manager.
- Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
- Imported and exported data into HDFS and Hive using Sqoop, and used Flume to extract data from multiple sources.
- Analyzed data to report patterns and trends that enabled business teams to make informed decisions; structured and led customer feedback analysis projects in error detection using Python libraries.
- Got involved in migrating the on-premises Hadoop system to GCP (Google Cloud Platform).
- Used data analysis to validate business rules and identify low-quality missing data in the data warehouse.
- Performed data analysis to find anomalies and ensure data integrity before consolidating data.
- Responsible for analysis and functional specifications of “As-Is” and “To-Be” business models to conduct GAP analysis in comparison to the proposed system to enhance the functionality of the application.
- Used data analysis techniques to validate business rules and identify low-quality missing data in the existing Enterprise Data Warehouse.
- Performed data analysis on test execution & defect data across projects to identify gaps and potential areas of improvement - exploratory data analysis, factor analysis, sentiment analysis, and text mining.
- Fine-tuned SQL queries using showplans and execution plans for better performance.
- Created data mapping documents for incorporating the ETL strategy utilizing SSIS.
- Involved in Cube Partitioning, refresh strategy, and planning dimensional data modeling in SSAS using DAX.
- Administered and scheduled data-driven mailing subscriptions on the SSRS reporting web portal.
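A minimal sketch of the partitioned Hive table work referenced above, issued from PySpark; the database, columns, and paths are hypothetical placeholders.

```python
# Minimal sketch: create an external, partitioned Hive table, then load it
# with dynamic partitioning from a staging table.
from pyspark.sql import SparkSession

spark = (SparkSession.builder.appName("hive-partitions")
         .enableHiveSupport().getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.clicks (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///warehouse/analytics/clicks'
""")

# Dynamic partition insert: Hive derives event_date from the data itself.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE analytics.clicks PARTITION (event_date)
    SELECT user_id, url, ts, date_format(ts, 'yyyy-MM-dd') AS event_date
    FROM staging.raw_clicks
""")
```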
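A minimal sketch of a GCP Airflow DAG of the kind described above: stage files from GCS into BigQuery, then run a transform query. The bucket, dataset, project, and query are hypothetical placeholders; the operators come from the Google provider package.

```python
# Minimal GCS-to-BigQuery ETL DAG: a load task feeding a transform task.
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="gcs_to_bq_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load = GCSToBigQueryOperator(
        task_id="load_raw",
        bucket="example-landing",                      # placeholder bucket
        source_objects=["events/{{ ds }}/*.json"],     # one folder per day
        destination_project_dataset_table="analytics.raw_events",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
    )
    transform = BigQueryInsertJobOperator(
        task_id="transform_daily",
        configuration={
            "query": {
                "query": ("SELECT user_id, COUNT(*) AS events "
                          "FROM analytics.raw_events GROUP BY user_id"),
                "destinationTable": {
                    "projectId": "example-project",
                    "datasetId": "analytics",
                    "tableId": "daily_user_events",
                },
                "useLegacySql": False,
                "writeDisposition": "WRITE_TRUNCATE",
            }
        },
    )
    load >> transform  # load must finish before the transform runs
```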
Confidential, Manassas, VA
Data Engineer
Responsibilities:
- Analyzed, designed, and built modern data solutions using Azure PaaS services to support data visualization. Understood the current production state of the application and determined the impact of new implementations on existing business processes.
- Implemented proofs of concept for SOAP and REST APIs.
- Used REST APIs to retrieve analytics data from different data feeds.
- Developed various shell and Python scripts to address production issues.
- Extracted the needed data from the server into HDFS and bulk-loaded the cleaned data into HBase.
- Designed and maintained databases using Python, and developed a Python-based RESTful web service API using Flask, SQLAlchemy, PL/SQL, and PostgreSQL (see the sketch after this list).
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Enhanced scripts of existing Python modules. Worked on writing APIs to load the processed data into HBase tables.
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
- Hands-on experience developing SQL scripts for automation.
- Created build and release pipelines for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS).
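A minimal sketch of a Flask + SQLAlchemy RESTful endpoint backed by PostgreSQL, along the lines described above; the model, routes, and connection URI are hypothetical placeholders.

```python
# Minimal Flask RESTful web service: list and create rows in a
# PostgreSQL-backed table via SQLAlchemy. Table is assumed to exist.
from flask import Flask, jsonify, request
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://user:pass@localhost/appdb"  # placeholder
db = SQLAlchemy(app)

class Device(db.Model):
    __tablename__ = "devices"
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(80), nullable=False)

@app.route("/devices", methods=["GET"])
def list_devices():
    return jsonify([{"id": d.id, "name": d.name} for d in Device.query.all()])

@app.route("/devices", methods=["POST"])
def create_device():
    payload = request.get_json()
    device = Device(name=payload["name"])
    db.session.add(device)
    db.session.commit()
    return jsonify({"id": device.id}), 201

if __name__ == "__main__":
    app.run()
```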
Environment: Azure PaaS, T-SQL, Spark SQL, Azure Data Lake, Python, SOAP, REST, Pyspark, Azure SQL, SQL Server, VSTS.
Confidential
SQL Data Analyst
Responsibilities:
- Designed and created data marts in data warehouse implementations on MS SQL Server 2008, creating complex stored procedures and views with T-SQL in SQL Server Management Studio.
- Collected data from many sources, converted it into comma-delimited flat text files, and imported it into SQL Server for data manipulation.
- Responsible for deploying reports to Report Manager and troubleshooting any errors during execution.
- Scheduled reports to run daily and weekly in Report Manager and emailed them to the director and analysts for review as Excel sheets.
- Created several reports for claims handling that had to be exported to PDF format.
- Analyzed business requirements and provided excellent and efficient solutions.
- Contributed to JRD sessions to gather requirements and define functional requirements.
- Utilized Python for recurring reports automation and visualized them on the BI platform.
- Worked collaboratively within and across development and project teams in a fast-paced work environment, utilizing Agile BI design.
- Designed complex T-SQL queries and user-defined functions in SQL server.
- Utilized T-SQL queries and views based on business reporting requirements, performance tuning, and various complex query optimization.
- Implemented dynamic SQL to develop customizable queries answerable by the OLTP server.
- Removed duplicate records by cleansing data with ranking functions and CTEs in SQL (see the dedupe sketch after this list).
- Created ETL packages for data conversion using various transformation tasks.
- Worked on Dimensional modeling, Data migration, Data cleansing, and ETL Processes for data warehouses.
- Incorporated SSIS to load data into the data warehouse’s data mart with star schemas.
- Created summary and detail reports utilizing drill down and drill through functionalities in SSRS.
- Optimized SQL queries with execution plans to enhance performance.
- Assisted in the profiling of the legacy database tables using T-SQL queries to provide data cleansing issue reports to the client.
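A minimal sketch of the ranking-function/CTE dedupe pattern referenced above, executed from Python via pyodbc for consistency with the other examples; the table, business key, and connection string are hypothetical.

```python
# Minimal dedupe: rank rows within each business key with ROW_NUMBER(),
# then delete every copy except the most recent one.
import pyodbc

CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=.;DATABASE=claims;Trusted_Connection=yes")

DEDUPE_SQL = """
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY claim_id          -- business key defining a duplicate
               ORDER BY modified_date DESC    -- keep the most recent row
           ) AS rn
    FROM dbo.claims_staging
)
DELETE FROM ranked WHERE rn > 1;               -- T-SQL allows DELETE through a CTE
"""

with pyodbc.connect(CONN_STR) as conn:
    cursor = conn.cursor()
    cursor.execute(DEDUPE_SQL)
    conn.commit()
    print(f"Removed {cursor.rowcount} duplicate rows")
```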
Environment: SQL Server 2008, Microsoft Visual Studio 2008, MS Office, T-SQL, ETL, SSIS, SQL Profiler, Erwin, SSMS, SSDT, SSRS, Agile, Python.
