Sr. Data Engineer Resume
Reston, VA
SUMMARY
- 6+ years of overall experience as a Data Engineer, Data Analyst, and ETL Developer, with expertise in designing, developing, and implementing data models for enterprise-level applications using Big Data tools and technologies such as Hadoop, Sqoop, Hive, Spark, and Flume.
- Experienced in working with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Azure Functions, Azure SQL Data Warehouse, and Synapse).
- Proficient in migrating on-premises data sources to Azure Data Lake, Azure SQL Database, Databricks, and Azure SQL Data Warehouse using Azure Data Factory and granting access to users.
- Experienced in developing Spark applications using Spark SQL in Databricks to extract, transform, and aggregate data from multiple file formats, analyzing the data to uncover insights into customer usage patterns.
- Experienced with Azure Data Factory (ADF), Integration Run Time (IR), File System Data Ingestion, and Relational Data Ingestion.
- Experienced with AWS services including S3, EC2, SQS, RDS, Neptune, EMR, Kinesis, Lambda, Step Functions, Terraform, Glue, Redshift, Athena, DynamoDB, Elasticsearch, Service Catalog, CloudWatch, and IAM, and with administering AWS resources using the Console and CLI.
- Hands-on experience building the infrastructure needed for efficient data extraction, transformation, and loading from a range of data sources using SQL and NoSQL offerings from AWS and Big Data technologies (DynamoDB, Kinesis, S3, Hive/Spark).
- Developed and deployed a variety of Lambda functions using the built-in AWS Lambda libraries, as well as Lambda functions written in Scala using custom libraries.
- Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on Amazon Web Services (AWS).
- Strong knowledge in working with Amazon EC2 to provide a complete solution for computing, query processing, and storage across a wide range of applications.
- Experienced in configuring Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS, and expert in using Spark SQL with various data sources such as JSON, Parquet, and Hive (a streaming sketch follows this summary).
- Extensively used Spark Data Frames API over the Cloudera platform to perform analytics on Hive data and used Spark Data Frame Operations to perform required Validations in the data.
- Expertise in developing production-ready Spark applications utilizing Spark-Core, Data Frames, Spark-SQL, Spark-ML, and Spark-Streaming API.
- Expert in using Azure Databricks to compute large volumes of data to uncover insights into business goals.
- Developed Python scripts to do file validations in Databricks and automated the process using Azure Data Factory.
- Experienced in integrating data from diverse sources, including loading nested JSON-formatted data into Snowflake tables, using the AWS S3 bucket and the Snowflake cloud data warehouse.
- Configured Snowpipe to pull data from S3 buckets into Snowflake tables and stored incoming data in the Snowflake staging area.
- Expertise in developing Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows using Airflow.
- Orchestration experience scheduling Apache Airflow DAGs to run multiple Hive and Spark jobs, which run independently based on time and data availability (an example DAG sketch follows this summary).
- Strong experience working with Hadoop ecosystem components like HDFS, Map Reduce, Spark, HBase, Oozie, Hive, Sqoop, Pig, Flume, and Kafka.
- Hands-on experience with Hadoop architecture and components such as the Hadoop Distributed File System (HDFS), Job Tracker, Task Tracker, Name Node, Data Node, and Hadoop MapReduce programming.
- Practical understanding of Data modeling (Dimensional & Relational) concepts like Star-Schema Modelling, Fact, and Dimension tables.
- Strong experience troubleshooting failures in Spark applications and fine-tuning Spark applications and Hive queries for better performance.
- Strong experience writing complex MapReduce jobs, including the development of custom InputFormats and RecordReaders.
- Good exposure to NoSQL databases, including Cassandra (column-oriented) and MongoDB (document-based).
- Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems (RDBMS) and vice-versa.
- Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, and Spark SQL for Data Mining, Data Cleansing, Data Munging, and Machine Learning.
- Expertise in developing various reports and dashboards using Power BI and Tableau.
- Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.
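Below is a minimal PySpark sketch of the kind of Kafka-to-HDFS streaming ingestion described in this summary; the broker address, topic name, and HDFS paths are hypothetical placeholders, and the cluster is assumed to have the spark-sql-kafka connector available.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Hypothetical broker, topic, and HDFS paths used purely for illustration.
KAFKA_BROKERS = "broker1:9092"
TOPIC = "usage-events"
OUTPUT_PATH = "hdfs:///data/usage_events"
CHECKPOINT_PATH = "hdfs:///checkpoints/usage_events"

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the Kafka topic as a streaming DataFrame; key/value arrive as bytes.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", KAFKA_BROKERS)
          .option("subscribe", TOPIC)
          .option("startingOffsets", "latest")
          .load()
          .select(col("key").cast("string"), col("value").cast("string")))

# Persist each micro-batch to HDFS as Parquet, with checkpointing for recovery.
query = (stream.writeStream
         .format("parquet")
         .option("path", OUTPUT_PATH)
         .option("checkpointLocation", CHECKPOINT_PATH)
         .outputMode("append")
         .start())

query.awaitTermination()
```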
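And a minimal Airflow DAG sketch illustrating the time-based scheduling and Hive/Spark job dependencies described above; the DAG name, script paths, and spark-submit command line are assumptions for illustration only.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# Daily workflow: a Hive staging load followed by a Spark aggregation job.
with DAG(
    dag_id="hive_spark_daily_etl",          # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    load_hive_staging = BashOperator(
        task_id="load_hive_staging",
        bash_command="hive -f /opt/etl/hql/load_staging.hql",   # hypothetical HQL script
    )

    spark_aggregate = BashOperator(
        task_id="spark_aggregate",
        bash_command="spark-submit --master yarn /opt/etl/jobs/aggregate_daily.py",  # hypothetical job
    )

    load_hive_staging >> spark_aggregate
```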
TECHNICAL SKILLS
Hadoop Components: HDFS, Hue, MapReduce, Hive, Sqoop, Zookeeper, Flume, Oozie, Spark
Spark Components: Spark RDD, Data Frames, Spark SQL, PySpark, Spark Streaming
Databases: MySQL, MS SQL Server, Oracle, PostgreSQL, MongoDB
Programming Languages: Python, R, Scala, PySpark
ETL Tools: Informatica, DataStage, SSIS
Azure: Azure Data Lake, Data Factory, Azure Databricks, Azure SQL Database, Azure SQL Data Warehouse
AWS: S3, EC2, Redshift, Redis, EMR, Kinesis, Lambda, Step Functions, Terraform, Glue, Athena, DynamoDB, Elasticsearch, Service Catalog, CloudWatch, IAM, SNS, SQS, QuickSight
IDEs: PyCharm, Jupyter Notebook, Visual Studio Code
Build and CI Tools: Azure DevOps, Git, GitLab, GitHub, Bitbucket, Bamboo
NoSQL Databases: HBase, Cassandra, MongoDB, Cosmos DB, DynamoDB
Data Visualization: Tableau, Power BI
PROFESSIONAL EXPERIENCE
Confidential, Reston, VA
Sr. Data Engineer
Responsibilities:
- Extensively worked with the Azure cloud platform (HDInsight, Data Lake, Databricks, Blob Storage, Data Factory, Synapse, Azure SQL Database, SQL Data Warehouse, and Data Storage Explorer).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Designed and configured Azure Cloud relational servers and databases, analyzing current and future business requirements.
- Developed data ingestion pipelines on Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL.
- Configured Input & Output bindings of Azure Function with Azure Cosmos DB collection to read and write data from the container whenever the function executes.
- Developed robust ETL pipelines in Azure Data Factory (ADF) using Linked Services from different sources and loaded the data into Azure SQL Data Warehouse.
- Developed Elastic pool databases and scheduled Elastic jobs to execute T-SQL procedures.
- Developed Spark applications in Azure Databricks using Spark SQL for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (a sketch follows this section).
- Proficient in performing ETL operations in Azure Databricks by connecting to different relational database source systems using JDBC connectors.
- Migrated data from Azure Blob Storage to Azure Data Lake using Azure Data Factory (ADF).
- Developed robust and scalable ETL applications to move Medicaid and Medicare data from Azure Data Lake to the data warehouse using Azure Databricks.
- Built and automated data engineering ETL pipeline over Snowflake DB using Apache Spark and integrated data from disparate sources with Python APIs like PySpark and consolidated them in a data mart (Star schema).
- Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to run streaming analytics in Databricks.
- Worked on PowerShell scripts to automate creation of Azure resource groups, web applications, Azure Storage blobs and tables, and firewall rules.
- Designed and provisioned different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries for the clusters.
- Experienced in tuning Spark applications (batch interval time, level of parallelism, and memory settings) to improve processing time and efficiency.
- Orchestrated Airflow to migrate data from Hive external tables to Azure Blob Storage and optimized existing Hive jobs using partitioning and bucketing.
- Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication, and Apache Ranger for authorization.
- Used Scala for its strong concurrency support, which plays a key role in parallelizing the processing of large data sets.
- Used Enterprise GitHub and Azure DevOps Repos for version control.
- Created branching strategies while collaborating with peer groups and other teams on shared repositories.
- Developed various interactive reports using Power BI based on Client specifications with row-level security features.
Environment: Azure (Data Lake, HDInsight, SQL, Data Factory), Databricks, Cosmos DB, Git, Blob Storage, Power BI, Scala, Hadoop, Spark, PySpark, Airflow.
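A minimal sketch of the kind of Spark SQL transformation run in Azure Databricks on this project; the mount points, column names, and aggregation are hypothetical and only illustrate the extract-transform-load shape described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Hypothetical mount points for raw JSON landed by ADF and the curated output.
RAW_PATH = "/mnt/raw/usage_events/"
CURATED_PATH = "/mnt/curated/usage_by_customer/"

# Extract: read the raw nested JSON files.
events = spark.read.json(RAW_PATH)

# Transform: aggregate usage per customer per day with Spark SQL.
events.createOrReplaceTempView("events")
usage_by_customer = spark.sql("""
    SELECT customer_id,
           to_date(event_ts)      AS usage_date,
           count(*)               AS event_count,
           sum(bytes_transferred) AS total_bytes
    FROM events
    GROUP BY customer_id, to_date(event_ts)
""")

# Load: write curated Parquet partitioned by date for downstream reporting.
(usage_by_customer
 .write.mode("overwrite")
 .partitionBy("usage_date")
 .parquet(CURATED_PATH))
```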
Confidential, MD
Data Engineer
Responsibilities:
- Involved in designing and deploying multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, and IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
- Developed AWS data pipelines from various data sources, including AWS API Gateway to receive responses from AWS Lambda, converted the responses into JSON format, and stored them in Amazon Redshift.
- Developed scalable AWS Lambda code in Python for processing nested JSON files, including converting, comparing, and sorting (a handler sketch follows this section).
- Developed Spark applications using Scala and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Optimized the performance and efficiency of existing Spark jobs and converted MapReduce scripts to Spark SQL.
- Experienced in collecting data from an AWS S3 bucket in real time using Spark Streaming, doing the appropriate transformations and aggregations, and persisting the data in HDFS.
- Implemented the AWS Glue Data Catalog with crawlers to catalog data from S3 and enable SQL query operations.
- Developed robust and scalable data integration pipelines to transfer data from S3 buckets to the Redshift database using Python and AWS Glue (a Glue job sketch follows this section).
- Built and maintained the Hadoop cluster on AWS EMR and used AWS services such as EC2 and S3 for processing and storing small data sets.
- Developed Python code for different tasks, dependencies, and time sensors for each job for workflow management and automation using the Airflow tool.
- Scheduled Spark/Scala jobs using Oozie workflows in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
- Designed reports and interactive dashboards in Tableau based on business requirements.
Environment: AWS EMR, S3, EC2, Lambda, Apache Spark, Spark-Streaming, Spark SQL, Python, Scala, Shell scripting, Snowflake, AWS Glue, Oracle, Git, Tableau.
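A minimal sketch of the kind of Python Lambda handler described above: it flattens a nested JSON payload received through API Gateway and stages it in S3 for a later Redshift load; the bucket name and payload fields are hypothetical.

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
STAGING_BUCKET = "example-redshift-staging"   # hypothetical staging bucket


def flatten(record, parent_key="", sep="."):
    """Recursively flatten nested JSON objects into a single-level dict."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items


def lambda_handler(event, context):
    # With API Gateway proxy integration, the JSON payload arrives in event["body"].
    payload = json.loads(event.get("body") or "{}")
    flat = flatten(payload)

    # Stage the flattened record as JSON for a downstream Redshift COPY.
    key = f"incoming/{uuid.uuid4()}.json"
    s3.put_object(Bucket=STAGING_BUCKET, Key=key, Body=json.dumps(flat))

    return {"statusCode": 200, "body": json.dumps({"staged_key": key})}
```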
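And a minimal AWS Glue job sketch of the S3-to-Redshift load described above; the catalog database, table, Glue connection, and temporary directory are hypothetical names.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled S3 dataset from the Glue Data Catalog (hypothetical names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_zone",
    table_name="s3_events",
)

# Drop rows missing the key field before loading (hypothetical column).
cleaned = source.filter(lambda row: row["event_id"] is not None)

# Write to Redshift through a pre-defined Glue JDBC connection.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=cleaned,
    catalog_connection="redshift-connection",                   # hypothetical connection
    connection_options={"dbtable": "analytics.events", "database": "dw"},
    redshift_tmp_dir="s3://example-glue-temp/redshift/",
)

job.commit()
```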
Confidential, Boston
ETL Developer
Responsibilities:
- Extensively used Informatica client tools: PowerCenter Designer, Workflow Manager, Workflow Monitor, and Repository Manager.
- Used Kafka for live streaming data and performed analytics on it; worked on Sqoop to transfer data between relational databases and Hadoop.
- Built real-time data pipelines using Kafka for streaming data ingestion and Spark Streaming for real-time consumption and processing.
- Loaded data from Web servers and Teradata using Sqoop, Flume, and Spark Streaming API.
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS; implemented a Python-based distributed random forest via Hadoop Streaming.
- Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats, including XML, JSON, CSV, and other compressed formats (a streaming mapper/reducer sketch follows this section).
- Extracted data from various heterogeneous sources such as Oracle and flat files.
- Developed complex mapping using the Informatica Power Center tool.
- Extracted data from Oracle, flat files, and Excel files, and applied Joiner, Expression, Aggregator, Lookup, Stored Procedure, Filter, Router, and Update Strategy transformations to load data into the target systems.
- Worked with the data modeler in developing star schemas.
- Involved in analyzing the availability of source feeds in the existing CSDR database.
- Handled a high volume of day-to-day Informatica workflow migrations.
- Reviewed Informatica ETL design documents and worked closely with the development team to ensure the correct standards were followed.
- Wrote SQL queries against the repository database to find deviations from the company's ETL standards in user-created objects such as sources, targets, transformations, log files, mappings, sessions, and workflows.
- Leveraged existing PL/SQL scripts for daily ETL operations.
- Ensured that all support requests were properly approved, documented, and communicated using the QMC tool; documented common issues and resolution procedures.
- Extensively involved in enhancing and managing Unix Shell Scripts.
- Converted business requirements into technical design documents.
- Documented the macro logic and worked closely with the Business Analyst to prepare the BRD.
- Involved in setting up SFTP with internal bank management.
- Built UNIX scripts to clean up the source files.
- Involved in loading sample source data using SQL*Loader and scripts.
Environment: Informatica, LoadRunner 8.x, HP QC 10/11, Toad, SQL, PL/SQL, Sqoop, Flume.
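A minimal Hadoop Streaming sketch of the kind of Python MapReduce aggregation described above, assuming a hypothetical CSV layout with a category column and an amount column.

```python
#!/usr/bin/env python
"""mapper.py - emit (category, amount) pairs from CSV lines on stdin."""
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) < 3:
        continue                     # skip malformed rows
    category, amount = fields[1], fields[2]
    print(f"{category}\t{amount}")
```

```python
#!/usr/bin/env python
"""reducer.py - sum amounts per category from the sorted mapper output."""
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0.0
    total += float(value)

if current_key is not None:
    print(f"{current_key}\t{total}")
```

The two scripts would be submitted with the standard hadoop-streaming jar (-files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py).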
Confidential Blue, FL
Data Analyst - Python
Responsibilities:
- Experience working on projects with machine learning, big data, data visualization, R and Python development, Unix, and SQL.
- Performed exploratory data analysis using NumPy, matplotlib, and pandas.
- Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand trends and insights.
- Experienced in analyzing data with Python libraries including pandas, NumPy, SciPy, and Matplotlib (a sketch follows this section).
- Created complex SQL queries and scripts to extract and aggregate data and validate its accuracy; gathered business requirements and translated them into clear, concise specifications and queries.
- Prepared high-level analysis reports with Excel and Tableau; provided feedback on data quality, including identification of billing patterns and outliers.
- Worked with Tableau sorting and filtering, including basic sorting, basic filters, quick filters, context filters, condition filters, top filters, and filter operations.
- Identified and documented data quality limitations that jeopardized the work of internal and external data analysts; wrote standard SQL queries to perform data validation; created Excel summary reports (pivot tables and charts); and gathered analytical data to develop functional requirements using data modeling and ETL tools.
- Read data from different sources such as CSV files, Excel, HTML pages, and SQL databases, performed data analysis, and wrote results back to CSV, Excel, or a database.
- Experienced in using lambda functions with filter(), map(), and reduce() on pandas DataFrames to perform various operations.
- Used the pandas API for analyzing time series and created a regression test framework for new code.
- Developed and handled business logic through backend Python code.
Environment: Python, SQL, UNIX, Linux, Oracle, NoSQL, PostgreSQL, and Python libraries such as PySpark and NumPy.
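A minimal pandas sketch of the kind of exploratory analysis and outlier checks described in this section; the file name and column names are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical billing extract used purely for illustration.
df = pd.read_csv("billing_extract.csv", parse_dates=["billing_date"])

# Basic profiling: shape, dtypes, missing values, and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.describe())

# Flag amounts more than three standard deviations from the mean as outliers.
mean, std = df["amount"].mean(), df["amount"].std()
outliers = df[(df["amount"] - mean).abs() > 3 * std]
print(f"{len(outliers)} potential outliers")

# Monthly billing trend for a quick visual sanity check.
monthly = df.set_index("billing_date")["amount"].resample("M").sum()
monthly.plot(title="Monthly billed amount")
plt.tight_layout()
plt.savefig("monthly_billing_trend.png")
```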