Sr. Data Engineer Resume
Weehawken, NJ
SUMMARY
- Overall 7+ years of experience as a Data Engineer and Data Analyst, with expertise in data mapping, data validation, and statistical data analysis, including transforming business requirements into analytical models, designing algorithms, and delivering machine learning and strategic solutions that scale across massive volumes of data. Experienced with a variety of big data technologies, tools, and databases, including Spark, Hive, Python, SQL, AWS, Snowflake, Hadoop, Sqoop, Cassandra (CDL), Teradata, Tableau, and Redshift.
- Strong experience in the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
- Strong experience in writing scripts with the Python and Spark (PySpark) APIs for analyzing data.
- Involved in setting up a Jenkins master and multiple slave nodes for the entire team as the CI tool, as part of the continuous integration and deployment process.
- Installed and configured Apache Airflow for workflow management, created workflows in Python, and built Airflow DAGs to run jobs sequentially (a minimal DAG sketch follows this summary).
- Involved in converting Hive queries into Spark actions and transformations by creating RDDs and DataFrames from the required files in HDFS.
- Migrated an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for processing and storing small data sets, and experienced in maintaining the Hadoop cluster on AWS EMR.
- Very good experience working in Azure Cloud, Azure DevOps, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), Azure HDInsight, big data technologies (Hadoop and Apache Spark), and Databricks.
- Hands-on experience with Spark Core, Spark SQL, and Spark Streaming, and with creating and handling DataFrames in Spark with Scala.
- Experience with NoSQL databases, including table row-key design and loading and retrieving data for real-time data processing, with performance improvements driven by data access patterns.
- Strong experience in the analysis, design, development, testing, and implementation of Business Intelligence solutions using Data Warehouse/Data Mart design, ETL, BI, and client/server applications, and in writing ETL scripts using regular expressions and tools such as Informatica, Pentaho, and SyncSort.
- Strong experience in migrating other databases to Snowflake. In-depth knowledge of Snowflake Database, Schema and Table structures.
- Hands-on experience with dimensional modeling using star and snowflake schemas. Experienced in optimizing PySpark jobs to run on Kubernetes clusters for faster data processing.
- Firm understanding of Hadoop Stack architecture and various components including HDFS, Job Tracker, Task Tracker, Name Node, Data Node and MapReduce programming.
- Experienced in requirement analysis, application development, application migration and maintenance using Software Development Lifecycle (SDLC) and Python technologies.
- Defined user stories, drove the Agile board in JIRA during project execution, and participated in sprint demos and retrospectives.
- Good understanding of web design based on HTML5, CSS3, and JavaScript.
- Deep understanding of MapReduce with Hadoop and Spark. Good knowledge of Big Data ecosystem like Hadoop 2.0 (HDFS, Hive, Pig, Impala), Spark (Spark SQL, Spark Streaming).
- Excellent performance in building and publishing customized interactive reports and dashboards with custom parameters and user filters, including tables, graphs, and listings, using Tableau.
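A minimal sketch of the kind of Airflow DAG mentioned in the summary, assuming Airflow 2.x; the DAG id, schedule, and task callables are hypothetical placeholders rather than code from any specific project.

```python
# Illustrative Airflow 2.x DAG that runs three tasks sequentially.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("extracting")      # placeholder: pull source data


def transform():
    print("transforming")    # placeholder: clean/shape the data


def load():
    print("loading")         # placeholder: write to the warehouse


with DAG(
    dag_id="daily_etl",                  # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load   # sequential execution
```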
PROFESSIONAL EXPERIENCE
Confidential, Weehawken, NJ
Sr. Data Engineer
Responsibilities:
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Experience utilizing Azure Databricks, Azure SQL, PostgreSQL, and SQL Server to Extract, Transform, and Load data from a wide variety of sources into target databases
- Used Spark SQL to create ETL solutions in Azure Databricks for data extraction, transformation, and aggregation from a variety of file formats and data sources for analyzing and transforming the data to reveal insights on client usage patterns.
- Hands-on working knowledge of creating and deploying Databricks-based big data analytics apps on the Azure platform.
- Experience in using Databricks to create and optimize the Data Analytics system on Azure.
- Migrated some of the existing pipelines to Azure Databricks using PySpark Notebooks for the analytical team.
- Performed systems integration design and development within the Azure cloud architecture.
- Created the automated build and deployment process for the application and improved application setup for a better user experience, leading up to building a continuous integration system.
- Developed a PySpark script to mask raw data by applying hashing algorithms to client-specified columns (see the sketch following this list).
- Worked on analyzing data across the Hadoop stack using different big data analytic tools, including Pig, Hive, and MapReduce.
- Worked with the Data Science team to gather requirements for various data mining projects.
- Worked with different source data file formats like JSON, CSV, TSV, etc.
- Experience importing data from various sources such as MySQL and Netezza using Sqoop and SFTP, performing transformations using Hive and Pig, and loading the data back into HDFS.
- Developed Spark applications in Python (PySpark) on a distributed environment to load a huge number of CSV files with differing schemas into Hive ORC tables.
- Performed transformations, cleaning, and filtering on imported data using Hive and MapReduce.
- Worked with Azure PaaS solutions such as Azure Web Apps, Web Roles, Worker Roles, SQL Azure, and Azure Storage.
- Experience loading tables from Azure Data Lake to Azure Blob Storage before pushing them to Snowflake.
- Developed ETL pipelines into and out of the data warehouse and created significant financial and regulatory reports utilizing Snowflake's sophisticated SQL queries.
- Processed and loaded various gold layer tables from Delta Lake into Snowflake
- Worked on integrating Azure Databricks with Snowflake.
- Hands-on experience in implementing Apache Airflow workflow for organizing and scheduling Hadoop workloads.
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
- Worked on partitioning and bucketing Hive tables and set tuning parameters to improve performance (see the DDL sketch at the end of this section).
- Used Agile methodologies - Scrums, Sprints, tracking of tasks using JIRA management tool.
- Converted applications that were on MapReduce to PySpark which performed the business logic.
- Very good experience working with Azure Databricks, Azure Cloud, Azure DevOps, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), Azure HDInsight, and big data technologies (Hadoop stack and Apache Spark).
- Designed and developed Oracle PL/SQL and shell scripts for data import/export, data conversions, and data cleansing.
- Responsible for importing data from PostgreSQL into HDFS and Hive using Sqoop, and into HBase using Spark.
- Developed Spark applications using Scala and Spark SQL for data extraction, transformation, and aggregation from multiple file formats; used Kafka integrated with Spark Streaming; and developed data analysis tools using SQL and Python.
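A minimal PySpark sketch of the column-hashing approach described above, assuming SHA-256 via pyspark.sql.functions.sha2; the input/output paths and column names are hypothetical.

```python
# Illustrative masking of client-specified columns with SHA-256 in PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mask-sensitive-columns").getOrCreate()

df = spark.read.parquet("/mnt/raw/customers")        # hypothetical raw path

sensitive_cols = ["ssn", "email", "phone"]           # hypothetical client list

for col_name in sensitive_cols:
    # sha2(col, 256) returns the hex-encoded SHA-256 digest of the value
    df = df.withColumn(col_name, F.sha2(F.col(col_name).cast("string"), 256))

df.write.mode("overwrite").parquet("/mnt/curated/customers_masked")
```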
Environment: Azure Databricks, Spark, Hive, HBase, Sqoop, Flume, MapReduce, HDFS, SQL, Apache Kafka, Apache Airflow, Snowflake, Azure Cloud, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Python.
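A minimal sketch of the partitioned and bucketed Hive table setup referenced above, issued through Spark SQL with Hive support; the table name, columns, bucket count, and tuning parameter are illustrative assumptions.

```python
# Illustrative partitioned + bucketed Hive table created through Spark SQL.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("hive-table-setup").enableHiveSupport().getOrCreate()
)

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_bucketed (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(12, 2)
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
""")

# Example tuning parameter often set alongside dynamic partitioning
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
```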
Confidential, Columbus, OH
Data Engineer
Responsibilities:
- Implemented Big data technologies such as Hadoop, Map Reduce Frameworks, HBase, and Hive for ingesting data from diverse sources and processing Data-at-Rest.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Used Hadoop-ecosystem technologies such as Spark and Hive, including the PySpark library to create Spark DataFrames and convert them to pandas DataFrames for analysis.
- Used various Python libraries such as pandas, NumPy, Matplotlib, and SciPy.
- Extensively performed large data read/write to and from CSV and Excel files using pandas.
- Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
- Handled multiple data set operations such as subsetting, slicing, filtering, group-by, re-ordering, and re-shaping using Python libraries like pandas.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for small data set processing and storage, and experienced in maintaining the Hadoop cluster on AWS EMR. Designed and built a Data Lake using Hadoop and its ecosystem components.
- Developed Airflow jobs to ingest data from RDBMS Systems like Teradata and Oracle database into S3 buckets.
- Experienced in configuring Apache Airflow for S3 buckets and the Snowflake data warehouse, and created DAGs to run the Airflow jobs.
- Automatically scaled up EMR instances based on the data, and stored the transformed time-series data from the Spark engine, built on top of the Hive platform, in Amazon S3 and Redshift.
- Worked extensively with AWS services like EC2, S3, VPC, ELB, Auto Scaling Groups, Route 53, IAM, CloudTrail, CloudWatch, CloudFormation, CloudFront, SNS, and RDS.
- Developed Python scripts to parse XML and JSON files and load the data into the Snowflake data warehouse on AWS (see the sketch following this list).
- Experience integrating data from various source systems, including importing nested JSON formatted data into Snowflake tables, using cloud data warehouses like Snowflake and AWS S3 buckets.
- ETL development using EMR/Hive/Spark, Lambda, Scala, DynamoDB Streams, Amazon Kinesis Firehose, Redshift and S3.
- Worked on developing an event-based data processing pipeline using AWS Lambda, SNS, and DynamoDB Streams (see the Lambda handler sketch at the end of this section).
- Worked with data investigation, discovery, and mapping tools to scan every single data record from many sources.
- Handled importing of data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
- Extracted and exported the data from Teradata into HDFS using Sqoop.
- Analyzed the data by performing Hive queries and running Pig scripts to know user behavior.
- Monitoring and managing the Hadoop cluster through Cloudera Manager.
- Installed the Oozie workflow engine to run multiple Hive jobs.
- Developed Hive queries to process the data and generate the data cubes for visualizing.
- Coordinated with business users to provide an appropriate, effective, and efficient design for new reporting needs, based on user requirements and existing functionality.
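A minimal sketch of a Python load of parsed JSON into Snowflake, as described above, assuming the snowflake-connector-python package; the connection parameters, source file, table, and field names are hypothetical.

```python
# Illustrative parse-and-load of newline-delimited JSON into a Snowflake table.
import json

import snowflake.connector

with open("events.json") as f:                       # hypothetical source file
    records = [json.loads(line) for line in f]

conn = snowflake.connector.connect(
    account="my_account",                            # placeholder credentials
    user="etl_user",
    password="***",
    warehouse="ETL_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

for rec in records:
    # PARSE_JSON stores the nested payload in a VARIANT column
    cur.execute(
        "INSERT INTO raw_events (event_id, event_type, payload) "
        "SELECT %s, %s, PARSE_JSON(%s)",
        (rec["event_id"], rec["event_type"], json.dumps(rec.get("payload", {}))),
    )

cur.close()
conn.close()
```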
Environment: Spark, Hive, HBase, Sqoop, Cosmos DB, MapReduce, HDFS, Cloudera, SQL, Apache Kafka, AWS, Lambda, Redshift, Athena, S3, Kubernetes, Python, Pandas, NumPy, Boto3, Unix
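A minimal sketch of an event-based handler for the AWS Lambda / DynamoDB Streams / SNS pipeline mentioned above; the SNS topic ARN and record fields are hypothetical placeholders.

```python
# Illustrative Lambda handler that forwards new DynamoDB items to SNS.
import json
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ.get(
    "ALERT_TOPIC_ARN", "arn:aws:sns:us-east-1:123456789012:placeholder"
)


def handler(event, context):
    # Each record describes a change captured by the DynamoDB stream
    processed = 0
    for record in event.get("Records", []):
        if record.get("eventName") == "INSERT":
            new_image = record["dynamodb"].get("NewImage", {})
            sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(new_image))
            processed += 1
    return {"processed": processed}
```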
Confidential, Washington, PA
Data Engineer
Responsibilities:
- Installed, configured, and maintained data pipelines; developed a data pipeline with Kafka and Spark.
- Developed processes for loading data into Snowflake. Designed the data model and joined the data with other dimension (DIM) tables in DataStage for Tableau reporting.
- Implemented an Azure cloud solution using HDInsight, Event Hubs, Cosmos DB, Cognitive Services, and Key Vault.
- Ingested and prepped business-ready data by building ELT/ETL data pipelines with Azure Data Factory and Azure Databricks (Spark, Scala, Python) into Azure SQL Data Warehouse.
- Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks (see the UDF sketch following this list).
- Evaluated Snowflake design considerations for any change in the application, and built the logical and physical data models in Snowflake as per the required changes.
- Redesigned the views in Snowflake to increase performance and unit-tested the data between the source and Snowflake.
- Developed a data warehouse model in Snowflake for over 100 datasets using WhereScape, and created reports in Looker based on Snowflake connections.
- Developed solutions to leverage ETL tools and identify opportunities for process improvements using Scheduling tool and Python.
- Worked on scheduling all jobs using Airflow scripts in Python, adding different tasks to DAGs, Lambda, and DataStage.
- Designed and implemented an incremental job to read data from DB2 and load it into Hive tables, and connected Tableau through HiveServer2 to generate interactive reports.
- Designed and developed data flows from streaming sources using the available Azure Databricks features.
- Developed Spark applications using Pyspark and Spark-SQL for data extraction, transformation and aggregation from multiple file formats.
- Used AWS services like EC2 and S3 for small data sets processing and storage. Experienced in Maintaining the Hadoop cluster on AWS EMR.
- Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP.
- Developed automated regression scripts in Python to validate the ETL process between multiple databases, such as AWS-hosted databases and SQL Server.
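A minimal sketch of a custom PySpark UDF used for the cleaning and conforming work described above; the column names and normalization rule are hypothetical.

```python
# Illustrative PySpark UDF that conforms a free-text column before aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("conform-columns").getOrCreate()


@F.udf(returnType=StringType())
def normalize_label(value):
    # Trim whitespace and upper-case labels; pass nulls through unchanged
    return value.strip().upper() if value is not None else None


df = spark.read.parquet("/mnt/stage/transactions")    # hypothetical input
df = df.withColumn("category", normalize_label(F.col("category")))
df.groupBy("category").agg(F.sum("amount").alias("total_amount")).show()
```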
Environment: AWS, EC2, ETL, Pyspark, Snowflake, Kafka, Spark, DataStage, Lambda, Hadoop, Tableau, Python, Hive, SQL, Oracle, scheduling tool, Shell scripting.
Confidential
Big Data Engineer
Responsibilities:
- Analyzed and gathered business requirements by interacting with the client and reviewing business requirement specification documents.
- Worked in an Agile environment for the project, tools used were JIRA, GIT.
- Followed up on JIRA activities regarding issues, reports, and related documents, coordinating with the development, business, and testing teams across locations.
- Created Scrum projects, discussed sprints to be added, and participated in standup meetings, burn-down charts, and summary reports.
- Used Git to maintain the repository: creating and merging branches, committing changes, and checking out, moving, and removing files.
- Created data models, stored procedures, views, functions, and queries for data analysis and manipulation; maintained and upgraded databases and created backups in SQL.
- Analyzed the client’s snapshot pages in the web interface in HTML and CSS to spot inconsistencies.
- Developed automated Python scripts for repetitive tasks such as delimiter splitting, character joining, stray-value filtering, date and data format conversions, and regex operations (code matching, replacing, pattern matching); see the cleanup script sketch at the end of this section.
- Parsed the data received after all tests were complete to confirm there were no inconsistencies, and saved the data to the database.
- Involved in importing and exporting data from local and external file systems and RDBMSs to HDFS.
- Managed large datasets using pandas DataFrames and MySQL; queried the MySQL database from Python using the MySQL Connector and MySQLdb packages to retrieve information (see the sketch below).
- Designed and developed a data management system using MySQL.
- Responsible for debugging and troubleshooting the web application. Automated most of the daily tasks using Python scripting.
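A minimal sketch of querying MySQL into a pandas DataFrame, as referenced above, assuming the mysql-connector-python package; the credentials, database, and table are placeholders.

```python
# Illustrative MySQL query pulled into a pandas DataFrame for analysis.
import mysql.connector
import pandas as pd

conn = mysql.connector.connect(
    host="localhost", user="app_user", password="***", database="inventory"
)
cur = conn.cursor()
cur.execute("SELECT sku, qty, updated_at FROM stock WHERE qty < 10")

# cursor.description carries the column names for the result set
df = pd.DataFrame(cur.fetchall(), columns=[col[0] for col in cur.description])
print(df.groupby("sku")["qty"].min())

cur.close()
conn.close()
```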
Environment: Python, Flask, Azure, SDLC, GIT, Agile, MySQL, RDBMS, SOAP, Shell Script, HTML, CSS, JIRA.
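A minimal sketch of the kind of automated cleanup script described in this section (delimiter splitting, stray-value filtering, regex-based date conversion); the file names, delimiter, and patterns are hypothetical.

```python
# Illustrative cleanup: split on a delimiter, drop stray rows, normalize dates.
import csv
import re

DATE_RE = re.compile(r"(\d{2})/(\d{2})/(\d{4})")     # MM/DD/YYYY -> YYYY-MM-DD

with open("raw.txt") as src, open("clean.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for line in src:
        fields = [f.strip() for f in line.split("|")]           # delimiter splitting
        if not fields or fields[0] in ("", "N/A"):              # stray-value filtering
            continue
        fields = [DATE_RE.sub(r"\3-\1-\2", f) for f in fields]  # date reformatting
        writer.writerow(fields)
```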
Confidential
Data Engineer
Responsibilities:
- Used MS Excel, MS Access, and SQL to write and run various queries.
- Worked extensively on creating tables, views, and SQL queries in MySQL.
- Worked with internal architects and assisted in the development of current and target state data architectures.
- Troubleshot, fixed, and deployed many Python bug fixes for the two main applications that were the primary sources of data for both customers and the internal customer service team.
- Wrote Python scripts to parse JSON documents and load the data into the database.
- Generated various graphical capacity-planning reports using Python packages such as NumPy and Matplotlib (see the sketch at the end of this section).
- Analyzed various logs as they were generated and predicted/forecasted the next occurrence of an event using various Python libraries.
- Performed Exploratory Data Analysis, trying to find trends and clusters.
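A minimal sketch of a graphical capacity-planning report built with NumPy and Matplotlib, as mentioned above; the data points are synthetic and the linear trend fit is illustrative.

```python
# Illustrative capacity-planning chart: observed usage plus a linear trend line.
import matplotlib
matplotlib.use("Agg")                     # render to a file without a display
import matplotlib.pyplot as plt
import numpy as np

days = np.arange(1, 31)
usage_gb = 500 + 12 * days + np.random.normal(0, 15, size=days.size)  # synthetic data

slope, intercept = np.polyfit(days, usage_gb, 1)   # simple growth projection

plt.plot(days, usage_gb, label="observed usage (GB)")
plt.plot(days, slope * days + intercept, "--", label="linear trend")
plt.xlabel("Day of month")
plt.ylabel("Storage used (GB)")
plt.legend()
plt.savefig("capacity_report.png")
```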