Sr. Data Engineer Resume
Boston, MA
PROFESSIONAL SUMMARY:
- 8+ years of professional experience in IT working with various legacy database systems, along with hands-on experience in Big Data technologies.
- More than 5 years of experience as a Data Engineer and 2+ years of experience in Data Analysis.
- Good experience architecting, designing, and operating large-scale data and analytics solutions on the Snowflake Cloud Data Warehouse.
- Experience in requirement gathering, system analysis, handling business and technical issues, and communicating with both business and technical users.
- Hands-on experience with the complete Software Development Life Cycle (SDLC) using Agile and hybrid methodologies.
- Experience in analyzing data using the Big Data ecosystem, including HDFS, Hive, HBase, ZooKeeper, Pig, Sqoop, and Flume.
- Knowledge of and working experience with big data tools such as Hadoop, Azure Data Lake, and AWS Redshift.
- Good understanding of Apache Airflow.
- Experience in workflow scheduling with Airflow, AWS Data Pipelines, Azure, SSIS, etc.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Good understanding of Big Data Hadoop and YARN architecture along with Hadoop daemons such as Job Tracker, Task Tracker, Name Node, Data Node, and Resource/Cluster Manager, as well as Kafka (distributed stream processing).
- Experience in text analytics and data mining solutions for various business problems, and in generating data visualizations using SAS and Python.
- Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming data to uncover insights into customer usage patterns.
- Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
- Strong experience and knowledge of NoSQL databases such as MongoDB and Cassandra.
- Development and support experience with Oracle, SQL, PL/SQL, and T-SQL queries.
- Experienced in configuring and administering the Hadoop Cluster using major Hadoop Distributions like Apache Hadoop and Cloudera.
- Excellent experience creating cloud-based solutions and architectures using Amazon Web Services (Amazon EC2, Amazon S3, Amazon RDS) and Microsoft Azure.
- Experienced in technical consulting and end-to-end delivery covering architecture, data modeling, data governance, and the design, development, and implementation of solutions.
- Experience in Big Data Hadoop Ecosystem in ingestion, storage, querying, processing and analysis of big data.
- Extensive working experience in agile environment using a CI/CD model.
- Extensive experience working with structured data using Spark SQL, DataFrames, and HiveQL, optimizing queries, and incorporating complex UDFs into business logic.
TECHNICAL SKILLS:
Big Data & Hadoop Ecosystem: MapReduce, Spark 2.3, HBase 1.2, Hive 2.3, Flume 1.8, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0
ETL Tools: Informatica 10.1/9.6.1 (PowerCenter/PowerMart) (Designer, Workflow Manager, Workflow Monitor, Server Manager, Power Connect)
NoSQL DB: HBase, Azure SQL DB, Cassandra 3.11, Big Table
Reporting Tools: Power BI, Tableau and Crystal Reports 9
Cloud Platforms: AWS (EC2, S3, Redshift), MS Azure (Azure Synapse, ADF, Blob Storage, Azure Databricks), GCP (BigQuery, Google Cloud SDK)
Programming Languages: PySpark, Python, SQL, PL/SQL, UNIX shell Scripting, AWK
RDBMS: Microsoft SQL Server 2017, Teradata 15.0, Oracle 12c, and MS Access
Operating Systems: Microsoft Windows 8 and 10, UNIX and Linux.
Methodologies: Agile, RAD, JAD, RUP, UML, System Development Life Cycle (SDLC), Waterfall Model.
WORK EXPERIENCE:
Confidential - Boston, MA
Sr. Data Engineer
Responsibilities:
- As a Data Engineer, assisted in leading the plan, build, and run phases within the Enterprise Analytics Team.
- Engaged in solving and supporting real business issues using knowledge of Hadoop distributed file systems and open-source frameworks.
- Involved in migrating the on-premises Hadoop system to GCP (Google Cloud Platform).
- Performed detailed analysis of business problems and technical environments and used this analysis in designing the solution and maintaining the data architecture.
- Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.
- Designed and built data pipelines to load data into the GCP platform.
- Monitored data engines to define data requirements and data acquisition from both relational and non-relational databases.
- Designed efficient and robust Hadoop solutions for performance improvement and end-user experiences.
- Designed and architected the various layers of the Data Lake.
- Designed star schema in BigQuery.
- Used data integration to manage data with speed and scalability using the PySpark engine in Dataproc.
- Used REST APIs with Python to ingest data from external sites into BigQuery.
- Built a program with Python and Apache Beam and executed it in Cloud Dataflow to run data validation between raw source files and BigQuery tables (see the Beam sketch after this list).
- Developed a custom Python program, including CI/CD rules, for Google Cloud Data Catalog metadata management.
- Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
- Wrote a program to download the SQL dump during site maintenance and then load it into a GCS bucket.
- Loaded this SQL dump from the GCS bucket into MySQL (hosted in Google Cloud SQL) and loaded the data from MySQL to BigQuery using Python, Scala, Spark, and Dataproc.
- Used Google Cloud Functions with Python to load data into BigQuery on arrival of CSV files in the GCS bucket (see the Cloud Function sketch after this list).
- Designed and developed scalable data pipelines for optimal ingestion, transformation, storage, and computation using the latest big data technologies.
- Developed Spark code using Scala for faster testing and processing of data.
- Monitored BigQuery, Dataproc, and Cloud Dataflow jobs via Stackdriver for the entire environment.
- Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of BigQuery usage.
- Submitted Spark jobs using gsutil and spark-submit and had them executed on the Dataproc cluster.
- Wrote a Python program to maintain raw file archival in GCS bucket.
- Wrote Scala programs for Spark transformations in Dataproc.
- Developed a Tableau report that tracks the dashboards published to Tableau Server, which helps identify potential future clients in the organization.
- Assisted service developers in finding relevant content in the existing models.
- Actively involved in business meetings and team meetings.
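The Beam/Dataflow validation bullet above refers to comparing raw source files against BigQuery tables; below is a minimal sketch of that idea. All bucket, project, and table names are placeholders, not the actual project resources.

```python
# Minimal sketch: compare the row count of a raw CSV in GCS against the
# corresponding BigQuery table. Paths and table names are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

RAW_FILE = "gs://example-raw-bucket/events/2020-01-01.csv"  # hypothetical path
BQ_TABLE = "example-project:analytics.events"               # hypothetical table

def run():
    options = PipelineOptions(runner="DataflowRunner", project="example-project",
                              region="us-central1", temp_location="gs://example-tmp/tmp")
    with beam.Pipeline(options=options) as p:
        raw_count = (
            p
            | "ReadRaw" >> beam.io.ReadFromText(RAW_FILE, skip_header_lines=1)
            | "CountRaw" >> beam.combiners.Count.Globally()
        )
        bq_count = (
            p
            | "ReadBQ" >> beam.io.ReadFromBigQuery(table=BQ_TABLE)
            | "CountBQ" >> beam.combiners.Count.Globally()
        )
        # Pair the two counts and flag any mismatch.
        (
            (raw_count, bq_count)
            | "PairCounts" >> beam.Flatten()
            | "CollectCounts" >> beam.combiners.ToList()
            | "Compare" >> beam.Map(lambda counts: "MATCH" if len(set(counts)) == 1
                                    else f"MISMATCH: {counts}")
            | "Log" >> beam.Map(print)
        )

if __name__ == "__main__":
    run()
```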
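The Cloud Functions bullet above describes loading CSV files into BigQuery as they arrive in GCS; a minimal sketch of one way to wire that up is shown below, with a hypothetical dataset and table name.

```python
# Minimal sketch of a GCS-triggered Cloud Function that loads a newly arrived
# CSV file into BigQuery. The target table is a placeholder.
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Background Cloud Function triggered by a GCS object-finalize event."""
    bucket = event["bucket"]
    name = event["name"]
    if not name.endswith(".csv"):
        return  # ignore non-CSV objects

    client = bigquery.Client()
    table_id = "example-project.analytics.events"  # hypothetical target table
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    uri = f"gs://{bucket}/{name}"
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # wait for the load job to finish
```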
Technical Tools: Hadoop, G-cloud, BigQuery, GCS, MySQL, SQL, Spark, Scala, Python, Dataproc, Apache Beam, Agile/Scrum Methodology
Confidential - Johnston, RI
Sr. Data Engineer
Responsibilities:
- Worked at Confidential as a Data Engineer, collaborating with other Product Engineering team members to develop, test, and support data-related initiatives.
- Assisted in leading the plan, build, and run phases within the Enterprise Analytics Team.
- Led the estimation, reviewed the estimates, identified complexities, and communicated them to all stakeholders.
- Engaged in solving and supporting real business issues using knowledge of Hadoop distributed file systems and open-source frameworks.
- Responsible for data governance rules and standards to maintain the consistency of the business element names in the different data layers.
- Built the data pipelines that will enable faster, better, data-informed decision-making within the business.
- Identified data within different data stores, such as tables, files, folders, and documents to create a dataset in pipeline using Azure HDInsight.
- Performed detailed analysis of business problems and technical environments and used this analysis in designing the solution and maintaining the data architecture.
- Migrated the on-premises environment to the cloud using MS Azure.
- Worked on migration of data from On-prem SQL server to Cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB).
- Performed data flow transformation using the data flow activity.
- Performed ongoing monitoring, automation, and refinement of data engineering solutions.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks (see the PySpark sketch after this list).
- Developed mapping document to map columns from source to target.
- Created Azure Data Factory (ADF) pipelines using Azure PolyBase and Azure Blob Storage.
- Performed ETL using Azure Databricks.
- Wrote UNIX shell scripts to support and automate the ETL process.
- Worked on Python scripting to automate script generation; performed data curation using Azure Databricks.
- Used data integration to manage data with speed and scalability using the Apache Spark engine in Azure Databricks.
- Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.
- Designed efficient and robust Hadoop solutions for performance improvement and end-user experiences.
- Worked in a Hadoop ecosystem implementation/administration, installing software patches along with system upgrades and configuration.
- Performed data transformations in Hive and used partitions and buckets for performance improvements.
- Continuously monitored and managed data pipeline (CI/CD) performance alongside applications from a single console with Azure Monitor.
- Ingested data into HDFS using Sqoop and scheduled an incremental load to HDFS.
- Worked with the Hadoop infrastructure to store data in HDFS and used Hive SQL to migrate the underlying SQL codebase to Azure.
- Extensively involved in writing PL/SQL, stored procedures, functions and packages.
- Created partitioned tables in Hive, designed a data warehouse using Hive external tables, and created Hive queries for analysis (see the Spark SQL sketch after this list).
- Performed data scrubbing and processing with Apache NiFi for workflow automation and coordination.
- Developed Simple to complex streaming jobs using Python and Hive.
- Optimized Hive queries to extract the customer information from HDFS.
- Involved in scheduling Oozie workflow engine to run multiple Hive jobs.
- Analyzed partitioned and bucketed data using Hive and computed various metrics for reporting.
- Built Azure Data Warehouse Table Data sets for Power BI Reports.
- Worked on BI reporting with AtScale OLAP for Big Data.
- Developed customized classes for serialization and deserialization in Hadoop.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
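The ADF/Databricks bullet above mentions complex transformations with PySpark; the sketch below illustrates the general shape of such a job, with illustrative mount paths and column names rather than the real ones.

```python
# Minimal sketch of a PySpark transformation of the kind run in Databricks:
# read raw files, clean and aggregate, and write a curated partitioned output.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("curate-sales").getOrCreate()

raw = (spark.read
       .option("header", True)
       .option("inferSchema", True)
       .csv("/mnt/raw/sales/"))           # hypothetical mount point

curated = (raw
           .dropDuplicates(["order_id"])
           .withColumn("order_date", F.to_date("order_date"))
           .filter(F.col("amount") > 0)
           .groupBy("order_date", "region")
           .agg(F.sum("amount").alias("total_amount"),
                F.countDistinct("customer_id").alias("customers")))

(curated.write
 .mode("overwrite")
 .partitionBy("order_date")
 .parquet("/mnt/curated/sales_daily/"))   # hypothetical output path
```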
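The Hive external-table bullet above can be illustrated through Spark SQL (one possible route in a Hadoop/Databricks environment); the database, columns, and storage location below are placeholders.

```python
# Minimal sketch: create a partitioned Hive external table and query it via
# Spark SQL with Hive support enabled.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-external-tables")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.web_events (
        user_id   STRING,
        url       STRING,
        duration  INT
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION '/data/warehouse/web_events'
""")

# Register newly landed partitions, then run an analysis query.
spark.sql("MSCK REPAIR TABLE analytics.web_events")
daily = spark.sql("""
    SELECT event_date, COUNT(DISTINCT user_id) AS daily_users
    FROM analytics.web_events
    GROUP BY event_date
""")
daily.show()
```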
Technical Tools: Hadoop, Spark, Kafka, Azure Databricks, ADF, Python, PySpark, HDFS, ETL, Agile & Scrum meetings
Confidential - Eden Prairie, MN
Data Engineer
Responsibilities:
- As a Data Engineer, I was responsible for building a data lake as a cloud-based solution in AWS using Apache Spark and Hadoop.
- Involved in Agile methodologies, daily Scrum meetings, Sprint planning.
- The objective of this project was to build a cloud-based data lake in AWS.
- Installed and configured Hadoop and responsible for maintaining cluster and managing and reviewing Hadoop log files.
- Used AWS Cloud and on-premise environments with infrastructure provisioning/configuration.
- Used EMR (Elastic MapReduce) to perform big data operations in AWS.
- Used the Agile Scrum methodology to build the different phases of Software development life cycle.
- Worked on AWS Redshift and RDS for implementing models and data on RDS and Redshift and designed and implemented Near Real Time ETL and Analytics using Redshift.
- Designed and customized data models for the data warehouse, supporting data from multiple sources in real time.
- Designed ETL strategies for load balancing and exception handling, and designed processes that can satisfy high data volumes.
- Worked on ETL migration services by developing and deploying AWS Lambda functions to generate a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena (see the Lambda sketch after this list).
- Contributed to the development of key data integration and advanced analytics solutions leveraging Apache Hadoop.
- Wrote complex Hive queries to extract data from heterogeneous sources (Data Lake) and persist the data into HDFS.
- Developed Big Data solutions focused on pattern matching and predictive modeling.
- Developed the code for importing and exporting data into HDFS and Hive using Sqoop.
- Developed a data pipeline using Kafka, HBase, Spark, and Hive to ingest, transform, and analyze customer behavioral data (see the streaming sketch after this list).
- Developed Spark jobs and Hive Jobs to summarize and transform data.
- Developed a reconciliation process to make sure the Elasticsearch index document count matches the source records.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Implemented Sqoop to transfer data from Oracle to Hadoop and load it back in Parquet format.
- Developed incremental and full-load Python processes to ingest data into Elasticsearch from an Oracle database.
- Used Hive to analyze data ingested into HBase via Hive-HBase integration and computed various metrics for dashboard reporting.
- Created Hive external tables to stage data and then moved the data from staging to the main tables.
- Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
- Loaded data through HBase into Spark RDDs and implemented in-memory computation to generate the output response.
- Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
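The Lambda/Glue/Athena bullet above describes a serverless pipeline; below is a minimal sketch of that pattern using boto3, with hypothetical crawler, database, and bucket names.

```python
# Minimal sketch: an S3-triggered Lambda kicks off a Glue crawler so new files
# are registered in the Glue Data Catalog and become queryable from Athena.
import boto3

glue = boto3.client("glue")
athena = boto3.client("athena")

CRAWLER_NAME = "raw-events-crawler"             # hypothetical crawler
ATHENA_OUTPUT = "s3://example-athena-results/"  # hypothetical results bucket

def handler(event, context):
    # A new object landed in S3 -> refresh the Glue Catalog.
    glue.start_crawler(Name=CRAWLER_NAME)

    # Optionally run a validation query against the cataloged table via Athena.
    athena.start_query_execution(
        QueryString="SELECT COUNT(*) FROM raw_db.events",
        QueryExecutionContext={"Database": "raw_db"},
        ResultConfiguration={"OutputLocation": ATHENA_OUTPUT},
    )
```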
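The Kafka/HBase/Spark/Hive pipeline bullet above involves streaming ingestion; the sketch below covers only the Kafka-to-Spark Structured Streaming leg, with placeholder broker, topic, schema, and output paths.

```python
# Minimal sketch: read customer behavioral events from Kafka, parse the JSON
# payload, and land the result for downstream Hive/HBase processing.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

schema = StructType([
    StructField("customer_id", StringType()),
    StructField("action", StringType()),
    StructField("event_time", TimestampType()),
])

spark = SparkSession.builder.appName("behavior-stream").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
          .option("subscribe", "customer-behavior")           # hypothetical topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "/data/staging/behavior/")           # hypothetical path
         .option("checkpointLocation", "/data/checkpoints/behavior/")
         .outputMode("append")
         .start())
query.awaitTermination()
```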
Technical Tools: Hadoop 2.7, Spark 2.7, Hive, Sqoop 1.4.6, AWS, HBase, Kafka 2.6.2, Python 3.6, HDFS, Elasticsearch & Agile Methodology
Confidential - Greensboro, NC
Data Analyst
Responsibilities:
- Heavily involved in a Data Analyst role to review business requirements and compose source-to-target data mapping documents.
- Extensively used the Agile method, with daily scrums to discuss project-related information.
- Provided a summary of the project's goals, the specific expectations of business users from BI, and how they align with the project goals.
- Provided suggestions to implement multitasking for the existing Hive architecture in Hadoop and also suggested UI customization in Hadoop.
- Installed and Configured Hadoop cluster using Amazon Web Services (AWS) for POC purposes.
- Connected to AWS Redshift through Tableau to extract live data for real time analysis.
- Developed Data mapping, Transformation and Cleansing rules for the Data Management involving OLTP and OLAP.
- Created logical and physical data models for relational (OLTP) and star-schema fact and dimension tables using Erwin.
- Used AWS S3 buckets to store files and ingested them into Snowflake tables using Snowpipe, running deltas using data pipelines (see the Snowpipe sketch after this list).
- Worked on complex SnowSQL and Python queries in Snowflake.
- Developed and maintained a data dictionary to create metadata reports for technical and business purposes.
- Resolved AML-related issues to ensure adoption of standards and guidelines in the organization; resolved day-to-day issues and worked with users and the testing team on the resolution of issues and fraud-incident-related tickets.
- Used Erwin tool to develop a Conceptual Model based on business requirements analysis.
- Used SQL Server Integration Services (SSIS) for extraction, transformation, and loading of data into the target system from multiple sources.
- Worked on normalization techniques. Normalized the data into 3rd Normal Form (3NF).
- Involved in Data profiling in order to detect and correct inaccurate data and maintain the data quality.
- Extensively used star schema methodologies in building and designing the logical data model into dimensional models.
- Updated Python scripts to match data with our database stored in AWS Cloud.
- Normalized the database based on the newly developed model to bring it into the 3NF of the data warehouse.
- Involved in extensive data validation by writing several complex SQL queries.
- Performed data cleaning and data manipulation activities using the NZSQL utility.
- Designed the data marts using Ralph Kimball's dimensional data mart modeling methodology with Erwin.
- Implemented Snowflake schema to ensure no redundancy in the database.
- Worked with the MDM systems team with respect to technical aspects and generating reports.
- Extracted large data sets from AWS Redshift and the Elasticsearch engine using SQL queries to create reports.
- Executed change management processes surrounding new releases of SAS functionality.
- Worked in importing and cleansing of data from various sources.
- Performed data cleaning, feature scaling, and feature engineering using packages in Python.
- Created and developed stored procedures and triggers to handle complex business rules, history data, and audit analysis.
- Worked on unit testing for three reports and created SQL test scripts for each report as required.
- Used Informatica to extract, transform, and load source data from transaction systems.
- Created PL/SQL packages and Database Triggers and developed user procedures and prepared user manuals for the new programs.
- Created various types of data visualizations using Python and Tableau.
- Wrote and executed unit, system, integration, and UAT scripts in a data warehouse project.
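The Snowpipe bullet above describes auto-ingesting S3 files into Snowflake; below is a minimal sketch of the stage-and-pipe setup issued through the Snowflake Python connector. Account, credentials, and object names are placeholders.

```python
# Minimal sketch: create an external stage over the S3 bucket and a pipe that
# auto-ingests new files into the target table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",   # hypothetical account identifier
    user="example_user",
    password="example_password",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)
cur = conn.cursor()

cur.execute("""
    CREATE STAGE IF NOT EXISTS raw_stage
    URL = 's3://example-raw-bucket/orders/'
    CREDENTIALS = (AWS_KEY_ID = 'xxx' AWS_SECRET_KEY = 'yyy')
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

cur.execute("""
    CREATE PIPE IF NOT EXISTS orders_pipe AUTO_INGEST = TRUE AS
    COPY INTO RAW.ORDERS
    FROM @raw_stage
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")

cur.close()
conn.close()
```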
Technical Tools: Erwin 9.0, Agile, OLTP, OLAP, Snowflake, SnowSQL, AWS, EC2, MDM, SAS, SQL, PL/SQL.