Data Engineer Resume
Columbus, OH
PROFESSIONAL SUMMARY:
- Experience in analysis, design, development, and implementation of data warehousing and other applications using object-oriented analysis and design methodologies.
- Worked on various ETL projects involving data integration, data conversion, and data quality, using profiling, parsing, cleansing, standardization, and matching to ensure data correctness, with tools such as SAP BusinessObjects Data Services/Data Integrator, SSIS, Informatica, and Talend.
- Business intelligence experience with complete life-cycle implementations using Business Objects (Supervisor, Designer, InfoView, Business Objects Reporting, Business Objects SDK, Xcelsius, Web Intelligence, Publisher, BO Set Analyzer), QuickSight, QlikView, and Tableau.
- Worked on data extraction, transformation, and loading for databases such as Oracle, Teradata, DB2, Sybase ASE, and MS SQL Server.
- Designed and developed data marts following star schema and snowflake schema methodologies, using industry-leading data modeling tools such as Erwin and Embarcadero ER/Studio.
- Experience with the Hadoop ecosystem (Spark, Sqoop, Hive, Flume, HBase, Kafka, Oozie).
- Worked on projects to migrate data from on-premises databases to Redshift, RDS, and S3.
- Designed and engineered on-premises-to-cloud CI/CD Docker pipelines (integration and deployment) with ECS, Glue, Lambda, ELK, Spark on Databricks, Kinesis Data Firehose, and Kinesis Data Streams.
- Used Python for data analysis with NumPy and pandas, and big data stacks such as PySpark (a short sketch follows this summary).
- Responsible for enterprise master data, metadata management, data quality, and data governance strategy planning and implementation.
- Created a metadata management program with business and technical definitions and data lineage.
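A minimal illustration of the pandas-based profiling and cleansing work summarized above; the file name and column names are hypothetical placeholders rather than details from an actual project.

    # Hedged pandas sketch: basic profiling and cleansing of a source extract.
    # "customers.csv" and its columns are hypothetical placeholders.
    import pandas as pd

    df = pd.read_csv("customers.csv")

    # Profiling: shape, null counts per column, and duplicate business keys.
    print(df.shape)
    print(df.isna().sum())
    print(df.duplicated(subset=["customer_id"]).sum())

    # Cleansing/standardization: trim whitespace, normalize case, drop duplicates.
    df["email"] = df["email"].str.strip().str.lower()
    df = df.drop_duplicates()
    df.to_csv("customers_clean.csv", index=False)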
PROFESSIONAL EXPERIENCE:
Data Engineer
Confidential
Responsibilities:
- Designing and building multi-terabyte, end-to-end data warehouse infrastructure from the ground up on Confidential Redshift, handling millions of records every day at large scale.
- Optimizing and tuning the Redshift environment, enabling queries to run up to 100x faster for Tableau and Amplitude.
- Keeping data properly secured, including encryption, permission management, and handling of PII.
- Create and maintain data pipeline architecture.
- Assemble large, complex data sets that meet business needs.
- Migrated on-premises database structures to the Confidential Redshift data warehouse; working on publishing interactive data visualization dashboards, reports, and workbooks in Tableau and Amplitude.
- Work on Lambda functions written in Python to filter and map data.
- Work with event-driven and scheduled Lambda functions to trigger various AWS resources.
- Extracting data from Teradata and loading it into Redshift and S3 using AWS Glue.
- Running automated Teradata scripts using PySpark in Glue to load files into S3.
- Used Python to write scripts that delete data in Redshift to maintain a 62-day data retention window (a retention-script sketch follows this list).
- Used Python in AWS Glue to unnest nested JSON data coming in from the analytics SDK for the iOS and Android apps and to load the data into AWS Redshift (a PySpark unnesting sketch also follows this list).
- Developed a data lake using AWS services including Athena, S3, EC2, Glue, and QuickSight.
- Used Python in AWS Glue to extract data from Teradata and ran automated daily scripts to load data into S3 and Redshift.
- Used Python to merge web data files coming in from Adobe Analytics and ingest the data into the machine learning recommendation model in AWS Personalize.
- Implemented machine learning models that generate card recommendations driven by user personalization of content, using Python and AWS Personalize.
- Implemented a machine learning propensity model using AWS SageMaker by ingesting care-call data extracted from Teradata and loaded into an S3 bucket.
- Trained the machine learning propensity model to reduce the number of calls made to care.
- Responsible for designing logical and physical data models for various data sources on Confidential Redshift.
- Using Python to automate the creation of a new table in AWS Redshift, triggered from the S3 bucket when a new SDK event is created.
- Used Python on EMR to unnest JSON data and load it in CSV format into the S3 data hub.
- Designed and developed AWS Glue jobs to extract and load data from JSON files into the data mart in Redshift.
- Created a CI/CD data pipeline to ingest data into Amplitude and AWS S3.
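A minimal sketch of the 62-day Redshift retention script mentioned above, assuming a psycopg2 connection to Redshift; the schema, table, and timestamp column names are hypothetical placeholders.

    # Hedged sketch: delete Redshift rows older than the 62-day retention window.
    # Table/column names and connection settings are hypothetical placeholders.
    import os
    import psycopg2

    RETENTION_DAYS = 62

    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        # Remove events older than the retention window; commit happens on exit.
        cur.execute(
            "DELETE FROM analytics.sdk_events "
            "WHERE event_ts < DATEADD(day, -%s, GETDATE());",
            (RETENTION_DAYS,),
        )
    conn.close()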
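A minimal PySpark sketch of the JSON unnesting step described above, as it might look inside a Glue or EMR job; the S3 paths and nested field names are hypothetical, and the final Redshift load is left out.

    # Hedged PySpark sketch: flatten nested analytics-SDK JSON before loading.
    # S3 paths and field names are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.appName("unnest-sdk-events").getOrCreate()

    raw = spark.read.json("s3://example-bucket/raw/sdk_events/")

    # Promote nested struct fields to top-level columns and explode the items array.
    flat = raw.select(
        col("event_id"),
        col("device.os").alias("device_os"),
        col("device.app_version").alias("app_version"),
        explode(col("items")).alias("item"),
    ).select("event_id", "device_os", "app_version", "item.sku", "item.price")

    # Write the flattened rows back to S3; a Glue job or Redshift COPY would follow.
    flat.write.mode("overwrite").parquet("s3://example-bucket/flat/sdk_events/")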
Environment: AWS Redshift, AWS Data Pipeline, AWS S3, Python, AWS Glue, AWS Lambda, AWS CodePipeline, AWS Personalize, AWS SageMaker, Teradata, Amplitude, RDS, Kinesis, QuickSight, Adobe Analytics
Costco Corporate, Seattle, WA
Data Engineer
Confidential, Columbus, OH
Responsibilities:
- Working with functional analysts and business users to understand the requirements for data validation between the source and legacy systems and to formulate exception-record reports for each interface.
- Responsible for managing the BODS COEX interface validation jobs from initial design to final implementation and deployment.
- Documenting the analysis of the existing ETL jobs with information about data flows, tables, stored procedures, sources, targets, and their equivalents.
- Defined database datastores to allow Data Services/Data Integrator to connect to the source and target databases.
- Using BusinessObjects Data Services for ETL: extracting, transforming, and loading data into the staging database from heterogeneous source systems.
- Working on the low-level detailed design documents for all the validation interface jobs.
- Verifying duplicate records against their parent objects in order to remove all duplicate records.
- Creating starting and ending scripts for each job, sending job notifications to users via scripts, and declaring local and global variables.
- Comparing source S/4HANA tables with target legacy iSeries tables after the data is loaded into iSeries through the COEX interface, and flagging missing and mismatched records by creating validation tables as part of the data validation process (a comparison sketch follows this list).
- Worked on developing scripts for data analysis using Python.
- Spun up HDInsight clusters and used Hadoop ecosystem tools such as Kafka, Spark, and Databricks for real-time streaming analytics, and Sqoop, Pig, Hive, and Cosmos DB for batch jobs.
- Migrating and testing the validation jobs for each interface process type in different instances and validating the data by comparing source and target tables.
- Automating all error handling, error escalation, and email notification procedures.
- Performing unit testing and integration testing for all the interface validation BODS jobs.
- Building SAP Information Steward scorecards and custom BusinessObjects reports for data quality reporting.
- Developing Power BI reports and dashboards from multiple data sources using data blending.
- Developing and executing advanced data profiling and data validation scenarios.
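A minimal PySpark sketch of the source-versus-target comparison described in the bullets above, shown in PySpark rather than in the BODS jobs themselves; the table names, join key, and compared column are hypothetical stand-ins for the S/4HANA and iSeries structures.

    # Hedged sketch: flag records missing or mismatched between source and target.
    # Table names, key column, and compared field are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, when

    spark = SparkSession.builder.appName("coex-validation").getOrCreate()

    source = spark.table("staging.source_orders")  # stand-in for the S/4HANA extract
    target = spark.table("staging.target_orders")  # stand-in for the iSeries extract

    joined = source.alias("s").join(
        target.alias("t"),
        col("s.order_id") == col("t.order_id"),
        "left",
    )

    # Null-safe comparison is omitted for brevity.
    validated = joined.withColumn(
        "status",
        when(col("t.order_id").isNull(), "MISSING_IN_TARGET")
        .when(col("s.amount") != col("t.amount"), "MISMATCH")
        .otherwise("OK"),
    )

    # Persist only the exception records as a validation table for reporting.
    exceptions = validated.filter(col("status") != "OK").select(
        col("s.order_id").alias("order_id"),
        col("s.amount").alias("source_amount"),
        col("t.amount").alias("target_amount"),
        "status",
    )
    exceptions.write.mode("overwrite").saveAsTable("validation.order_exceptions")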
Data Engineer
Confidential
Responsibilities:
- Involved in gathering business requirements, functional requirements, and data specifications.
- Used Information Steward Data Insight for data quality analysis and data profiling, and to define validation rules for data migration.
- Created data quality scorecards with various quality dimensions defined by rules in Information Steward.
- Developed a BI data lake using AWS services including Athena, S3, EC2, Glue, and QuickSight.
- Designed and developed ETL jobs to extract data from SQL Server and load it into the data mart in Redshift.
- Scripted advanced SQL queries in PostgreSQL (for Redshift) and MySQL, leveraging common table expressions, subqueries, correlated subqueries, window functions, and complex join logic to apply business rules.
- Used PySpark in AWS Glue to convert files from CSV to Parquet format (a conversion sketch follows this list).
- Generated ad hoc reports in Excel Power Pivot and shared them with decision makers through Power BI for strategic planning.
- Worked on publishing interactive data visualization dashboards, reports, and workbooks in Tableau and QuickSight.
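A minimal sketch of the CSV-to-Parquet conversion mentioned above, using the AWS Glue DynamicFrame API with standard Glue job boilerplate; the bucket names and prefixes are hypothetical placeholders.

    # Hedged AWS Glue sketch: convert CSV files on S3 to Parquet.
    # Bucket names and prefixes are hypothetical placeholders.
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the raw CSV data from S3 as a DynamicFrame.
    csv_frame = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://example-bucket/raw/csv/"]},
        format="csv",
        format_options={"withHeader": True},
    )

    # Write it back out as Parquet (columnar, compressed) for downstream queries.
    glue_context.write_dynamic_frame.from_options(
        frame=csv_frame,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/parquet/"},
        format="parquet",
    )

    job.commit()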
Environment: SAP BO Data Services 4.2, SAP Information Steward, AWS Redshift, AWS S3, AWS Data Pipeline, Python, Tableau, QuickSight, SAP ECC, SQL Server, SAP ISU, SAP HANA, Power BI
BI/Data Engineer
Confidential
Responsibilities:
- Designed, developed and implemented solutions with data warehouse, ETL, data analysis and BI reporting technologies.
- Extensively worked on Data Services for migrating data between databases.
- Implemented various performance optimization techniques, such as caching and pushing memory-intensive operations down to the database server.
- Designed and implemented a test environment on AWS.
- Created S3 buckets, managed bucket policies, and utilized S3 and Glacier for storage and backup on AWS.
- Transferred data from AWS S3 to AWS Redshift (a load sketch follows this list).
- Responsible for designing, building, testing, and deploying dashboards, universes, and reports using Web Intelligence in BusinessObjects.
- Worked on developing MapReduce scripts in Python.
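A minimal sketch of the S3-to-Redshift transfer noted above, issuing a COPY command from Python via psycopg2; the table, S3 path, IAM role, and connection settings are hypothetical placeholders.

    # Hedged sketch: load S3 data into Redshift with a COPY command issued from Python.
    # Table name, S3 path, IAM role, and connection details are hypothetical placeholders.
    import os
    import psycopg2

    copy_sql = """
        COPY analytics.sales_staging
        FROM 's3://example-bucket/exports/sales/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """

    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)  # commit happens when the connection block exits
    conn.close()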
Environment: SAP BO Data Services 4.1, SAP Business Objects 4.1, SAP ECC, SAP Information Steward, Python, AWS S3, AWS Redshift, AWS Data Pipeline, SAP HANA, IBM DB2, Power BI
Hadoop Engineer
Confidential
Responsibilities:
- Installed and configured Apache Hadoop to test the maintenance of log files in the Hadoop cluster.
- Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
- Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
- Set up and benchmarked Hadoop/HBase clusters for internal use.
- Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
- Developed simple to complex MapReduce jobs using Hive and Pig.
- Developed MapReduce programs for data analysis and data cleaning (a streaming-style sketch in Python follows this list).
- Developed Pig Latin scripts for the analysis of semi-structured data.
- Developed and contributed to industry-specific UDFs (user-defined functions).
- Created Hive tables, was involved in loading data into them, and wrote Hive UDFs.
- Used Sqoop to import data into HDFS and Hive from other data systems.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Migrated ETL processes from RDBMS to Hive to test ease of data manipulation.
- Developed Hive queries to process the data for visualization.
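The MapReduce log-analysis programs above were written in Java; the sketch below shows the same counting idea as a Hadoop Streaming mapper and reducer in Python. Treating the HTTP status code as the ninth whitespace-separated field is an assumption about the log layout.

    #!/usr/bin/env python
    # Hedged Hadoop Streaming sketch in Python (the original jobs were Java):
    # count HTTP status codes in web-server logs.
    import sys

    def mapper():
        for line in sys.stdin:
            fields = line.split()
            if len(fields) > 8:
                print(f"{fields[8]}\t1")  # emit (status_code, 1)

    def reducer():
        # Hadoop sorts map output by key, so counts can be accumulated per run of keys.
        current_key, count = None, 0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t")
            if key != current_key:
                if current_key is not None:
                    print(f"{current_key}\t{count}")
                current_key, count = key, 0
            count += int(value)
        if current_key is not None:
            print(f"{current_key}\t{count}")

    if __name__ == "__main__":
        # Invoked via hadoop-streaming with this file as both mapper ("map")
        # and reducer ("reduce"), passing the mode as the first argument.
        mapper() if sys.argv[1] == "map" else reducer()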
Environment: Apache Hadoop, HDFS, Cloudera Manager, CentOS, Java, MapReduce, Eclipse, Hive, Pig, Sqoop, Oozie, and SQL.