Sr. Data Engineer Resume
Indianapolis, IN
SUMMARY
- Over 7 years of experience as a Data Engineer working on the development of client/server and multi-tiered applications using Python, PySpark, Spark, Hadoop, HDFS, Hive, AWS, Oracle Database, SQL, PL/SQL and T-SQL on platforms including Windows, UNIX and Linux.
- IT experience in Data Warehousing, Data Analysis, ETL, BI and Business Analytics.
- Strong experience writing scripts with the Python, PySpark and Spark APIs to analyze data.
- Extensively used Python libraries including PySpark, pytest, PyMongo, cx_Oracle, pyexcel, Boto3, psycopg2, embedPy, NumPy and BeautifulSoup.
- Manipulated data from Hadoop, Spark and the NoSQL database MongoDB; used Docker to move data into containers per business requirements.
- Experienced in developing end-to-end Spark applications using PySpark to perform data cleansing, validation, transformation and summarization activities according to requirements (a minimal PySpark sketch follows this summary).
- Experienced in developing Python code to retrieve and manipulate data from AWS Redshift, Oracle 11g/12c, T-SQL, MongoDB, MS SQL Server, Excel and Flat files.
- Experienced in configuring AWS environments to extract data from various sources and load the data into the Redshift columnar database using distribution and sort keys.
- Expertise in data extraction, cleansing, transformation, integration, data analysis, and logical/physical relational and dimensional database modeling and design.
- Experience in designing and developing dashboards and reports in Tableau and Power BI by extracting data from sources such as Oracle, flat files and Excel.
- Experienced in building automated regression scripts in Python to validate ETL processes across databases such as Oracle, SQL Server, Hive and MongoDB.
- Extensive experience in database analysis, development and maintenance of business applications using Oracle 12c/11g/10g/9i/8i and PL/SQL Developer.
- Expertise in developing SQL and PL/SQL scripts for data analysis in distributed Oracle environments.
- Experience in Database design using Normalization and E/R diagrams.
- Extensively worked on PL/SQL Object Types, Dynamic SQL, Collections, Autonomous transaction, Compound triggers, Materialized Views and Table Partitioning.
- Exposure to different types of testing, including automation, system and integration, functional, regression, smoke, database and performance testing.
- Wrote simple and complex SQL queries using DML, DDL, table joins, group and grouping functions, analytical functions and the PARTITION BY clause for reports and application development. Extensively used Oracle tools such as TOAD and SQL Navigator.
- Expertise in Python and Scala; wrote user-defined functions (UDFs) for Hive and Pig in Python.
- Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Hands-on experience with AWS services including S3, EC2, EMR, SNS, SQS, Lambda, Redshift, Data Pipeline, Athena, AWS Glue, S3 Glacier, CloudWatch, CloudFormation, IAM, AWS Single Sign-On, Key Management Service, AWS Transfer for SFTP, VPC, SES, CodeCommit and CodeBuild.
- Expertise in AWS services with a focus on big data analytics, enterprise data warehousing and business intelligence solutions to ensure optimal architecture, scalability and flexibility.
- Familiar with VBA coding; reverse-engineered Excel macros to prepare requirements documents. Experience in data extraction, migration, transformation and error handling.
- Knowledge of data modeling for data warehouse/data mart development and data analysis for Online Transaction Processing (OLTP), Data Warehousing (OLAP) and Business Intelligence (BI) applications. Experience in writing complex SQL queries and creating materialized views.
- Experience with the AWS Cloud platform and its features, including EC2, AMI, EBS, CloudWatch, AWS Config, Auto Scaling, IAM user management and AWS S3.
- Experience in developing Spark jobs using Scala for faster real-time analytics and using Spark SQL for querying.
- Excellent understanding of the software development lifecycle (SDLC).
- Experience with Agile tools including Rally, Kanban and JIRA Agile.
- Excellent communication and interpersonal skills, with the ability to manage responsibilities individually or as part of a team environment.
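The following is a minimal, illustrative PySpark sketch of the cleansing/validation/summarization work described above; the input path, column names and rules are hypothetical placeholders rather than details from any specific project.

```python
# Illustrative only: paths, column names and rules below are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleanse_and_summarize").getOrCreate()

# Read raw data (hypothetical CSV source)
raw = spark.read.csv("s3://example-bucket/raw/orders/", header=True, inferSchema=True)

# Cleansing: drop exact duplicates and rows missing key fields
clean = (raw.dropDuplicates()
            .dropna(subset=["order_id", "order_date"]))

# Validation/transformation: keep non-negative amounts, normalize a status column
clean = (clean.filter(F.col("amount") >= 0)
              .withColumn("status", F.upper(F.trim(F.col("status")))))

# Summarization: daily totals per status
summary = (clean.groupBy("order_date", "status")
                .agg(F.sum("amount").alias("total_amount"),
                     F.count("*").alias("order_count")))

summary.write.mode("overwrite").parquet("s3://example-bucket/analytics/orders_daily/")
```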
PROFESSIONAL EXPERIENCE
Sr. Data Engineer
Confidential, Indianapolis, IN
Responsibilities:
- Responsibilities include analyzing, troubleshooting, resolving and documenting reported issues; understanding existing system business logic and performing enhancement and impact analysis of the applications.
- Document all requirements received from the business team and optimize them for effective solutions.
- Coordinate with business analysts, business clients and various functional teams across the application to gather requirements and provide the best solution approach.
- Developed ETL programs in Python to move data from source systems to the analytics area.
- Used Python to retrieve and manipulate data from AWS Redshift, Oracle 11g/12c, MongoDB, MS SQL Server (T-SQL), Excel and flat files.
- Demonstrated the ability to move data between production systems and across multiple platforms.
- Once the data was loaded into the analytics area, transformed it for analytics purposes based on the identified business requirements.
- Developed automated regression scripts in Python to validate ETL processes across databases such as AWS Redshift, Oracle, MongoDB and SQL Server (T-SQL).
- Used Python libraries such as pytest, PyMongo, cx_Oracle, pyexcel, Boto3, psycopg2, SOAP clients, embedPy, NumPy and BeautifulSoup depending on the module and business requirement.
- Configured the AWS environment to extract data from various sources and loaded the data into Redshift using distribution and sort keys.
- Used SFTP to transfer the raw-data files from the source system to AWS S3.
- Worked on ETL migration by developing and deploying AWS Lambda functions that form a serverless data pipeline whose output is registered in the Glue Data Catalog and can be queried from Athena (see the sketch following this list).
- Maintained the ELK stack (Elasticsearch, Kibana) and wrote Spark scripts using the Scala shell.
- Implemented Spark jobs in Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
- Create complex data models and process flow diagrams. Create and maintain high-level design (HLD), detailed design (DD), unit test plan (UTP) and business-process documentation based on requirements gathered from business analysts and business users, using industry-standard methodology.
- Development/enhancement of Oracle PL/SQL programs: creating tables, views, sequences, database triggers, cursors, stored procedures and functions with exception handling and indexing, plus optimization and tuning of procedures and SQL queries to improve performance.
- Set up lifecycle policies to archive data from AWS S3 to AWS Glacier; worked with various AWS, EC2 and S3 CLI tools.
- Developed Python programs and Excel functions using VBScript to move and transform data.
- Development/enhancement of UNIX shell scripts; troubleshooting of production issues.
- Developed data analysis tools using SQL and Python code.
- Wrote UDFs in Scala and stored procedures to meet specific business requirements.
- Replaced existing MapReduce programs and Hive queries with Spark applications written in Scala.
- Work closely with upper management and onshore/offshore consultants across teams on development, maintenance, QA and testing, and production support of the compensation system.
- Design, development and enhancement of various types of reports.
- Worked on report conversion and generation using Python (pyexcel module).
- Worked on data quality, data organization, metadata and data profiling.
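A minimal sketch of the serverless pattern referenced above: a Lambda function that refreshes the Glue Data Catalog so newly landed S3 data becomes queryable from Athena. The database, table and bucket names are hypothetical assumptions, and the original project code may have taken a different approach (for example, registering partitions directly via the Glue API).

```python
# Illustrative Lambda handler: refreshes partitions for a Glue/Athena table after
# new objects land in S3. All names below are hypothetical placeholders.
import boto3

athena = boto3.client("athena")

DATABASE = "analytics_db"                       # hypothetical Glue database
TABLE = "raw_events"                            # hypothetical Glue table
OUTPUT = "s3://example-bucket/athena-results/"  # hypothetical query-result location


def lambda_handler(event, context):
    # Triggered by an S3 put event; ask Athena to discover any new partitions
    # so the freshly landed data is immediately queryable.
    response = athena.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE {TABLE}",
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT},
    )
    return {"query_execution_id": response["QueryExecutionId"]}
```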
Environment: Python, R/RStudio, SQL, CSV/XML files, Oracle, JSON, Cassandra, MongoDB, AWS, Snowflake, Hadoop, Hive, MapReduce, Scala, Spark, J2EE, Agile, Apache Avro, Apache Maven, Kafka, MLlib, regression, Docker, Tableau, Git, Jenkins.
Data Engineer
Confidential, Westlake, TX
Responsibilities:
- Developed machine learning, statistical analysis and data visualization applications for challenging data processing problems in the sustainability and biomedical domains.
- Analyzed the requirements and designed the program flow.
- Participated in the full life cycle of this project including information gathering, analysis, design, development, testing and support of this module.
- Responsible for identifying the regression test cases.
- Identify and analyze data discrepancies and data quality issues and work to ensure data consistency and integrity.
- Designed and created database objects and developed ETL programs in Python for data extraction and transformation from multiple sources.
- Created and edited procedures and functions to meet changing business requirements.
- Created complex stored procedures, SQL joins and other statements to maintain referential integrity and implement complex business logic.
- Modified tables, synonyms, sequences, views, stored procedures and triggers.
- Analyzed and defined critical test cases from a regression standpoint to be added to the master regression suite.
- Prepared detailed design documentation including ETL data mapping documents and report specifications.
- Worked in Agile Methodology (Scrum) to meet timelines with quality deliverables.
- Reviewed database programs to produce UML models.
- Developed a tool using Excel + VBA to connect to the database and summarize instrument data for any given day. Implemented database triggers based on business rules and requirements.
- Extensively worked with Oracle SQL, PL/SQL, SQL*Plus and query performance tuning; created DDL scripts and database objects such as tables, views, indexes, synonyms and sequences; migrated existing data from MS Access to Oracle.
- Utilize AWS services with a focus on big data analytics, enterprise data warehousing and business intelligence solutions to ensure optimal architecture, scalability and flexibility.
- Designed AWS architecture and cloud migration using AWS EMR, DynamoDB and Redshift, with event processing via Lambda functions.
- Performed regression testing for golden test cases from the state (end-to-end test cases) and automated the process using Python scripts.
- Developed parallel reports using SQL and Python to validate the daily, monthly and quarterly reports (see the sketch following this list).
- Involved in writing JSP scripts to build the front-end application.
- Unit tested PL/SQL packages, procedures and functions according to business requirements.
- Documented every phase, including technical specifications, source-to-target mappings, data notes and release notes. Assisted in testing and deployment of the application.
- Responsible for importing and exporting data from various data sources such as SQL Server databases, flat files, MS Access, MS Excel and other OLE DB providers via the Import and Export Wizard. Created various visualizations and reports using VBA.
- Independently handle the deployment and support activities on Finance and SCM modules.
- Assisted with functional requirement specifications and use case diagrams to streamline the business flow. Used SQL Developer to load/extract data to and from Excel files.
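A minimal sketch of the parallel-report validation idea mentioned above: the same aggregate is produced independently by the SQL pipeline and recomputed in pandas, then compared. The file names and column names are hypothetical placeholders, not details from the original reports.

```python
# Illustrative validation sketch: recompute a daily report from detail data and
# compare it with the report produced by the primary SQL pipeline.
# File names and column names are hypothetical placeholders.
import pandas as pd

# Report as produced by the existing SQL-based pipeline
sql_report = pd.read_csv("daily_report_sql.csv")   # columns: report_date, region, total

# Recompute the same aggregate independently from detail-level data
detail = pd.read_csv("transactions_detail.csv")    # columns: report_date, region, amount
py_report = (detail.groupby(["report_date", "region"], as_index=False)["amount"]
                   .sum()
                   .rename(columns={"amount": "total"}))

# Compare the two versions and flag any rows that disagree
merged = sql_report.merge(py_report, on=["report_date", "region"],
                          how="outer", suffixes=("_sql", "_py"), indicator=True)
mismatches = merged[(merged["_merge"] != "both") |
                    (merged["total_sql"].round(2) != merged["total_py"].round(2))]

if mismatches.empty:
    print("Daily report validated: SQL and Python totals match.")
else:
    print(f"{len(mismatches)} mismatching rows found:")
    print(mismatches)
```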
Environment: Machine learning, AWS, MS Azure, Cassandra, Spark, Avro, HDFS, GitHub, Hive, Pig, Linux, Python (Scikit-Learn/Scipy/Numpy/Pandas), R, SAS, SPSS, MySQL, Bitbucket, Eclipse, XML, PL/SQL, SQL connector, JSON, Tableau, Jenkins.
Data Engineer
Confidential
Responsibilities:
- Used pandas, NumPy, Seaborn, SciPy, Matplotlib, scikit-learn and NLTK in Python to develop various machine learning algorithms, and applied algorithms such as linear regression, multivariate regression, naive Bayes, random forests, K-means and KNN for data analysis.
- Designed and developed complex mappings using Lookup, Expression, Update, Sequence Generator, Aggregator, Router, Stored Procedure and other transformations to implement complex logic.
- Designed workflows with many sessions using Decision, Assignment, Event Wait and Event Raise tasks, and used the Informatica scheduler to schedule jobs.
- Developed stored procedures, functions and packages using SQL and PL/SQL.
- Developed complex MERGE statements to update and insert data after the load.
- Created indexes on the tables for faster retrieval of the data to enhance database performance.
- Developed a statistics collection script to improve application performance.
- Developed stored procedures to automatically drop and recreate table indexes and partitioning.
- Optimized SQL query performance using partitioning, Oracle hints, indexes and statistics collection.
- Developed a Python script to run SQL queries in parallel for the initial load of data into target tables (see the sketch following this list).
- Developed a Python script to extract data from the mobile MongoDB instance and load it into the Oracle database.
- Developed a Python script to reconcile data between source and target using list lookups.
- Developed UNIX shell scripts to run SQL files based on initial-load instructions.
- Developed SQL*Loader scripts to load data into staging tables, applying SQL*Loader performance tuning.
- Developed an Informatica release management tool for exporting and importing workflows.
- Effectively used Informatica parameter files to define mapping variables, workflow variables, FTP connections and relational connections.
- Developed ETL batch automation using shell scripting for QA functional testing.
- Tuned Informatica mappings/sessions for better ETL performance by eliminating bottlenecks in Lookup transformations.
- Effectively communicated with cross-vertical teams for end-to-end (E2E) testing.
- Used reusable transformations at various levels of development.
- Participated in daily scrum meetings and obtained reviews from the Product Owner.
- Developed Informatica mappings to read data from web services and load it into the Oracle database.
- Effectively created table structures and implemented indexing and partitioning.
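A minimal sketch of running initial-load SQL files in parallel, as referenced in the bullet above. The connection details, credentials and file names are hypothetical placeholders; the original script's structure may have differed.

```python
# Illustrative sketch: execute several initial-load SQL files concurrently,
# each on its own Oracle connection. All names/credentials are placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed
import cx_Oracle

DSN = cx_Oracle.makedsn("db-host", 1521, service_name="ORCLPDB1")        # placeholder
SQL_FILES = ["load_customers.sql", "load_orders.sql", "load_items.sql"]  # placeholders


def run_sql_file(path):
    """Open a dedicated connection and execute each statement in the file."""
    with cx_Oracle.connect(user="etl_user", password="***", dsn=DSN) as conn:
        cursor = conn.cursor()
        with open(path) as f:
            # naive split on ';' is sufficient for simple DML-only load scripts
            statements = [s.strip() for s in f.read().split(";") if s.strip()]
        for stmt in statements:
            cursor.execute(stmt)
        conn.commit()
    return path


# Run the load files concurrently, one worker per file
with ThreadPoolExecutor(max_workers=len(SQL_FILES)) as pool:
    futures = [pool.submit(run_sql_file, f) for f in SQL_FILES]
    for fut in as_completed(futures):
        print(f"finished {fut.result()}")
```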
Environment: Erwin, Python, SQL, SQL Server, Informatica, SSRS, PL/SQL, T-SQL, Tableau, MLlib, Spark, Kafka, MongoDB, logistic regression, Hadoop, Hive, OLAP, Azure, MariaDB, SAP CRM, HDFS, SVM, JSON, Tableau, XML, AWS.
Data Analyst
Confidential
Responsibilities:
- Performed data wrangling to clean, transform and reshape data using the pandas library. Analyzed data using SQL, R, Java, Scala, Python and Apache Spark and presented analytical reports to management and technical teams.
- Worked with datasets that included both structured and unstructured data and participated in all phases of data mining: data cleaning, data collection, variable selection, feature engineering, model development, validation and visualization.
- Developed predictive models on large scale datasets to address various business problems through leveraging advanced statistical modeling, machine learning and deep learning.
- Implemented public segmentation with unsupervised machine learning, applying the K-means algorithm in PySpark after data munging (see the sketch following this list).
- Experience in machine learning, including NLP text classification using Python.
- Worked on different machine learning models such as logistic regression, multilayer perceptron classifiers and K-means clustering.
- Led discussions with users to gather business-process and data requirements and develop a variety of conceptual, logical and physical data models.
- Expertise in Business intelligence and Data Visualization tools like Tableau.
- Handled importing data from various data sources, performed transformations using Hive, MapReduce and loaded data into HDFS.
- Good knowledge in Azure cloud services, Azure Storage to manage and configure the data.
- Used R and Python for Exploratory Data Analysis to compare and identify the effectiveness of the data.
- Created clusters to classify control and test groups.
- Analyzed and calculated the lifetime cost of each individual in a welfare system using 20 years of historical data.
- Used Python, R and SQL to create statistical algorithms involving multivariate regression, linear regression, logistic regression, PCA, random forest models, decision trees and SVM for estimating and identifying the risks of welfare dependency.
- Designed and implemented a recommendation system that leveraged Google Analytics data and machine learning models, using collaborative filtering techniques to recommend policies to different customers.
- Performed analyses such as regression analysis, logistic regression, discriminant analysis and cluster analysis using SAS programming.
- Worked with NoSQL databases including Cassandra, MongoDB and HBase to assess their advantages and disadvantages for the project's goals.
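A minimal sketch of K-means segmentation in PySpark, as referenced above. The input path, feature columns and number of clusters are hypothetical assumptions; in the actual work, k would have been chosen from the data (e.g. via silhouette or elbow analysis).

```python
# Illustrative K-means segmentation in PySpark; the input path, feature columns
# and number of clusters are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("segmentation").getOrCreate()

df = spark.read.parquet("s3://example-bucket/cleaned/profiles/")  # placeholder source

# Assemble and scale the numeric features used for clustering
assembler = VectorAssembler(inputCols=["age", "income", "visits"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features",
                        withMean=True, withStd=True)

assembled = assembler.transform(df)
scaled = scaler.fit(assembled).transform(assembled)

# Fit K-means with an assumed k; the cluster label is written to the "segment" column
kmeans = KMeans(k=5, seed=42, featuresCol="features", predictionCol="segment")
model = kmeans.fit(scaled)

segmented = model.transform(scaled)
segmented.groupBy("segment").count().show()
```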
Environment: Machine Learning, R Language, Hadoop, Big Data, Python, DB2, MongoDB, Web Services.