Data Engineer Resume
PROFILE:
- Data integration professional with 11+ years of experience across the banking, finance, insurance, and retail industries, spanning projects of varying complexity.
- Roles held so far: Data Engineer, Application Designer, Senior Developer, Senior ETL Developer, Data Quality Developer, Associate - Projects, and Programmer Analyst.
- Experienced in developing ETL applications on AWS, including building data pipelines with Glue, EMR (managed Hadoop), PySpark, Python, Scala, Redshift, and S3.
- Experienced in data ingestion on Google Cloud and in building data pipelines using Apache Beam and Dataflow.
- Extensive experience in developing ETL applications using tools such as IBM InfoSphere DataStage, Informatica PowerCenter, Talend, and SSIS.
- Developed ETL applications using Spring Boot/Spring Batch (Java), Scala, and Python, applying industry-standard best practices and design patterns.
- Completed the Machine Learning program from Stanford University, offered through Coursera.
TECHNICAL SKILLS:
Big Data Stack: Spark (PySpark & Scala), Hive, HDFS, Sqoop, Kafka, AWS Glue, Lambda, Athena, EMR, Step Functions, API Gateway, Kinesis Data Streams, Kinesis Data Firehose, MSK (Kafka), DynamoDB (NoSQL), Redshift, RDS, EC2, S3, CloudWatch, Google Cloud Storage (GCS), BigQuery, Dataflow, Pub/Sub, Cloud Spanner, Apache Beam, Stackdriver
ETL Tools: IBM InfoSphere DataStage 11.3/9.1/8.5, Informatica PowerCenter 10.1.0/9.1.0, Informatica Data Integration Hub, Talend Open Studio for Big Data, SSIS 2012/2014, Diyotta
Data Quality: IBM Information Analyzer 11.3
Databases: Oracle, Sybase, MySQL, SQL Server, DB2, Hive, Postgres
Programming/Scripting: Python 3+ (PySpark, Flask (REST API), marshmallow, SQLAlchemy, unittest); Scala (Core Scala, Spark, ScalaTest, sbt); Java (Core Java 8+, Spring Boot, Spring Batch, Spring REST, Maven, JUnit)
Shell Scripting: Unix Shell
Analytics: R (Predictive Modelling)
IDE: Eclipse, PyCharm, Anaconda, Zeppelin, Hue, Postman
Scheduling: Airflow, TIDAL, JAMS, Autosys, Zeke
CI/CD: Docker, Git, Bitbucket, Jenkins
EMPLOYER:
Confidential
Data Engineer
Responsibilities:
- Authored ETL applications from scratch using AWS Glue, Python/PySpark, S3, and Redshift. The goal was to extract data in various formats from S3, perform transformations, and load the results to S3, Redshift, and DynamoDB. Performed unit testing using the unittest framework (a minimal Glue/PySpark sketch appears after this list).
- Strong experience in ETL using PySpark; used a variety of DataFrame APIs to source and sink data and to perform ETL transformations.
- Worked with different file formats such as Avro, Parquet, JSON, and ORC, and know the use cases for each of them.
- Gained experience in performance tuning across all layers: the storage layer, PySpark/Python code, and other techniques such as bloom filters.
- Extensively wrote Hive queries to perform data analysis and implemented performance tuning using appropriate bucketing and partitioning techniques.
- Performed data discovery in Zeppelin using PySpark/Python to understand source data. This helped to shape business requirement documents for development.
- Developed PySpark/Python modules to parse highly complex data types and shared them with the rest of the team.
- Implemented complex ETL design patterns in PySpark/Python, such as change data capture (CDC) and delta loads, using the relevant DataFrame APIs.
- Participated in key discussions with management and architects to choose the right set of tools for development and end-user analytics.
- Used Bitbucket and Git for code versioning and CI/CD; have good knowledge of the Git lifecycle.
- Created workflows using Step Functions to orchestrate Glue jobs and their dependencies.
- Scheduled Glue jobs using the AWS CLI and the Zeke scheduler; have exposure to building pipelines with Apache Airflow (in Python).
- Attended an agile boot camp and practiced its principles in day-to-day project activities. The project is delivered in three-week sprints with ceremonies such as daily scrum, sprint planning, refinement, retrospective, feedback, and demo.
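For illustration only, a minimal sketch of the kind of Glue/PySpark job described above. The bucket paths, the orders dataset, and the column names are hypothetical placeholders; a real job would also add the Redshift/DynamoDB sinks and unittest coverage mentioned in the bullets.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue boilerplate: resolve job arguments and build the contexts.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw JSON landed in S3 (bucket and prefix are placeholders).
raw = spark.read.json("s3://example-raw-bucket/orders/")

# Example transformations: type the timestamp, derive a partition key,
# and de-duplicate on the business key.
curated = (
    raw.withColumn("order_ts", F.to_timestamp("order_ts"))
       .withColumn("order_date", F.to_date("order_ts"))
       .dropDuplicates(["order_id"])
)

# Write curated data back to S3 as Parquet, partitioned by date.
(
    curated.write.mode("overwrite")
           .partitionBy("order_date")
           .parquet("s3://example-curated-bucket/orders/")
)

job.commit()
```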
Confidential
Data Engineer
Responsibilities:
- Developed ETL jobs in AWS Glue using Scala and PySpark/Python. The initiative was to migrate existing Informatica mappings and workflows to Glue jobs and to ingest data from several feeds into the data lake.
- Developed serverless applications in AWS using Lambda and Python (with the boto library) to parse JSON files and load them to S3 in Parquet format.
- Developed ETL jobs in Python for various consumers; this involved creating Python modules to parse JSON files, extract data from relational tables and flat files, and perform transformations.
- Created Python scripts to ingest data from on-premises systems into GCS and built data pipelines using Apache Beam and Dataflow to transform data from GCS into BigQuery (a pipeline sketch appears after this list).
- Developed REST APIs using Spring REST (Java) and Flask (Python) to centralize the consumption process across different downstream systems, and containerized the applications using Docker.
- As part of the Enterprise Reporting and Business Intelligence team, developed Java/Spring Batch jobs to extract data from SimCorp (Oracle DB) and published datasets to on-premises and data lake consumers.
- Enhanced an existing Spring Boot/Spring Batch application to support various business needs such as reposting transactional data, publishing files to the data lake (S3), and other change requests.
- Created a migration strategy for publishers and consumers across the organization to ingest and consume files from the data lake.
- Created Informatica workflows and mappings and integrated them with publications, topics, and subscriptions in Data Integration Hub (DIH) to publish and subscribe to data.
- Participated in a data quality initiative by developing Informatica mappings that implement rules to report anomalies in the data.
- Designed, developed, and maintained DataStage jobs and sequences to integrate various systems and publish data to the enterprise data warehouse (star schema) for reporting purposes.
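As an illustration of the GCS-to-BigQuery pattern referenced above, a hedged Apache Beam/Dataflow sketch. The project, region, bucket, table, and schema names are hypothetical placeholders, not details of the actual pipelines.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run():
    # Runner, project, region, and temp bucket are placeholders.
    options = PipelineOptions(
        runner="DataflowRunner",
        project="example-project",
        region="us-central1",
        temp_location="gs://example-bucket/tmp",
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            # Read newline-delimited JSON files landed in GCS.
            | "ReadFromGCS" >> beam.io.ReadFromText("gs://example-bucket/landing/*.json")
            # Parse each line into a dict keyed by the BigQuery column names.
            | "ParseJson" >> beam.Map(json.loads)
            # Append the rows into the target BigQuery table.
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "example-project:analytics.orders",
                schema="order_id:STRING,amount:FLOAT,order_ts:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run()
```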
Confidential
Senior DataStage and Information Analyzer Developer / Data Quality Developer / Talend Developer
Responsibilities:
- Participated in the development of the ETL framework, which involved: extracting and transforming data from various sources (DB2, Hive); preparing files for Information Analyzer to run data quality rules; capturing invalid data after the data quality check; updating the dimension tables and loading the fact tables; and aggregating data from the fact tables and loading the aggregated data to DataMart tables.
- Performed data profiling on the source data using Information Analyzer to help the business analyst gather data quality requirements for the development team.
- Developed data quality definitions and data quality rules using Information Analyzer to capture invalid data from the source.
- Developed DataStage jobs and sequences to perform aggregation and load the aggregated data to DataMart tables; the Cognos team used this data to generate reports.
- Worked in an agile environment with three-week sprints, which involved close interaction with the business team, business analysts, and the QA team.
- Created a batch process using Talend to connect to Hive, perform transformations according to specifications, and load data to HDFS.
Confidential
Senior ETL Developer - Informatica, SSIS 2012/2014
Responsibilities:
- Implemented code reusability by creating mapplets wherever applicable, which helped reduce development effort and mapping complexity.
- Implemented SCD1 and SCD2 techniques using Informatica mappings and workflows (the SCD2 pattern is sketched after this list).
- Handled large volumes of data by implementing appropriate partitioning techniques in Informatica mappings/workflows.
- As part of an ETL conversion project, migrated some Informatica mappings and workflows to SSIS packages, and reconciled pre- and post-migration results to ensure 100% accuracy.
- Wrote complex T-SQL procedures for some of the business requirements and executed them from SSIS packages.
- Wrote Unix shell scripts for various needs such as file validation, file archival, and file manipulation.
- Created workflows using Workflow Manager and scheduled them in the Zena scheduling tool.
- Decreased batch execution time by adding indexes and updating statistics, and improved query execution time by analyzing execution plans.
- The project was delivered through agile methodology; prepared a detailed ETL plan for each sprint and actively participated in scrum meetings.
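The SCD2 work above was built in Informatica mappings and workflows; purely as a conceptual illustration of the pattern (not the Informatica implementation), here is a hedged PySpark sketch with hypothetical table, key, and attribute names.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("scd2_sketch").getOrCreate()

# Current dimension rows and the incoming extract (table/column names are hypothetical).
dim = spark.table("dw.customer_dim").where("is_current = 1")
incoming = spark.table("stg.customer_extract")

# Keys whose tracked attributes have changed need a new version.
changed = (
    incoming.alias("src")
    .join(dim.alias("tgt"), F.col("src.customer_id") == F.col("tgt.customer_id"))
    .where("src.address <> tgt.address OR src.segment <> tgt.segment")
)
changed_keys = changed.select(F.col("src.customer_id").alias("customer_id"))

# 1) Expire the current versions of the changed keys.
expired = (
    changed.select("tgt.*")
    .withColumn("end_date", F.current_date())
    .withColumn("is_current", F.lit(0))
)

# 2) Build new versions for the changed keys plus rows for brand-new keys.
new_rows = (
    incoming.join(changed_keys, "customer_id", "left_semi")
    .unionByName(incoming.join(dim, "customer_id", "left_anti"))
    .withColumn("effective_date", F.current_date())
    .withColumn("end_date", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(1))
)

# The final load would apply `expired` as updates and append `new_rows`
# to dw.customer_dim; unchanged rows are left untouched.
```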
Confidential
Senior DataStage Developer
Responsibilities:
- Project: data conversion from PeopleSoft 8.9 to PeopleSoft 9.2. Designed, developed, and enhanced the ETL framework for this conversion.
- Redesigned some of the existing processes to improve turnaround time for data refreshes.
- Delivered code through agile methodology. Participated in scrum meetings to define goals for each sprint.
- Coordinated with business analysts to understand the requirements and fill any gaps.
- Developed Oracle PL/SQL stored procedures for various needs such as audit control and error logging.
- Created UNIX shell scripts to format and validate source files, back up DataStage jobs and sequences, launch jobs, etc.
- Defined documentation standards for technical specifications, unit test cases, operations manuals, etc.
Confidential
ETL Developer/Team Lead - DataStage, SSIS 2012/2014
Responsibilities:
- Designed and architected DataStage applications to extract data from various sources such as relational databases, delimited and fixed-width files, and XML; performed transformations of varying complexity; and loaded the results to DataMart tables.
- Redesigned some of the existing applications for better performance and faster batch completion times by refactoring DataStage jobs and sequences. For example, the conventional approach was that the staging-to-DataMart process could start only after all landing-to-staging processes had completed; proposed and implemented a process in which staging-to-DataMart runs in parallel with landing-to-staging.
- Extensively worked on T-SQL stored procedures to create some of the landing-to-staging and data enrichment processes.
- Migrated DataStage jobs and sequences from version 7.5 to version 8.5 for 10+ applications.
- As part of the market data team, created an ETL framework using SSIS 2012 packages to extract market data from sources such as flat files and Oracle, perform transformations (using the majority of the transformations available in SSIS), and load the results to the DataMart (SQL Server 2012).
- Implemented CDC in SSIS to perform incremental loads, updates, and deletes.
- Created SSIS packages to perform DB maintenance tasks such as updating statistics, rebuilding and reorganizing indexes, shrinking databases etc.
- Worked with the infrastructure team on NAS mount and server migration requests.
- Developed shell scripts to copy files from various source systems, and scheduled DataStage jobs and file copy scripts in TIDAL.
- Developed shell scripts for file validation, file archival, job launch, and batch reports.
- Worked with business analysts and stakeholders to onboard new source systems, which involved rigorous discussions to gather and clarify requirements.
- Version-controlled DataStage jobs/sequences, shell scripts, and database scripts (DDL, DML, stored procedures, functions) through Jenkins, coordinating with the change management team.
- Coordinated with the production support team on various post-deployment tasks such as PIV and operations manual updates.