Data Engineer Resume
Alexandria, VA
SUMMARY
- 8+ years of hands-on experience in Data Science and Analytics, including BigQuery, SQL, data collection, data warehousing, data cleaning, featurization, feature engineering, data mining, machine learning, and statistical analysis with large structured datasets.
- Experience using Amazon Web Services (AWS) cloud services including EC2, S3, AWS Lambda, and EMR; used Redshift for data migration.
- Experience in creating complex data pipeline processes using T-SQL scripts, SSIS packages, Alteryx workflows, PL/SQL scripts, cloud REST APIs, Python scripts, GCP Composer, and GCP Dataflow.
- Experience in building ETL systems using Python and the in-memory computing framework Apache Spark, and in scheduling and maintaining data pipelines at regular intervals with Apache Airflow (a minimal PySpark sketch follows this summary).
- Experience in analyzing data using Spark SQL, HiveQL, Pig Latin, Spark/Scala, and custom MapReduce programs in Java.
- Experience in creating pipelines, data flows, and complex data transformations and manipulations using Azure Data Factory (ADF) and PySpark with Databricks.
- Experience using CI/CD techniques and DevOps processes for Git repository code promotion.
- Experienced with version control systems such as Git and GitHub to keep code versions and configurations organized.
- Experience with structured (MySQL, Oracle SQL, PostgreSQL) and unstructured (NoSQL) databases and a strong understanding of relational databases. Familiar with cross-platform ETL using Python/Java SQL connectors and PySpark DataFrames.
- Expertise in data extraction, transformation, and loading (ETL) between different systems using SQL tools (SSIS, DTS, Bulk Insert, and BCP).
- Experience in designing, modeling, performance tuning, and implementing data extraction, transformation, and loading processes with the Pentaho Data Integration (PDI) ETL tool; designed end-to-end ETL processes to support reporting requirements, including aggregates, summary tables, and materialized views.
- Extensive experience in installing and configuring Pentaho BI Server for ETL and reporting purposes.
- Experience using design patterns such as MVC and Singleton, and frameworks such as Django.
- Knowledge of Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud Dataproc, Cloud Pub/Sub, Cloud SQL, BigQuery, Stackdriver Monitoring, and Cloud Deployment Manager.
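The summary above references building ETL jobs with Python, Spark SQL, and the PySpark DataFrame API; the snippet below is a minimal, hedged sketch of that pattern. The bucket paths, table, and column names are illustrative placeholders, not artifacts from any project listed in this resume.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal batch ETL sketch: read raw files, clean and aggregate with Spark SQL,
# then write a partitioned Parquet output. All paths and names are hypothetical.
spark = SparkSession.builder.appName("orders_daily_etl").getOrCreate()

# Extract: raw CSV files landed by an upstream ingestion step (placeholder path).
raw = (spark.read
       .option("header", True)
       .option("inferSchema", True)
       .csv("s3a://example-raw-bucket/orders/"))

# Transform: drop incomplete rows, normalize types, derive a partition column.
clean = (raw
         .dropna(subset=["order_id", "order_ts"])
         .withColumn("order_ts", F.to_timestamp("order_ts"))
         .withColumn("order_date", F.to_date("order_ts"))
         .withColumn("amount", F.col("amount").cast("double")))

# A Spark SQL step, as referenced in the summary bullets.
clean.createOrReplaceTempView("orders_clean")
daily = spark.sql("""
    SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS total_amount
    FROM orders_clean
    GROUP BY order_date
""")

# Load: write the curated output partitioned by date (placeholder destination).
daily.write.mode("overwrite").partitionBy("order_date").parquet(
    "s3a://example-curated-bucket/orders_daily/")

spark.stop()
```

An Airflow DAG like the one sketched later in this resume would typically invoke such a job on a fixed schedule, e.g. via spark-submit.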
PROFESSIONAL EXPERIENCE
Data Engineer
Confidential, Alexandria, VA
Responsibilities:
- Worked on complex SQL queries and PL/SQL procedures and converted them to ETL tasks.
- Built data pipelines to move data from source to destination, scheduled with Airflow.
- Developed the BIXExtract application in Python to ingest Pega (complaint system) files into HDFS and configured Airflow DAGs to orchestrate the ETL workflow.
- Involved in Agile Development process (Scrum and Sprint planning).
- Involved in various business sectors, with in-depth knowledge of the SDLC (System Development Life Cycle) across all phases of Agile Scrum and Waterfall.
- Developed MapReduce jobs in Java to process large data sets by fitting the problem into the MapReduce programming paradigm.
- Developed Spark scripts using Java and Python shell commands as required.
- Worked with CI/CD tools such as Jenkins and version control tools such as Git and Bitbucket.
- Worked with source control tools such as TortoiseSVN, CVS, IBM ClearCase, Perforce, and Git.
- Created pipelines, data flows and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Used RUP and Agile methodologies for new development and software maintenance.
- Developed central and local Flume frameworks for loading large log files into the data lake.
- Designed and implemented distributed systems with Apache Spark and Python/Scala.
- Created Python/SQL scripts in Databricks notebooks to move data from Redshift tables into Snowflake via S3 buckets.
- Worked with Reporting developers to oversee the implementation of report/universe designs.
- Created visualizations of KPIs and critical financial metrics using Domo and Python.
- Worked on designing and implementing complex applications and distributed systems in public cloud infrastructure (AWS, GCP, Azure, etc.).
- Designed workflows using Airflow to automate the services developed for Change data capture.
- Created Power BI, SSRS, Tableau, and Domo reports based on the format specified in the design document.
- Built code for real-time data ingestion using Java, MapR Streams (Kafka), and Storm.
- Used the Eclipse IDE to develop Spark Java code to insert data into HBase.
- Responsible for creating data pipeline flows, scheduling jobs programmatically as DAGs in the Airflow workflow engine, and supporting the scheduled jobs (see the DAG sketch after this list).
- Worked with Informatica Cloud for data integration between Salesforce, RightNow, Eloqua, and web services applications.
- Involved in modeling datasets from a variety of data sources such as Hadoop (using Pig, Hive, and Spark), Teradata, and Snowflake for ad hoc analysis; fair understanding of Agile methodology and practice.
- Generated SQL scripts using Python to extract structured and unstructured data from various platforms: Teradata, Redshift, Snowflake, and Databricks.
- Built, maintained, and tested infrastructure to aggregate critical business data into Google Cloud Platform (GCP) BigQuery and Cloud Storage for analysis.
- Designed, implemented, and administered multiple public cloud environments (AWS and GCP).
- Worked with Jenkins CI for CI/CD and Git version control.
- Designed and developed data flow solutions using NiFi to transfer data to HDFS in the data lake.
- Created numerous pipelines in Azure using Azure Data Factory v2 to ingest data from disparate source systems, using activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.
- Worked on the automation, setup, and administration of CI/CD build and deployment tools such as Jenkins, integrated with build automation tools such as Ant, Maven, and Gradle and with Bamboo, Jira, and Bitbucket to produce deployable artifacts.
- Worked on all phases of the data integration development lifecycle, real-time/batch data pipeline design and implementation, and support of the WU Digital Big Data ETL & Reporting track.
- Managed GitLab and Bitbucket accounts, providing access to developers and storing source code.
- Wrote SQL queries to identify and validate data inconsistencies in the data warehouse against source systems.
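Several bullets in this role mention orchestrating ETL steps as Airflow DAGs (see the forward reference above); the snippet below is a minimal sketch of such a DAG, assuming Airflow 2.x. The schedule, task IDs, and script paths are hypothetical placeholders rather than the actual production workflow.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-engineering",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

# Hypothetical daily ETL DAG: land source files in HDFS, run a Spark transform,
# then validate the load. Task ordering is expressed with the >> operator.
with DAG(
    dag_id="example_ingest_etl",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 2 * * *",  # run daily at 02:00
    catchup=False,
) as dag:

    ingest_to_hdfs = BashOperator(
        task_id="ingest_to_hdfs",
        bash_command="python /opt/etl/ingest_files.py --target hdfs:///raw/source/",
    )

    spark_transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/etl/transform_job.py --run-date {{ ds }}",
    )

    validate_load = BashOperator(
        task_id="validate_load",
        bash_command="python /opt/etl/validate_counts.py --run-date {{ ds }}",
    )

    ingest_to_hdfs >> spark_transform >> validate_load
```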
Environment: SOAP, REST APIs, SQL, Azure, ETL, cloud, UNIX, PL/SQL, CI/CD, Matplotlib, PyHive, Keras, Java, NoSQL (HBase), Sqoop, Pig, MapReduce, Oozie, Spark MLlib.
Data Engineer
Confidential - LA
Responsibilities:
- Designed and developed batch processing solutions using Azure Data Factory and Azure Databricks.
- Designed, developed and implemented solutions with data warehouse, ETL, data analysis and BI reporting technologies.
- Identified, evaluated, and documented potential data sources in support of project requirements within the assigned departments, per Agile methodology.
- Created Python/SQL scripts in Databricks notebooks to move data from Redshift tables into Snowflake via S3 buckets.
- Extensively worked on Data Services for migrating data from one database to another database.
- Implemented various performance optimization techniques such as caching and pushing down memory-intensive operations to the database server.
- Developed customized UDFs in Java to extend Hive and Pig Latin functionality.
- Ingested data from RDBMS sources, performed data transformations, and exported the transformed data to Cassandra per business requirements; accessed Cassandra through Java services.
- Followed Agile development methodology as an active member of Scrum meetings.
- Involved in continuous integration and deployment (CI/CD) using DevOps tools such as Looper and Concord.
- Designed a workflow using Airflow to automate the jobs.
- Implemented a CI/CD pipeline with Jenkins, GitHub, Nexus, Maven, and AWS AMIs.
- Created several Databricks Spark jobs with PySpark to perform table-to-table operations (a minimal sketch appears after this list).
- Designed and implemented a test environment on AWS.
- Created S3 buckets and managed their policies, and utilized S3 and Glacier for storage and backup on AWS.
- Followed Agile and Scrum principles in development.
- Involved in migrating existing on-premises Hive code to Google Cloud Platform (GCP) BigQuery.
- Implemented both ETL and ELT architectures in Azure using Data Factory, Databricks, SQL DB and SQL Data warehouse.
- Built a data pipeline and data applications to analyze email marketing campaigns using PowerShell, Azure SQL, and Power BI.
- Built dashboards using Domo and Tableau covering various business and operational views of guest emails, giving management better insight.
- Supported current data processing and compliance initiative by creating technical and summary documentation.
- Participated in daily standups, bi-weekly scrums, and PI planning; the New Management Services organization is SAFe (Agile) certified.
- Involved in the design and development of a UI using ASP.NET after gathering requirements from users.
- Transferred data from AWS S3 to AWS Redshift.
- Engineered a PySpark report processing pipeline in AWS to lay the framework for migrating the existing system off Cloudera and enabling business users to create their own customer reports without support.
- Worked on developing MapReduce scripts in Python.
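One bullet above mentions Databricks PySpark jobs performing table-to-table operations; the sketch below shows that pattern in a generic form. On Databricks the `spark` session is provided by the runtime, and the database, table, and column names here are hypothetical.

```python
from pyspark.sql import functions as F

# Hypothetical table-to-table job: join a staging table to a dimension table
# and overwrite a curated fact table registered in the metastore.
stage_orders = spark.table("staging.orders")
dim_customers = spark.table("warehouse.dim_customer")

curated = (
    stage_orders.alias("o")
    .join(dim_customers.alias("c"),
          F.col("o.customer_id") == F.col("c.customer_id"),
          "left")
    .select(
        "o.order_id",
        "o.order_date",
        "c.customer_segment",
        F.col("o.amount").cast("decimal(18,2)").alias("amount"),
    )
)

# Overwrite the target table; on Databricks this would typically be a Delta table.
curated.write.mode("overwrite").saveAsTable("warehouse.fact_orders")
```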
Environment: Python, AWS S3, AWS Redshift, AWS Data Pipeline, Spark, CI/CD, IBM DB2, Airflow, SAP ECC, Spark ML, SQL, Agile, ELT, SQL DB, Azure SQL, AWS.
ETL Developer
Confidential - Broomfield, CO
Responsibilities:
- Gathered business requirements and prepared technical design documents, target-to-source mapping documents, and mapping specification documents.
- Extensively worked on Informatica PowerCenter.
- Parsed complex files with Informatica data transformations and loaded them into the database.
- Optimized query performance using Oracle hints, forced indexes, constraint-based loading, and a few other approaches.
- Gathered requirements from Business and documented for project development.
- Coordinated design reviews, ETL code reviews with teammates.
- Developed mappings using Informatica to load data from sources such as relational tables and sequential files into the target system.
- Created and ran Sqoop jobs with incremental loads to populate Hive external tables (a conceptual sketch of the incremental pattern appears after this list).
- Led research efforts to identify and recommend technical and operational improvements, resulting in improved reliability and efficiency in maintaining and developing the application.
- Worked on requirements gathering, source-to-target mappings, high-level and low-level designs, and the design and finalization of the end-to-end ETL process flow.
- Developed Ab Initio artifacts (graphs/jobs) for the Home Loans, Agency Finance, NBTH, and RMS projects.
- Tested scripts and graphs in the development environment. Prepared UTC/UTR for the same.
- Extensively worked with Informatica transformations.
- Performed Unit, Integration and System testing of various jobs.
- Extensively worked on UNIX shell scripting to split groups of files into smaller files and to automate file transfers.
- Worked with the AutoSys scheduler to schedule different processes.
- Performed basic and unit testing.
- Assisted in UAT Testing and provided necessary reports to the business users.
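The Sqoop bullet above refers to incremental loads, which track a check column and the last value imported (Sqoop's --check-column/--last-value bookkeeping). The Python sketch below is not the Sqoop job itself; it only illustrates that incremental high-water-mark pattern, using a hypothetical SQLite source, a placeholder table, and a placeholder bookmark file.

```python
import sqlite3  # stand-in for the real source database driver; purely illustrative

LAST_VALUE_FILE = "last_value.txt"   # hypothetical bookmark holding the high-water mark
CHECK_COLUMN = "order_id"            # hypothetical monotonically increasing key column


def read_last_value() -> int:
    """Return the last imported key, or 0 on the first run."""
    try:
        with open(LAST_VALUE_FILE) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return 0


def incremental_extract(conn: sqlite3.Connection) -> list:
    """Pull only rows newer than the stored high-water mark, then advance it."""
    last_value = read_last_value()
    cursor = conn.execute(
        f"SELECT * FROM orders WHERE {CHECK_COLUMN} > ? ORDER BY {CHECK_COLUMN}",
        (last_value,),
    )
    rows = cursor.fetchall()
    if rows:
        new_last = max(row[0] for row in rows)  # assumes the check column is first
        with open(LAST_VALUE_FILE, "w") as f:
            f.write(str(new_last))
    return rows


if __name__ == "__main__":
    connection = sqlite3.connect("source_example.db")  # placeholder source database
    new_rows = incremental_extract(connection)
    print(f"Pulled {len(new_rows)} new rows since the last load")
```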
Environment: UNIX Shell, UAT, ETL, AWS S3, AWS Redshift, AWS Data Pipeline, SAP HANA.