
Cloud Data Engineer Resume

Atlanta, GA


  • 11+ years of experience in design, development, and integration on Big Data, cloud services, and Informatica across the Insurance, Banking, Retail, and Healthcare industries.
  • 5 years of experience as a Cloud Data Engineer in the Hadoop big data ecosystem (HDFS, Hive, Spark, Databricks, Kafka, YARN) on AWS and Azure cloud services and cloud relational databases.
  • Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, transforming the data on EMR/HDInsight into customer usage patterns.
  • Hands-on with Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
  • Expertise in processing semi-structured data (CSV, Parquet, XML, and JSON) in Hive/Spark using Python.
  • Experienced in using the Hadoop ecosystem to implement batch-mode and real-time ingestion from multiple source systems such as web services, loading the data into S3/RDS.
  • Good understanding of Hadoop and YARN architecture along with the various Hadoop daemons such as JobTracker, TaskTracker, NameNode, DataNode, Resource/Cluster Manager, and Kafka (distributed stream processing).
  • Migrated legacy Informatica ETL logic into Python, Spark, and Databricks and loaded the resulting datasets into cloud storage (S3/Blob) and relational databases.
  • Migrated a SQL Server database into a multi-cluster Snowflake environment, set up data sharing across multiple applications, and created Snowflake virtual warehouses sized by data volume and job load.
  • Good understanding of the MongoDB NoSQL database; stored front-end application event messages in MongoDB.
  • Informatica development, architecture, and administration with Informatica PowerCenter, Data Quality, Informatica BDM, and Informatica Intelligent Cloud Services (IICS), both on-premises and on Azure and AWS.
  • Data profiling, analysis, cleansing, address validation, fuzzy matching/merging, data conversion, and exception handling in Informatica Data Quality 10.1 (IDQ).
  • Hands-on with various AWS services including EC2, EMR clusters, Redshift, Databricks, S3 buckets, AWS Kinesis, and IaaS/PaaS/SaaS models.
  • Hands-on with various Azure services including Azure Virtual Machines, Blob Storage, Data Lake, Data Factory, Azure SQL, PostgreSQL, and HDInsight.
  • Virtualized servers with Docker for test/dev environments and automated configuration using Docker containers.
  • Configuration management and source-code repository management using tools such as Git and TFS.
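The semi-structured processing described above (CSV/JSON in Python) typically starts by flattening nested JSON events into tabular rows before loading. A minimal standard-library sketch, with illustrative field names:

```python
import csv
import io
import json

def flatten(record, prefix=""):
    """Flatten a nested JSON object into a single-level dict with
    dot-separated keys, e.g. {"user": {"id": 1}} -> {"user.id": 1}."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

def json_lines_to_csv(lines):
    """Convert JSON-lines event data into CSV text ready for staging."""
    rows = [flatten(json.loads(line)) for line in lines]
    fieldnames = sorted({key for row in rows for key in row})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical clickstream events as they might arrive from a web service.
events = [
    '{"event": "click", "user": {"id": 7, "region": "GA"}}',
    '{"event": "view", "user": {"id": 9, "region": "FL"}}',
]
print(json_lines_to_csv(events))
```

In a real pipeline the resulting CSV would be staged to S3 or Blob Storage for downstream Hive/Spark processing.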


Big Data: Hadoop HDFS, Hive, Spark 2.x, Python 3.x, Kafka, CDH 5.x, Databricks, StreamSets

Cloud Services: AWS - EC2, S3, EMR, Kinesis, AMI, CloudWatch, Docker, Redshift, VPC, iPaaS. Azure - Azure VM, Blob Storage, Data Lake, Data Factory, HDInsight, Active Directory, Docker.

ETL: Informatica 10.x/9.x/8.x, PowerCenter, Data Quality, IICS, BDM, MDM.

Database: SQL Server 2016/2012, Azure SQL, Snowflake, MongoDB, RDS, Oracle 11g/10g

OLAP Tools: OBIEE 11g/10g, Business Objects, Tableau

Operating Systems/Tools: Red Hat 7/6, Amazon Linux, Ubuntu, Windows Server 2016/2012, Jupyter Notebook


Confidential, Atlanta, GA

Cloud Data Engineer


  • Analyze, design, and build modern, scalable distributed data solutions using Hadoop and AWS cloud services.
  • Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data, analyzing them with Hive queries and Spark SQL in Python.
  • Loaded data into Spark RDDs and used in-memory computation to generate output, storing the resulting datasets in HDFS, Amazon S3, and relational databases.
  • Migrated legacy Informatica batch and real-time ETL logic into Hadoop using Python, SparkContext, Spark SQL, DataFrames, and pair RDDs in Databricks.
  • Experienced in handling large datasets using partitioning, Spark in-memory capabilities, broadcast variables, and efficient joins and transformations during the ingestion process itself.
  • Tuned Spark applications by setting the right batch interval, the correct level of parallelism, and appropriate memory settings.
  • Implemented near-real-time data processing using StreamSets and the Spark/Databricks framework.
  • Implemented best practices to secure and manage data in AWS S3 buckets and used a custom Spark framework to load data from S3 into Redshift.
  • Hands-on with Amazon EC2, S3, Redshift, EMR, RDS, ELB, CloudFormation, and other services in the AWS family.
  • Stored Spark datasets in Snowflake relational databases for analytics reporting.
  • Migrated a SQL Server database into a multi-cluster Snowflake environment, set up data sharing across multiple applications, and created Snowflake virtual warehouses sized by data volume and job load.
  • Developed Apache Spark jobs in Python in the test environment for faster data processing and used Spark SQL for querying.
  • Used Hadoop/Spark Docker containers to validate data loads in test/dev environments.
  • Managed jobs with the Fair Scheduler and developed job-processing scripts using Oozie workflows.
  • Ran an Informatica Intelligent Cloud Services (IICS) pilot project on AWS.
  • Prepared data-pipeline and data-migration documents for smooth handover from development to testing and then to production.
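The S3-to-Redshift load mentioned above commonly follows the pattern of staging files in S3 and then issuing a Redshift COPY. A minimal sketch of building that statement; the table, bucket, and IAM role names are hypothetical:

```python
def build_redshift_copy(table, s3_prefix, iam_role, file_format="PARQUET"):
    """Build a Redshift COPY statement for data staged in S3.
    All identifiers passed in here are illustrative, not real resources."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS {file_format};"
    )

sql = build_redshift_copy(
    table="analytics.usage_patterns",                        # hypothetical table
    s3_prefix="s3://example-bucket/curated/usage/",          # hypothetical bucket
    iam_role="arn:aws:iam::123456789012:role/RedshiftCopy",  # hypothetical role
)
print(sql)
# The statement would then be executed against Redshift via a DB driver.
```

Using COPY from staged Parquet keeps the load parallel across Redshift slices, which is why a Spark job typically writes to S3 first rather than inserting row by row.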

Environment: Amazon Elastic MapReduce, Spark, Hive, Python, Kafka, RDS, Informatica, SQL Server 2016, Snowflake, MongoDB, S3, Redshift, Docker, Kubernetes.

Confidential, Redmond, DC

Data Engineer/ ETL Developer


  • Created pipelines in Python using datasets/pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, Data Lake, and Azure SQL Data Warehouse.
  • Built Spark applications using Spark SQL in Databricks for data extraction and transformation from multiple file formats, transforming the data on HDInsight.
  • Developed ETL code for XML, CSV, TXT, and JSON sources, loading the data into relational tables using pandas and NumPy in Python.
  • Developed Apache Spark jobs in Python for faster data processing and used Spark SQL for querying.
  • Experience writing Spark RDD transformations, actions, DataFrames, and case classes for the required input data, performing transformations with Spark Core.
  • Ran Spark SQL queries and DataFrame operations to import data from sources, perform transformations and read/write operations, and save results to an output directory in Blob Storage.
  • Designed and implemented streaming solutions using Kafka and Azure Stream Analytics.
  • Built Power BI reports from output files in Blob Storage.
  • Migrated on-premises data (SQL Server/MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
  • Designed matching plans, helped determine the best matching algorithm, configured identity matching, and analyzed match scores using Informatica Data Quality (IDQ).
  • Performed data profiling, standardization, address validation, and matching/merging for data quality.
  • Created data governance business rules to validate data at the front-end application level in real time.
  • Designed and developed complex Informatica mappings using Lookup, Expression, Update Strategy, Sequence Generator, Aggregator, Router, Stored Procedure, and other transformations.
  • Upgraded Informatica PowerCenter and Data Quality from 9.5 to 9.6 HF2 and from 9.6 to 10.1 for various applications.
  • Set up and configured the PowerCenter domain, grid, and services; applied hotfixes and EBFs across Informatica products; and provided day-to-day production support.
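The pandas/NumPy ETL described above (JSON and CSV sources into relational tables) usually flattens nested records and casts types before the load step. A minimal sketch with made-up field names; the `to_sql` target is hypothetical:

```python
import pandas as pd

# Hypothetical raw order payloads as they might arrive from a JSON source.
raw = [
    {"id": 1, "amount": "12.50", "customer": {"name": "A", "state": "WA"}},
    {"id": 2, "amount": "7.25",  "customer": {"name": "B", "state": "GA"}},
]

# Flatten nested JSON into tabular columns, then cast types to match
# the target relational table's schema.
df = pd.json_normalize(raw)
df["amount"] = pd.to_numeric(df["amount"])
df = df.rename(columns={"customer.name": "customer_name",
                        "customer.state": "customer_state"})

# Load step (SQLAlchemy engine and table name are hypothetical):
# df.to_sql("orders", engine, if_exists="append", index=False)
print(df)
```

Doing the type casts in pandas before the load surfaces bad records early, instead of failing inside the database insert.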

Environment: Azure HDInsight, Spark, Hive, Python, Kafka, RDS, Informatica Data Quality/PowerCenter, SQL Server 2016, MongoDB, Blob Storage, Data Lake, Data Factory, Docker.

Confidential, FL

Sr ETL Developer


  • Responsible for developing, supporting, and maintaining ETL (Extract, Transform, Load) processes using the Informatica integration suite.
  • Extensive experience developing complex Informatica mappings to load data from various sources using transformations such as Source Qualifier, Lookup, Expression, and Update Strategy.
  • Worked with Informatica Data Quality 10.0 (IDQ) on analysis, data cleansing, fuzzy matching, data conversion, and exception handling.
  • Designed and developed transformation rules (business rules) to generate consolidated (fact/summary) data using the Informatica ETL tool.
  • Deployed reusable transformation objects such as mapplets to avoid duplicating metadata and reduce development time.
  • Extracted, cleansed, aggregated, transformed, and validated data to ensure accuracy and consistency.
  • Experience with advanced Informatica techniques such as dynamic caching, memory management, and parallel processing to increase throughput.
  • Extensively involved in optimizing and tuning Informatica mappings and sessions by identifying and eliminating bottlenecks, managing memory, and parallelizing threads.
  • Developed Informatica workflows and sessions for the mappings using Workflow Manager; developed reusable mapplets, mappings, and source/target definitions.
  • Prepared all DB scripts and Informatica objects for implementation in the production environment.

Environment: Informatica 9.6/9.1, PowerCenter, Data Quality, Amazon Cloud, Oracle, SQL Server 2016/2014, Windows Server 2016/2014.

Confidential, WI

Informatica Developer/ Administrator


  • Developed ETL programs using Informatica to implement business requirements.
  • Hands-on in all phases of the SDLC: requirements gathering, design, development, testing, production deployment, user training, and production support.
  • Modified Informatica mappings, transformations, sessions, and workflows in PowerCenter Designer/Workflow Manager whenever changes were requested by clients.
  • Responsible for creating workflows and sessions in Informatica Workflow Manager and monitoring workflow runs and statistics in Workflow Monitor.
  • Responsible for defining mapping parameters/variables and session parameters according to requirements and performance needs.
  • Created tasks such as Event-Wait, Event-Raise, and Email.
  • Created shell scripts to automate worklets, batch processes, and session scheduling using pmcmd.
  • Responsible for design and implementation of the Informatica 9.x platform while continuing to support the existing 8.x platform.
  • Upgraded Informatica from 8.x to 9.x, set up PowerCenter disaster recovery, installed Informatica hotfixes and EBFs (emergency bug fixes) on servers, and applied Windows/Linux security patches monthly.
  • Configured Active Directory LDAP in the Admin Console to authenticate and authorize developers and business users.
  • Configured Informatica Data Quality (IDQ) components such as the Model Repository Service, Data Integration Service (DIS), Content Management Service, web services, and Business Glossary.
  • Performed system-level health checks (CPU, memory utilization, number of parallel sessions per node) and provided capacity-planning recommendations (disk space, memory, CPU).
  • Created deployment groups and scripts to migrate code from lower to higher environments.
  • Worked extensively on automation scripts (auto-restart of services, disk-space monitoring, log-directory cleanup) using Informatica command-line utilities.
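The pmcmd automation described above typically wraps `pmcmd startworkflow` calls in scripts. A minimal sketch that builds the argument list; the service, domain, folder, and workflow names are illustrative:

```python
def pmcmd_startworkflow(service, domain, user, password_env,
                        folder, workflow, wait=True):
    """Build the argument list for a pmcmd startworkflow call.
    Uses -pv so the password is read from an environment variable
    instead of appearing on the command line."""
    cmd = ["pmcmd", "startworkflow",
           "-sv", service, "-d", domain,
           "-u", user, "-pv", password_env,
           "-f", folder]
    if wait:
        cmd.append("-wait")   # block until the workflow completes
    cmd.append(workflow)
    return cmd

cmd = pmcmd_startworkflow("IS_DEV", "Domain_Dev", "etl_user",
                          "PM_PASSWORD", "SALES", "wf_daily_load")
print(" ".join(cmd))
# In a real scheduler script: subprocess.run(cmd, check=True)
```

Passing `-wait` makes the return code reflect workflow success, which lets a scheduler chain dependent jobs on the exit status.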

Environment: Informatica 9.1/8.6 PowerCenter, PowerExchange, Metadata Manager, Oracle, SQL Server 2014/2012, DB2, Red Hat 5.6.
