Senior Data Engineer Resume
Irving, TX
PROFESSIONAL SUMMARY:
- Overall 9+ years of experience in Analysis, Design, Development, Testing, Implementation, Maintenance and Enhancements on various IT Projects.
- 6+ years of experience in Big Data, implementing end-to-end Hadoop solutions.
- Hands-on experience in installing, configuring, and using Apache Hadoop ecosystem components such as Hadoop Distributed File System (HDFS), MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, and Kafka.
- Expertise in writing Hadoop Jobs to analyze data using MapReduce, Hive, Pig and Solr.
- Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems (RDBMS) and vice-versa.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience in analyzing data using Hive QL, Pig Latin, and custom MapReduce programs in Java.
- Hands-on experience configuring Hadoop clusters in enterprise environments, on VMware, and on Amazon Web Services (AWS) EC2 instances.
- Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Experience in analyzing data using Python, R, SQL, Microsoft Excel, Hive, PySpark, Spark SQL for Data Mining, Data Cleansing, Data Munging and Machine Learning
- Experience configuring and working on AWS EMR Instances.
- Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables (see the PySpark sketch at the end of this summary).
- Proficient in using Cloudera Manager, an end-to-end tool to manage Hadoop operations
- Implemented a log producer in Scala that watches application logs, transforms incremental log entries, and sends them to a Kafka- and ZooKeeper-based log collection platform.
- Experience building Snowpipe pipelines and using Snowflake Cloning and Time Travel.
- Implemented complex data models for end-to-end machine learning deployments using TensorFlow and Keras.
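A minimal PySpark sketch of the nested-JSON flow mentioned above. The bucket paths, field names, and one-level-deep structure are assumptions; the actual Snowflake load would typically follow via COPY INTO or the Snowflake Spark connector.

```python
# Minimal PySpark sketch: flatten nested JSON landed in S3 so it can be staged
# for a Snowflake load. Bucket, path, and field names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("nested-json-flatten").getOrCreate()

# Read nested JSON records from an S3 landing area (path is an assumption).
raw = spark.read.json("s3a://example-landing-bucket/events/2023/*.json")

# Flatten one level of nesting and explode an assumed array field.
flat = (
    raw.select(
        col("id"),
        col("customer.name").alias("customer_name"),
        explode(col("orders")).alias("order"),
    )
    .select("id", "customer_name", col("order.amount").alias("order_amount"))
)

# Write the flattened data back to S3 as Parquet, ready to be copied into Snowflake.
flat.write.mode("overwrite").parquet("s3a://example-staging-bucket/orders_flat/")
```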
TECHNICAL SKILLS:
Big Data: HDFS, MapReduce, Pig, Hive, Kafka, Sqoop, Spark Streaming, Spark SQL, Oozie, Zookeeper.
Databases: MySQL Server, Oracle DB, HiveQL, Spark SQL, PostgreSQL, HBase, MongoDB, DynamoDB, S3
Cloud: AWS, Azure, GCP
IDE Tools: PyCharm, IntelliJ IDEA, Databricks, Anaconda.
Programming languages: Java, Linux shell scripts, Scala, Python (NumPy, TensorFlow, Keras, Pandas, etc.).
ETL Tools: Informatica PowerCenter, Talend, AWS Glue, Azure Data Factory
Machine Learning: AWS SageMaker, Azure ML Studio
Data warehouse: Snowflake, Teradata, EDW
Web Technologies: XML, JavaScript.
WORK EXPERIENCE:
Confidential
Senior Data Engineer
Responsibilities:
- Responsible for designing and implementing an end-to-end data pipeline using Big Data tools including HDFS, Hive, Sqoop, HBase, Kafka, and Spark.
- Extracted, parsed, cleaned, and ingested incoming web-feed data and server logs into HDFS, handling both structured and unstructured data.
- Imported millions of structured records from relational databases using Sqoop, processed them with Spark, and stored the data in HDFS in CSV format (see the PySpark sketch at the end of this role).
- Used Sqoop to extract and load both incremental and non-incremental data from RDBMS systems into Hadoop.
- Scheduled tasks in Airflow by writing Python DAG scripts (see the Airflow sketch at the end of this role).
- Developed Spark applications for data transformation and loading into HDFS using RDDs, DataFrames, and Datasets.
- Extensively used Hive optimization techniques like partitioning, bucketing, Map Join and parallel execution.
- Created DataFrames using Spark SQL and loaded the data into a NoSQL database.
- Worked with Spark SQL and created a data warehouse accessible from both Spark and Hive.
- Experience implementing hybrid connectivity between Azure and on-premises environments using virtual networks, VPN, and ExpressRoute.
- Planned and developed roadmaps and deliverables to advance the migration of existing on-premises systems/applications to the Azure cloud.
- Implemented ETL and data movement solutions using Azure Data Factory (ADF V2) and SSIS, creating and running SSIS packages from ADF.
- Deployed Azure Resource Manager JSON templates from PowerShell; worked across the Azure suite: Azure SQL Database, Azure Data Lake, Azure Data Factory, Azure SQL Data Warehouse, and Azure Analysis Services.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and Azure Data Lake Analytics; ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Developed ETL jobs in Informatica PowerCenter to automate real-time data retrieval from Salesforce.com and recommended the best methods for data replication from Salesforce.com.
- Implemented an ETL framework providing features such as Master Data Management, ETL restart capability, a security model, and version control using Informatica PowerCenter.
- Developed and modified complex Informatica mappings using BDM/DEI, PowerCenter, and IICS for business-rule integration and changes.
- Designed and implemented data Pipelines using Informatica PowerCenter, BDM/DEI, IICS (Informatica Intelligent Cloud Services)
- Utilized Kubernetes and Docker for the runtime environment for the CI/CD system to build, test, and deploy.
- Worked on SQL and database optimization for MySQL, PostgreSQL, and SQL Server.
- Integrated the data pipeline using Python with Git and automated the process using Apache Airflow.
- Integrated Informatica PowerCenter with the Snowflake data store so the BI teams could leverage the business use cases.
- Designed and developed data integration solutions using ETL tools such as Informatica PowerCenter, including sorting in Snowflake multi-cluster warehouses.
- Worked on the Kafka REST API to collect and load data onto the Hadoop file system, and used Sqoop to load data from relational databases.
- Involved in file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
- Analyzed data from multiple sources and created reports with interactive dashboards using Power BI.
Environment: Hadoop, Spark, Spark SQL, PySpark, Azure Databricks, Azure Data Factory (ADF), PostgreSQL, Git, Azure SQL Server, Kafka, Hive, HBase, UNIX Shell Scripting, Airflow, Power BI, Python, Kubernetes
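A minimal PySpark sketch of the Sqoop-to-HDFS batch flow referenced in this role; paths, schema, and filter values are hypothetical.

```python
# Minimal PySpark sketch: read CSV files that Sqoop landed in HDFS, apply
# DataFrame transformations, and write curated Parquet back to HDFS.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("sqoop-landing-to-parquet").getOrCreate()

# CSV files imported by Sqoop (header option and path are assumptions).
orders = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///data/landing/orders/")
)

# Example transformation: type the date column and keep completed orders only.
curated = (
    orders
    .withColumn("order_date", to_date(col("order_date"), "yyyy-MM-dd"))
    .filter(col("status") == "COMPLETED")
)

# Store the curated data as partitioned Parquet in HDFS.
curated.write.mode("overwrite").partitionBy("order_date").parquet(
    "hdfs:///data/curated/orders/"
)
```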
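A minimal Airflow sketch of the scheduling referenced in this role: a daily DAG that stages files with a Python task and then launches a spark-submit step. The DAG id, task names, and spark-submit command are assumptions.

```python
# Minimal Airflow 2.x DAG sketch: stage source files, then run a Spark job.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def pull_source_files(**context):
    # Placeholder for the Python logic that stages source files for the run.
    print(f"Staging files for {context['ds']}")


with DAG(
    dag_id="daily_ingest_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stage = PythonOperator(
        task_id="stage_source_files",
        python_callable=pull_source_files,
    )

    transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/jobs/transform_orders.py {{ ds }}",
    )

    stage >> transform
```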
Confidential, Irving, TX
Sr. Data Engineer
Responsibilities:
- Designed solution for Streaming data applications using Apache Storm.
- Extensively worked on Kafka and Storm integration to score the PMML (Predictive Model Markup Language) Models.
- Designed several DAGs (Directed Acyclic Graphs) for automating ETL pipelines.
- Created and monitored jobs using Apache Airflow and Oozie
- Wrote extensive Hive queries for data analysis to meet business requirements.
- Worked on the Kafka REST API to collect and load data onto the Hadoop file system, and used Sqoop to load data from relational databases.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS (see the streaming sketch at the end of this role).
- Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka and persisted it into HDFS.
- Developed ETL jobs using Informatica PowerCenter to automate transactional data flows from the request step through the approval process and service-fee calculation for weekly mortgage payments.
- Responsible for analyzing and reconciling Spark SQL query results against Hive queries.
- Involved in the requirements and design phases for implementing real-time streaming using Kafka and Storm.
- Strong experience with Informatica Designer, Workflow Manager, Workflow Monitor, Repository Manager.
- Implemented rapid provisioning and lifecycle management for Amazon EC2 using custom Bash scripts.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift (see the Glue sketch at the end of this role).
- Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
- Experience in developing/consuming Web Services (REST, SOAP, JSON) and APIs (Service-oriented architectures).
- Utilized Kubernetes and Docker for the runtime environment for the CI/CD system to build, test, and deploy.
- Created data partitions on large data sets in S3 and DDL on partitioned data.
- Developed programs in UNIX shell to validate the data after ingesting it into the data lake.
Environment: Hadoop, Python, Spark, PySpark, AWS EC2, Spark SQL, AWS Glue, AWS Lambda, Git, Airflow, AWS S3, AWS EMR, Kafka, Hive, HBase, Maven, UNIX Shell Scripting, Kubernetes, Cassandra, Power BI, REST API.
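An illustrative sketch of the Kafka-to-HDFS flow in this role, written with Spark Structured Streaming (the DataFrame-based successor to the RDD/DStream API the bullets describe). Broker, topic, schema, and paths are assumptions, and the job needs the spark-sql-kafka connector package on the Spark classpath.

```python
# Illustrative Structured Streaming sketch: consume JSON events from Kafka,
# parse them, and persist them to HDFS as Parquet with checkpointing.
# Requires the spark-sql-kafka-0-10 connector package.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

# Assumed JSON payload schema for the Kafka messages.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("amount", DoubleType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "payments")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Persist the parsed stream to HDFS as Parquet.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streams/payments/")
    .option("checkpointLocation", "hdfs:///checkpoints/payments/")
    .start()
)
query.awaitTermination()
```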
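A hedged sketch of an AWS Glue PySpark job like the one described in this role: read campaign files from S3 as a DynamicFrame and load them into Redshift through a Glue connection. Bucket, connection name, target table, and temp directory are hypothetical.

```python
# Hedged AWS Glue job sketch: S3 (Parquet) -> DynamicFrame -> Redshift.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Campaign data landed in S3 as Parquet (path is an assumption).
campaigns = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-campaign-bucket/landing/"]},
    format="parquet",
)

# Write to Redshift via a pre-defined Glue connection; Glue stages the data
# through the temporary S3 directory before loading the target table.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=campaigns,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "analytics.campaigns", "database": "dev"},
    redshift_tmp_dir="s3://example-temp-bucket/glue/",
)

job.commit()
```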
Confidential
Data Engineer
Responsibilities:
- Used Sqoop extensively to import data from RDBMS sources into HDFS. Performed transformations, cleaning, and filtering on the imported data using Hive and MapReduce, and loaded the final data into HDFS.
- Developed Pig UDFs to pre-process data for analysis.
- Implemented test cases for Spark and Ignite functions using Scala as language.
- Understanding of Snowflake cloud technology.
- Developed Spark streaming application to pull data from cloud to hive table.
- Hands-on experience with Apache Spark using Scala; implemented a Spark solution to enable real-time reporting from Cassandra data. Automated data retrievals, data loads, the validation framework, and ETL restart on the Star Wars digital data hub.
- Extensive expertise using the core Spark APIs and processing data on an EMR cluster
- Worked on ETL migration by developing and deploying AWS Lambda functions to build a serverless data pipeline whose output is registered in the Glue Catalog and can be queried from Athena (see the Lambda sketch at the end of this role).
- Worked on an ETL pipeline to source the required tables and deliver the calculated ratio data from AWS to a data mart (SQL Server).
- Experience in using and tuning relational databases (e.g. Microsoft SQL Server, Oracle, MySQL) and columnar databases (e.g. Amazon Redshift, Microsoft SQL Data Warehouse)
- Built ETL restart capability for a given date or date range, from the point of failure, or from the beginning.
- Created ETL scripts for ad-hoc requests to retrieve data from analytics sites.
- Performed an Informatica Intelligent Cloud Services (IICS) pilot project on Amazon cloud services.
- Used Sqoop to import data into Cassandra tables from different relational databases such as Oracle and MySQL.
- Automated all the jobs from pulling data from databases to loading data into SQL server using shell scripts.
- Involved in running Hadoop streaming jobs to process terabytes of text data.
- Addressed code review comments, ran Jenkins builds, and supported code deployment into production. Fixed post-production defects so the MapReduce code worked as expected.
Environment: Hadoop, Spark, MapReduce, Informatica PowerCenter, Python, Jira, Hive, Sqoop, HBase, UNIX Shell Scripting, AWS Glue, AWS Athena, EMR
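A hypothetical sketch of one serverless step from this role: an AWS Lambda handler that starts an Athena query against a table registered in the Glue Catalog. Database, table, and output location are assumptions.

```python
# Hypothetical Lambda handler: run an Athena query over a Glue Catalog table.
import boto3

athena = boto3.client("athena")


def lambda_handler(event, context):
    # Kick off an Athena query; results land in the assumed S3 output location.
    response = athena.start_query_execution(
        QueryString=(
            "SELECT campaign_id, SUM(spend) AS total_spend "
            "FROM campaigns GROUP BY campaign_id"
        ),
        QueryExecutionContext={"Database": "marketing_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    return {"query_execution_id": response["QueryExecutionId"]}
```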
Confidential
Big Data Developer
Responsibilities:
- Installed and configured Hadoop MapReduce, HDFS, developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Engaged with business users to gather requirements, design visualizations, and enable them to use self-service BI tools.
- Used various sources to pull data into Power BI such as SQL Server, Excel, Oracle, SQL Azure etc.
- Propose architectures considering cost/spend in Azure and develop recommendations to right-size data infrastructure.
- Wrote MapReduce jobs to discover trends in data usage by users.
- Involved in managing and reviewing Hadoop log files.
- Designed and developed complex mappings to extract data from various sources including flat files, RDBMS tables, and legacy systems using Informatica.
- Used dynamic cache and index cache to improve the performance of the Informatica server.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data with MapReduce and Pig (see the Hadoop Streaming sketch at the end of this role).
- Wrote Pig UDFs and developed Snowpipes and snowflake schemas to reduce data redundancy.
- Developed HIVE queries for the analysts.
- Automated jobs end to end, from pulling data from sources such as MySQL and pushing result datasets to the Hadoop Distributed File System, to running MapReduce and Pig/Hive jobs using Oozie (workflow management).
- Monitored system health and logs and responded to any warning or failure conditions.
Environment: Hadoop (CDH 4), HDFS, MapReduce, Python, Hive, Java, Kafka, Pig, Sqoop, Oozie, REST Web Services, HBase, UNIX Shell Scripting.
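The MapReduce jobs in this role were written in Java; as an illustration only, the sketch below shows the same kind of log-cleaning step as a Hadoop Streaming mapper in Python. The tab-separated log layout (timestamp, user, bytes) is an assumption; a companion reducer would sum the emitted byte counts per user.

```python
#!/usr/bin/env python
# Illustrative Hadoop Streaming mapper: drop malformed log lines and emit
# user<TAB>bytes pairs for a reducer to aggregate. Field layout is assumed.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        continue  # skip malformed records
    _timestamp, user, raw_bytes = fields
    if not raw_bytes.isdigit():
        continue  # skip records with non-numeric byte counts
    # Emit key<TAB>value so the shuffle groups usage records per user.
    print(f"{user}\t{raw_bytes}")
```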
Confidential
Hadoop Developer
Responsibilities:
- Analyzed large data sets by running Hive queries and Pig scripts.
- Involved in creating Hive tables, and loading and analyzing data using hive queries.
- Developed Simple to complex MapReduce Jobs using Hive and Pig.
- Involved in running Hadoop jobs for processing millions of records of text data.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required
- Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Involved in unit testing using MRUnit for MapReduce jobs.
- Involved in loading data from LINUX file system to HDFS.
- Integrated the Snowflake data warehouse into the pipeline, with ingestion from the ETL pipeline.
- Responsible for managing data from multiple sources.
- Experienced in running Hadoop streaming jobs to process terabytes of XML format data.
- Loaded and transformed large sets of structured and semi-structured data.
- Assisted in exporting analyzed data to relational databases using Sqoop.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
Environment: Hadoop, HDFS, Pig, Python, Hive, MapReduce, HBase, Sqoop, Linux, Java