Sr AWS Data Engineer Resume
MA
SUMMARY
- Around 7 years of IT development experience, including experience in the Big Data ecosystem and related technologies.
- Experience in Modeling, Data Science, Data Architecture, Programming Analysis and Database Design of OLTP and OLAP systems, with sound knowledge of Cloud Technologies (AWS, Azure).
- Expertise in Hadoop ecosystem components such as Spark, HDFS, MapReduce, YARN, HBase, Pig, Sqoop, Flume, Oozie, Impala, Zookeeper, Hive, NiFi and Kafka for scalability, distributed computing and high-performance computing.
- Excellent understanding of Hadoop architecture, Hadoop daemons and various components such as HDFS, YARN, Resource Manager, Node Manager, Name Node, Data Node and MapReduce programming paradigm.
- Good understanding of Apache Spark, Kafka, Storm, NiFi, Talend, RabbitMQ, Elastic Search, Apache Solr, Splunk and BI tools such as Tableau.
- Knowledge of Hadoop administration activities using Cloudera Manager and Apache Ambari.
- Experience working with Cloudera, Amazon Web Services (AWS), Microsoft Azure and Hortonworks.
- Worked on import and export of data between RDBMS and HDFS using Sqoop.
- Good knowledge of containers, Docker and Kubernetes as the runtime environment for CI/CD systems to build, test and deploy.
- Docker container orchestration using ECS, ALB and Lambda, as well as ACR (Azure Container Registry), ACI (Azure Container Instances) and serverless functions.
- Created machine learning models using Python and scikit-learn (a minimal sketch appears after this summary).
- Hands-on experience in loading data (log files, XML data, JSON) into HDFS using Flume/Kafka.
- Extensive experience with PySpark, Spark Core and other Spark modules.
- Strong experience in writing scripts using the Python, PySpark and Spark APIs for analyzing data.
- Experience in dealing with data formats ORC, Parquet, JSON and CSV.
- Built ETL data pipelines using Python/MySQL/Spark/Hadoop/Hive/UDFs.
- Experience in analyzing data using HiveQL, Pig Latin, HBase, Spark, RStudio and custom MapReduce programs in Python; extended Hive and Pig core functionality by writing custom UDFs.
- Used packages like NumPy, pandas, Matplotlib and Plotly in Python for exploratory data analysis.
- Hands-on experience with AWS services such as S3, EC2, RDS, EMR, Redshift, Glue, Athena and Data Pipeline.
- Experience in working with Azure Blob Storage, Azure Synapse, Azure Data Lake, Azure Data Factory, Azure SQL, Azure SQL Data Warehouse and Azure Analytics.
- Involved in building Data Models and Dimensional Modeling with 3NF, Star and Snowflake schemas for OLAP and Operational data store (ODS) applications.
- Good knowledge in using Apache NiFi to automate the data movement between different Hadoop systems.
- Experience in performance tuning by using Partitioning, Bucketing and Indexing in Hive.
- Experienced in job workflow scheduling and monitoring tools like Airflow, Oozie, TWS, Control-M and Zookeeper.
- Experience with Software development tools such as JIRA, GIT, and SVN.
- Flexible working with operating systems like Unix/Linux (CentOS, Red Hat, Ubuntu) and Windows environments.
- Hands-on development experience with RDBMS, including writing complex SQL scripts, stored procedures and triggers.
- Experience in writing complex SQL queries involving multiple tables with inner and outer joins.
- Strong in databases like DB2, Oracle, MS SQL.
- Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS and other services of the AWS family.
- Selecting appropriate AWS services to design and deploy an application based on given requirements.
- Experienced in working with Amazon Web Services (AWS) using EC2 for computing and S3 as storage. Capable of using AWS utilities such as EMR, S3 and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
- Strong knowledge in working with Amazon EC2 to provide a complete solution for computing, query processing, and storage across a wide range of applications.
- Capable of using Amazon S3 to support data transfer over SSL, with data encrypted automatically once it is uploaded.
- Skilled in using Amazon Redshift to perform large-scale database migrations.
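The following is a minimal sketch of the kind of scikit-learn modeling workflow referenced in the summary; the input file, feature and label column names, and the choice of a logistic-regression pipeline are illustrative assumptions, not details from the actual projects.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical training data; "label" stands in for the real target column.
df = pd.read_csv("training_data.csv")
X = df.drop(columns=["label"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features and fit a simple classifier inside one pipeline.
model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```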
TECHNICAL SKILLS
Hadoop and Big Data Technologies: HDFS, MapReduce, Flume, Sqoop, Pig, Hive, Morphline, Kafka, Oozie, Spark, Nifi, Zookeeper, Elastic Search, Apache Solr, Talend, Cloudera Manager, R Studio, Confluent, Grafana
NoSQL: HBase, Couchbase, MongoDB, Cassandra
Programming and Scripting Languages: C, SQL, Python, C++, Shell scripting, R
Web Services: XML, SOAP, REST APIs
Databases: Oracle, DB2, MS-SQL Server, MySQL, MS-Access, Teradata
Web Development Technologies: JavaScript, CSS, CSS3, HTML, HTML5, Bootstrap, XHTML, jQuery, PHP
Operating Systems: Windows, Unix (Red Hat Linux, CentOS, Ubuntu), macOS
IDE Development Tools: Eclipse, NetBeans, IntelliJ, RStudio
Build Tools: Maven, Scala Build Tool (SBT), Ant
PROFESSIONAL EXPERIENCE
Confidential, MA
Sr AWS Data Engineer
Responsibilities:
- Extensive experience in working with the AWS cloud platform (EC2, S3, EMR, Redshift, Lambda and Glue). Working knowledge of Spark RDDs, the DataFrame API, Dataset API, Data Source API, Spark SQL and Spark Streaming.
- Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve performance and optimize the existing algorithms in Hadoop, using Spark Context, Spark SQL, Spark MLlib, DataFrames, Pair RDDs and Spark on YARN.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model that gets data from Kafka in real time and persists it to Cassandra (a minimal sketch appears after this list).
- Developed a Kafka consumer API in Python for consuming data from Kafka topics. Consumed Extensible Markup Language (XML) messages using Kafka and processed the XML files using Spark Streaming to capture User Interface (UI) updates.
- Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files. Loaded DStream data into Spark RDDs and performed in-memory computation to generate output responses.
- Experienced in writing live, real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for data sets processing and storage.
- Experienced in maintaining the Hadoop cluster on AWS EMR.
- Configured Snowpipe to pull data from S3 buckets into Snowflake tables.
- Stored incoming data in the Snowflake staging area.
- Created numerous ODI interfaces and loaded data into Snowflake DB. Worked on Amazon Redshift to consolidate multiple data warehouses into one.
- Designed and built data processing applications using Spark on an AWS EMR cluster that consume data from AWS S3 buckets, apply the necessary transformations and store the curated, business-ready datasets in Snowflake tables.
- Involved in the design and analysis of issues, providing solutions and workarounds to users and end clients.
- Extensively worked on developing Spark jobs in Python (Spark SQL) using Spark APIs.
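Below is a minimal sketch of a PySpark Structured Streaming job of the kind described in this role, reading JSON events from Kafka, parsing them against a schema and writing them out. The broker address, topic name, schema fields and S3 paths are placeholders, and the Parquet sink stands in for the production Cassandra/Snowflake sinks.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Assumes the spark-sql-kafka connector package is supplied at spark-submit time.
spark = SparkSession.builder.appName("learner-event-stream").getOrCreate()

# Hypothetical event schema; field names are illustrative only.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read the Kafka topic as a streaming DataFrame (broker/topic are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "learner-events")
       .load())

# Kafka values arrive as bytes; cast to string and parse the JSON payload.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("e"))
          .select("e.*"))

# Append the parsed events to a curated location; a real job would target Cassandra/Snowflake.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-bucket/curated/learner_events/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/learner_events/")
         .outputMode("append")
         .start())

query.awaitTermination()
```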
Confidential, TX
Azure Data Engineer
Responsibilities:
- Worked on Azure Data Factory to integrate both on-prem (MySQL, Cassandra) and cloud (Blob Storage, Azure SQL DB) data and applied transformations to load it into Azure Synapse.
- Involved in writing Spark Scala functions for mining data to provide real-time insights and reports. Configured Spark Streaming to receive real-time data from Apache Flume and stored the stream data in Azure Table storage using Scala.
- Used Azure Data Lake to store data and perform all types of processing and analytics.
- Ingested data into Azure Blob Storage and processed it using Databricks. Involved in writing Spark Scala scripts and UDFs to perform transformations on large datasets.
- Utilized Spark Streaming API to stream data from various sources. Optimized existing Scala code and improved the cluster performance.
- Involved in using Spark DataFrames to create various datasets and applied business transformations and data cleansing operations using Databricks notebooks.
- Efficient in writing Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows using Airflow and Apache NiFi (see the sketch after this list).
- Distributed tasks on Celery workers to manage communication between multiple services. Monitored the Spark cluster using Log Analytics and the Ambari Web UI. Transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
- Designed custom-built input adapters using Spark, Hive and Sqoop to ingest data from Snowflake, MS SQL and MongoDB into HDFS for analysis.
- Loaded data from Web servers and Teradata using Sqoop, Flume and Spark Streaming API.
- Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
- Used Flume sink to write directly to indexers deployed on cluster, allowing indexing during ingestion.
- Migrated from Oozie to Apache Airflow. Involved in developing Oozie and Airflow workflows for daily incremental loads, getting data from RDBMS sources (MongoDB, MS SQL).
- Managed resources and scheduling across the cluster using Azure Kubernetes Service (AKS), which was used to create, configure and manage a cluster of virtual machines.
- Extensively used Kubernetes to handle the online and batch workloads required to feed analytics and machine learning applications.
- Performed data extraction, aggregation and consolidation of Adobe data within AWS Glue using PySpark.
- Developed PySpark code for AWS Glue jobs and for EMR.
- Used Azure DevOps and VSTS (Visual Studio Team Services) for CI/CD, Active Directory for authentication and Apache Ranger for authorization.
- Experience in tuning Spark applications (batch interval time, level of parallelism, memory tuning) to improve processing time and efficiency.
- Used Scala for its strong concurrency support, which plays a key role in parallelizing the processing of large datasets.
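A minimal Airflow DAG sketch illustrating the daily incremental-load workflows mentioned in this role; the DAG id, task names and Python callables are hypothetical stand-ins for the real extract/transform/load steps.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables; in the real pipeline these wrapped the actual ingestion logic.
def extract(**context):
    print("extract incremental records from source systems")

def transform(**context):
    print("apply business transformations")

def load(**context):
    print("load curated data into the warehouse")

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_incremental_load",   # illustrative DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Linear dependency: extract -> transform -> load.
    t_extract >> t_transform >> t_load
```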
Confidential, Tempe, AZ
Big Data Engineer
Responsibilities:
- Worked on Cloudera CDH installation on EC2 instances, ensuring the setup for running MapReduce jobs using Hive & Spark.
- Worked on ETL jobs through Spark with the SQL, Hive, Streaming & Kudu contexts.
- Converted ETL pipelines to Scala code base and performed data accessibility to & from S3.
- Performed Sqoop ingestion through Oozie workflows from MS SQL Server and SAP HANA views.
- Implemented Slowly Changing Dimensions (SCDs) while populating the data to S3.
- Performed record joins using Hive and Spark Datasets and pushed the tables to Apache Kudu.
- Parsed PostgreSQL DDL into an Amazon Redshift-compatible form while building the data warehouse.
- Worked on migrating raw data to Amazon S3 and performed refined data processing.
- Wrote CloudFormation templates in JSON format to leverage content delivery with Cross-Region Replication using Amazon Virtual Private Cloud.
- Implemented columnar data storage, advanced compression and massively parallel processing using the multi-node Redshift feature.
- Contributed code to the next-generation Data Lake Accelerator, which leverages the Scala Spark APIs for processing records based on the datasets and schema files provided as parameters.
- Implemented Hadoop jobs on an EMR cluster, running several Spark, Hive & MapReduce jobs to process data for building recommendation engines, transactional fraud analytics and behavioral insights.
- Team player for Data Lake production support. The Data Lake typically supports over 750 million searches/day, 9 billion pricing inventory updates/day and 14 trillion automated transactions/year, generating around 1.2 TB of data daily.
- Populated the Data Lake by leveraging Amazon S3 service interactions made possible through Amazon Cognito, Boto and s3cmd (see the sketch after this list).
- Parsed data from S3 through Python API calls via Amazon API Gateway, generating a batch source for processing.
- Scheduled batch jobs through AWS Batch, performing data processing jobs by leveraging Apache Spark APIs through Scala.
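A minimal boto3 sketch of the S3 interactions used to populate the data lake, as referenced above; the bucket name, prefixes and file names are placeholders, not the actual project values.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-data-lake-bucket"  # hypothetical bucket name

# Upload a raw landing file into the data lake's raw zone.
s3.upload_file("pricing_updates.csv", BUCKET, "raw/pricing/pricing_updates.csv")

# List objects under the raw prefix so downstream batch jobs can pick them up.
response = s3.list_objects_v2(Bucket=BUCKET, Prefix="raw/pricing/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download a curated output produced by the Spark/EMR jobs for validation.
s3.download_file(BUCKET, "curated/pricing/part-00000.parquet", "/tmp/part-00000.parquet")
```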
Confidential, MA
Data Engineer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop.
- Worked with HiveQL on large volumes of log data to perform trend analysis of user behavior across various online modules (a minimal sketch appears after this list).
- Responsible for working on the most cutting-edge Big Data technologies.
- Developed Pig scripts for analyzing large data sets in the HDFS.
- Collected logs from the physical machines and the OpenStack controller and integrated them into HDFS using Flume.
- Designed and presented a plan for a POC on Impala.
- Involved in migrating HiveQL into Impala to minimize query response time.
- Responsible for creating Hive tables, loading the structured data resulting from MapReduce jobs into the tables and writing Hive queries to further analyze the logs to identify issues and behavioral patterns.
- Worked on Sequence files, RC files, map-side joins, bucketing and partitioning for Hive performance enhancement and storage improvement.
- Imported data from mainframe datasets to HDFS using Sqoop. Also handled importing data from various data sources (i.e. Oracle, DB2, Cassandra and MongoDB) into Hadoop and performed transformations using Hive and MapReduce.
- Implemented daily cron jobs that automate parallel tasks of loading data into HDFS using Oozie coordinator jobs.
- Responsible for performing extensive data validation using Hive.
- Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with historical data.
- Involved in loading data from Teradata database into HDFS using Sqoop queries.
- Involved in submitting and tracking MapReduce jobs using Job Tracker.
- Involved in creating Oozie workflow and Coordinator jobs to kick off the jobs on time for data availability.
- Used visualization tools such as Power View for Excel and Tableau for visualizing and generating reports.
- Exported data to Tableau and to Excel with Power View for presentation and refinement.
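A minimal sketch of the HiveQL trend-analysis work referenced above, shown here through PyHive as one possible way to submit such a query from Python; PyHive itself, along with the host, table and column names, is an illustrative assumption rather than a detail from the project.

```python
from pyhive import hive  # assumes a reachable HiveServer2 endpoint

# Hypothetical connection details.
conn = hive.Connection(host="hive-server.example.com", port=10000, username="etl_user")
cursor = conn.cursor()

# Daily active users per module over the last 30 days, from a web access logs table.
cursor.execute("""
    SELECT module,
           to_date(event_time) AS event_day,
           COUNT(DISTINCT user_id) AS daily_users
    FROM web_access_logs
    WHERE to_date(event_time) >= date_sub(current_date, 30)
    GROUP BY module, to_date(event_time)
    ORDER BY module, event_day
""")

for module, event_day, daily_users in cursor.fetchall():
    print(module, event_day, daily_users)

cursor.close()
conn.close()
```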
Confidential
ETL Developer
Responsibilities:
- Extensively used Informatica client tools: PowerCenter Designer, Workflow Manager, Workflow Monitor and Repository Manager.
- Extracted data from various heterogeneous sources like Oracle and flat files.
- Developed complex mappings using the Informatica PowerCenter tool.
- Extracted data from Oracle, flat files and Excel files and applied Joiner, Expression, Aggregator, Lookup, Stored Procedure, Filter, Router and Update Strategy transformations to load data into the target systems.
- Created Sessions, Tasks, Workflows and Worklets using Workflow manager.
- Developed workflow dependencies in Informatica using the Event Wait and Command tasks.
- Involved in analyzing the existence of the source feed in the existing CSDR database.
- Handled a high volume of day-to-day Informatica workflow migrations. Reviewed Informatica ETL design documents and worked closely with development to ensure correct standards were followed.
- Created new repositories from scratch and performed backup and restore. Experience in working with groups, roles and privileges and assigning them to each user group.
- Knowledge in Code change migration from Dev to QA and QA to Production.
- Worked on SQL queries against the repository DB to find deviations from the company's ETL standards for objects created by users, such as sources, targets, transformations, log files, mappings, sessions and workflows.
- Built UNIX scripts for cleaning up the source files.
- Involved in loading all the sample source data using SQL loader and scripts.
- Created Informatica workflows to load the source data into CSDR.
- Involved in creating various UNIX scripts used during the ETL load process.
- Periodically cleaned up Informatica repositories, monitored the daily load and shared the stats with the QA team.