AWS Data Engineer Resume
New York
SUMMARY
- IT professional with 7+ years of experience in development across the Big Data ecosystem - data acquisition, ingestion, modeling, storage, analysis, integration, data processing, cloud engineering, and data warehousing.
- Experience in large-scale application development in Big Data environments - AWS, Azure, and the Hadoop ecosystem (HDFS, MapReduce, and YARN), Spark, Kinesis, Hive, Impala, HBase, Airflow, Oozie, Zookeeper, and NiFi.
- Worked with the AWS cloud (EMR, EC2, RDS, EBS, S3, Lambda, Glue, Elasticsearch, Kinesis, SQS, DynamoDB, Redshift, and ECS).
- Experience in converting and transferring vast amounts of data into and out of AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB, through AWS Elastic MapReduce.
- Experience in AWS cloud services such as compute, network, and storage, as well as Identity & Access Management.
- Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with tasks running on Amazon SageMaker.
- Worked on the Azure cloud (HDInsight, AKS, Databricks, Data Lake, Blob Storage, Data Factory, Storage Explorer, SQL DB, and SQL DWH).
- Managed resources and scheduling across the cluster using Azure Kubernetes Service.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Azure Synapse Analytics, and Spark SQL.
- Migrated SQL databases to Azure Data Lake, Azure SQL Database, Databricks, and Azure SQL Data Warehouse.
- Controlled and provided database access and migrated on-premises databases to Azure Data Lake with Azure Data Factory.
- Worked on Azure Data Factory to connect on-premises (MySQL, Cassandra) and cloud-based (Blob Storage, Azure SQL DB) data and implemented transformations to load the data back into Azure Synapse.
- Participated in the development of data ingestion pipelines on an Azure HDInsight Spark cluster with Azure Data Factory and Spark SQL.
- Created several pipelines to import data from Azure Data Lake into a staging SQL DB and then into Azure SQL DB.
- Worked in Azure Storage Explorer to manage data - Azure blobs and files - and used ARM templates.
- Defined CI/CD (continuous integration and continuous deployment) pipelines and managed releases in multiple environments (test, pre-prod, and production) using Azure Pipelines, with automation using Jenkins, Git, Docker, and Kubernetes for ML model deployment.
- Created Azure Data Factory pipelines and used Azure Databricks notebooks to prepare the data and publish it in views consumed by Power BI reports.
- Used Kerberos, Azure AD, Sentry, and Ranger for maintaining security.
- Experience building PySpark, Spark, Scala, and SQL applications for interactive analysis, batch processing, and stream processing (a minimal PySpark sketch follows this summary).
- Solid knowledge of all phases of data acquisition, data warehousing (gathering requirements, design, development, implementation, testing, and documentation), data modeling (analysis using star and snowflake schemas for fact and dimension tables), data processing, and data transformations (mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting Hadoop clusters).
- Excellent at using Cloudera Manager for installation and management of single-node and multi-node Hadoop clusters (CDH4 and CDH5).
- Closely collaborated with business product, production support, and engineering teams on a regular basis to dive deep on data, enable effective decision making, and support analytics platforms.
- Worked on Hadoop cluster architecture and its key concepts - distributed file systems, parallel processing, high availability, fault tolerance, and scalability.
- Obtained and processed data from enterprise applications, clickstream events, API gateways, application logs, and database updates.
- Proficient at writing MapReduce jobs and UDFs to gather, analyze, transform, and deliver data as per business requirements.
- Strong working experience with SQL and NoSQL databases, data modeling, and data pipelines. Involved in end-to-end development and automation of ETL pipelines using SQL and Python.
- Proficient at using Spark APIs for streaming real-time data, staging, cleansing, applying transformations, and preparing data for machine learning needs.
- Worked with various streaming ingestion services for batch and real-time processing using Spark Streaming, Scala, Confluent, Storm, and Sqoop.
- Experience in developing Bash, T-SQL, and PL/SQL scripts.
- Experience using Sqoop to ingest data from RDBMS - Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.
- Worked on NoSQL databases and their integration with Hadoop clusters - HBase and DynamoDB.
- Leveraged different file formats - Parquet, Avro, ORC, and flat files.
- Sound experience in building production ETL pipelines between several source systems and the enterprise data warehouse by leveraging Informatica PowerCenter, SSIS, SSAS, and SSRS.
- Worked in both Agile and Waterfall environments. Used Git and SVN version control systems.
- Worked on developing Oozie workflows for scheduling and orchestrating the ETL process.
- Experience working in a fast-paced environment, both independently and in a collaborative way.
- Expertise in complex troubleshooting, root-cause analysis, and solution development.
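A minimal, illustrative PySpark batch job of the kind referenced above; the bucket paths, column names, and filter conditions are hypothetical placeholders rather than details from any specific engagement:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create a Spark session (on EMR or Databricks the runtime usually provides one).
spark = SparkSession.builder.appName("daily-batch-clean").getOrCreate()

# Read raw CSV data from a hypothetical S3 landing path.
raw = spark.read.option("header", "true").csv("s3://example-bucket/landing/orders/")

# Basic cleansing: drop duplicates, cast types, and filter out invalid rows.
cleaned = (
    raw.dropDuplicates(["order_id"])
       .withColumn("amount", F.col("amount").cast("double"))
       .filter(F.col("amount").isNotNull() & (F.col("amount") > 0))
)

# Persist the curated data as partitioned Parquet for downstream analysis.
cleaned.write.mode("overwrite").partitionBy("order_date").parquet("s3://example-bucket/curated/orders/")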
TECHNICAL SKILLS
AWS Services: EMR, EC2, RDS, EBS, S3, Lambda, Glue, Elasticsearch, Kinesis, SQS, DynamoDB, Redshift, and ECS
Azure Services: HDInsight, Databricks, Data Lake, Blob Storage, AKS, Data Factory, Storage Explorer, SQL DB, Snowflake, and SQL DWH
Big Data Ecosystem: HDFS, MapReduce, YARN, Spark, Kinesis, Hive, Impala, HBase, Oozie, Zookeeper, NiFi, Sqoop, Sentry, Ranger, Pig, Elasticsearch
Hadoop Services: Hadoop, Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP
Relational Databases: Microsoft SQL Server, MySQL, Oracle, DB2, PostgreSQL, Spark SQL, Teradata
Programming Languages: Python, PySpark, Scala, SQL, Shell Scripting, HiveQL
Build Tools: Jenkins, Maven, ANT
ETL Tools: Informatica, Talend
NoSQL Databases: HBase, MongoDB
Version Tools: Git, SVN, Bitbucket
Development Methodologies: Agile, Waterfall
Others: Terraform, Docker, Kubernetes, Jenkins, Splunk, Jira.
PROFESSIONAL EXPERIENCE
Confidential, New York
AWS Data Engineer
Responsibilities:
- Developing AWS Lambda functions that create EMR clusters and auto-terminate them after the job is done (see the Lambda sketch after this list).
- Developed data transition programs from DynamoDB to AWS Redshift (ETL process) using AWS Lambda by creating Python functions for specific events based on use cases.
- Involved in building a data pipeline and performed analytics using the AWS stack (EMR, EC2, EBS, Elasticsearch, Kinesis, SQS, DynamoDB, S3, RDS, Lambda, Glue, and Redshift).
- Writing PySpark applications that run on an Amazon EMR cluster, fetch data from the Amazon S3 data lake location, and queue it in Amazon SQS (Simple Queue Service).
- Built Kinesis dashboards and applications that respond to incoming data using AWS SDKs, and exported data from Kinesis to other AWS services, such as EMR for analytics, S3 for storage, Redshift for large-scale data, and Lambda for event-driven actions.
- Working on Spark SQL to analyze and apply transformations on DataFrames created from the SQS queue, then loading them into database tables for querying.
- Working on Amazon S3 to persist the transformed Spark DataFrames in S3 buckets, using Amazon S3 as a data lake for the data pipeline running on Spark.
- Developing logging functions that store pipeline logs in Amazon S3 buckets.
- Developing email reconciliation reports for ETL loads in the Spark framework.
- Building PySpark applications for interactive analysis, batch processing, and stream processing.
- Configuring Spark executor memory to speed up Spark jobs, developing unit tests for PySpark jobs, and performing tuning by analyzing Spark logs and job metrics.
- Worked with the data science team running machine learning models on the Spark EMR cluster and delivered the data needed as per business requirements.
- Utilized Spark's in-memory capabilities to handle large datasets on the S3 data lake. Loaded data into S3 buckets, then filtered it and loaded it into Hive external tables.
- Strong hands-on experience in creating and modifying SQL stored procedures, functions, views, indexes, and triggers.
- Developing an end-to-end ETL data pipeline that takes data from the source and loads it into the RDBMS using Spark.
- Worked on AWS Step Functions to automate and orchestrate Amazon SageMaker-related tasks such as publishing data to S3, training the ML model, and deploying it for prediction.
- Developing data load functions, which read the schema of the input data and load the data into a table.
- Involved in extracting and enriching multiple DynamoDB table transformations.
- Automated the process of transforming and ingesting terabytes of monthly data using AWS Kinesis, S3, Lambda and Oozie.
- Performed ETL operations using Python, SparkSQL, S3 and Redshift on terabytes of data to obtain customer insights.
- Involved in writing Python scripts to automate the process of extracting weblogs using Airflow DAGs.
- Involved in writing unit tests; worked along with the DevOps team on installing libraries, Jenkins agents, and ETL jobs.
- Used Ansible to provision the environment and deployed applications in a CI/CD process using a Jenkins pipeline. Also managed and deployed configurations using Terraform.
- Experience with analytical reporting and facilitating data for Tableau dashboards.
- Used Git for version control and Jira for project management, tracking issues and bugs.
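A minimal sketch of the transient-EMR Lambda pattern described in the first bullet, using boto3's run_job_flow; the cluster sizing, release label, bucket, and script paths are hypothetical placeholders:

import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    # Launch a transient EMR cluster that runs one Spark step and then
    # terminates itself because KeepJobFlowAliveWhenNoSteps is False.
    response = emr.run_job_flow(
        Name="transient-spark-etl",
        ReleaseLabel="emr-6.10.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the last step
        },
        Steps=[{
            "Name": "spark-etl",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-bucket/jobs/etl_job.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return {"cluster_id": response["JobFlowId"]}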
Environment: Hadoop 2.x, Hive v2.3.1, Spark v2.1.3, AWS, EC2, S3, Lambda, Glue, Elasticsearch, RDS, DynamoDB, Redshift, ECS, Python, SQL, Sqoop v1.4.6, AWS Kinesis, Airflow v1.9.0, Oracle, Teradata, Tableau, Git, Jira.
Confidential, New York
Azure Data Engineer
Responsibilities:
- Experience working with the Azure cloud platform (HDInsight, Databricks, Data Lake, Blob Storage, Azure Data Factory, Azure Kubernetes Service, Synapse, SQL DB, and SQL DWH).
- Performed data cleansing and applied transformations using Databricks and Spark for data analysis (a minimal Databricks sketch follows this list).
- Designed and automated custom-built input adapters using Spark, Sqoop, and Oozie to ingest and analyze data from RDBMS sources into Azure Data Lake.
- Involved in developing automated workflows for daily incremental loads, moved data from traditional RDBMS to Data Lake.
- Used Azure Data Factory, the SQL API, and the Mongo API to integrate data from MS SQL, MongoDB, and cloud sources (Blob Storage, Azure SQL DB).
- Experience in loading data into Snowflake DB in the cloud from various sources and validating the data feed from the source system to Snowflake DW Cloud platform.
- Extensive knowledge of managing resources across the cluster using Azure Kubernetes Service and automating CI/CD pipelines for ML model deployment using Kubernetes, Docker, etc.
- Created automated ETL jobs in Talend, pushed data to Azure SQL Data Warehouse, and processed the data in Azure Databricks.
- Monitored Spark cluster using Log Analytics. Transitioned log storage from MS SQL to Cosmos DB and improved the query performance.
- Involved in creating database objects like tables, views, stored procedures, triggers, packages, and functions using T-SQL to provide structure and maintain data efficiently.
- Extensive experience working with SQL, with strong knowledge of T-SQL (MS SQL Server).
- Used Azure Synapse to manage processing workloads and served data for BI and prediction needs.
- Developed Spark scripts for mining data and performed transformations on large datasets to provide real time insights and reports.
- Extensively used Databricks notebooks for interactive analytics using Spark APIs.
- Supported the analytical platform, handled data quality, and improved performance for higher-order functions, lambda expressions, pattern matching, and collections.
- Optimized existing Python code and improved the cluster performance.
- Developed JSON scripts to deploy the pipeline in Azure Data Factory and the JSON file format for StreamSets.
- Involved in building an enterprise data lake using Data Factory and Blob Storage, enabling other teams to work with more complex scenarios and ML solutions.
- Extensive knowledge of data transformations, mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting Hadoop clusters.
- Worked with the data science team on pre-processing and feature engineering, and assisted with machine learning algorithms running in production.
- Reduced access time by refactoring data models and optimizing queries, and implemented a Redis cache to support Snowflake.
- Facilitated data for interactive Power BI dashboards and reporting purposes.
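A minimal sketch of the Databricks cleansing flow described above; the storage account, container, and column names are hypothetical placeholders, and on Databricks the spark session object is provided by the notebook runtime:

from pyspark.sql import functions as F

# Read raw CSV files from a hypothetical ADLS Gen2 container.
raw = (spark.read
       .option("header", "true")
       .csv("abfss://raw@examplestorage.dfs.core.windows.net/sales/"))

# Cleanse: de-duplicate, normalize the date column, and drop rows that fail parsing.
cleaned = (raw
           .dropDuplicates(["transaction_id"])
           .withColumn("sale_date", F.to_date("sale_date", "yyyy-MM-dd"))
           .filter(F.col("sale_date").isNotNull()))

# Write curated Parquet back to the lake and expose a view for reporting.
(cleaned.write.mode("overwrite")
        .parquet("abfss://curated@examplestorage.dfs.core.windows.net/sales/"))
cleaned.createOrReplaceTempView("sales_curated")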
Environment: Azure (HDInsight, Databricks, Data Lake, Blob Storage, MongoDB, Talend, Data Factory, SQL DB, SQL DWH, AKS), Python, Hadoop 2.x, Spark v2.0.2, NLP, Airflow v1.8.2, Hive v2.0.1, Sqoop v1.4.6, HBase, Oozie, Cosmos DB, MS SQL, Power BI, Azure DevOps, Git, Kubernetes, Docker.
Confidential, New York
Cloud Data Engineer
Responsibilities:
- Designed and deployed multi-tier applications using AWS services (EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto scaling in AWS CloudFormation.
- Used AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
- Built big data e-commerce solutions to improve business processes, using the Kinesis agent to ingest data, Kinesis Data Streams to stream data in real time, and Lambda to process the data, populate it into DynamoDB, and send email notifications (see the Lambda sketch after this list).
- Improved data warehousing and visualization using serverless services such as Glue on S3 data lakes for data cataloging.
- Contributed to designing and implementing infrastructure in AWS by launching and configuring EC2, S3, IAM, VPC, security groups, auto scaling, and Elastic Load Balancers (ELBs) using Terraform and Ansible.
- Involved in migrating large amounts of data from an on-prem Cloudera cluster to EC2 instances deployed on an Elastic MapReduce (EMR) cluster.
- Gathered data and performed analytics using the AWS stack (EMR, EC2, S3, RDS, and Lambda).
- Developed an ETL pipeline to extract archived logs from disparate sources and stored in S3 Data Lake.
- Implemented Spark Java UDFs to handle data quality, filtering, and data validation checks.
- Analyzed and optimized pertinent data stored in Snowflake using PySpark and SparkSQL.
- Responsible for building scalable and distributed data solutions using Cloudera CDH.
- Worked with Impala for massively parallel processing of queries for ad-hoc analysis. Designed and developed complex queries using Hive and Impala for a logistics application.
- Created Bash scripts to add dynamic partitions to Hive staging tables. Responsible for loading bulk amount of data into HBase using MapReduce jobs.
- Loaded data from web servers using Spark Streaming API.
- Involved in creating database schemas and objects like tables, views, stored procedures, triggers, packages, and functions to provide structure and maintain data efficiently.
- Managed and deployed configurations for the entire data center infrastructure using Terraform.
- Presented creative business insights wif KPI reports and delivered actionable insights by identifying significant and correlated variables.
- Involved in setting up CI/CD pipelines using Jenkins. Worked along with the DevOps team and managed the Jenkins integration service with Puppet.
- Also worked on resolving several tickets generated when issues arose in production pipelines.
- Used Kerberos for authentication and Apache Sentry for authorization.
- Used Git for version control and interacted with the onsite team on deliverables.
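A minimal sketch of the Kinesis-to-DynamoDB Lambda processing described above; the table name and record fields are hypothetical placeholders:

import base64
import json
from decimal import Decimal

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("order-events")  # hypothetical table name

def lambda_handler(event, context):
    # Kinesis delivers record payloads base64-encoded; decode, parse, and persist each one.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        # DynamoDB does not accept Python floats, so parse numbers as Decimal.
        item = json.loads(payload, parse_float=Decimal)
        table.put_item(Item=item)
    return {"processed": len(event["Records"])}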
Environment: Cloudera CDH4, Zeppelin, Hadoop 2.x, AWS (EMR, EC2, S3, Lambda, RDS, DynamoDB), Snowflake, Spark v1.6.0, Python, Java, Scala, Sqoop v1.4.5, Hive v1.2.1, Impala, HBase, Spring Boot, StreamSets, Oozie v4.1.0, Terraform, Zookeeper, Druid, PostgreSQL, Jenkins, Arcadia, Sentry, Kerberos.
Confidential
Cloud Data Engineer
Responsibilities:
- Played a key role in gathering business requirements, system and design requirements, gap analysis, use case diagrams and flow charts.
- Performed ETL operations with Informatica PowerCenter - data extraction, staging, applying transformations, and storing the results in the target data center.
- Parsed complex files using Informatica data transformations (Normalizer, Lookup, Source Qualifier, Expression, Aggregator, Sorter, Rank, and Joiner) and loaded them into databases.
- Created complex SQL queries and scripts to extract, aggregate, and validate data from MS SQL, Oracle, and flat files using Informatica, and loaded it into a single data warehouse repository.
- Involved in creating database objects like tables, views, stored procedures, triggers, packages, and functions using T-SQL to provide structure and maintain data efficiently.
- Designed SQL, SSIS, and Python-based batch and real-time ETL pipelines to extract data from transactional and operational databases and load the data into target databases/data warehouses.
- Involved in writing Python scripts to extract data from different APIs (see the sketch after this list).
- Responsible for collecting, scrubbing, and extracting data; generated compliance reports using SSRS; analyzed and identified market trends to improve product sales.
- Experienced in writing MapReduce programs and UDFs for both Pig and Hive to pre-process the data for analysis.
- Validated Sqoop jobs and shell scripts, and performed data validation to check that data was loaded correctly without any discrepancy.
- Performed migration and testing of static data and transaction data from one core system to another.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the Data science team.
- Performed data profiling, answered complex business questions by providing data to business users.
- Generated DDL and created the tables and views in the corresponding architectural layers.
- Extracted, transformed, and analyzed measures/indicators from multiple sources to generate reports, dashboards, and analytical solutions.
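A minimal sketch of the kind of API-extraction script mentioned above; the endpoint, pagination scheme, and output file are hypothetical placeholders:

import csv
import requests

API_URL = "https://api.example.com/v1/transactions"  # hypothetical endpoint

def extract(page_size=500):
    """Page through the API and yield individual records."""
    page = 1
    while True:
        resp = requests.get(API_URL, params={"page": page, "per_page": page_size}, timeout=30)
        resp.raise_for_status()
        rows = resp.json()
        if not rows:
            break
        yield from rows
        page += 1

def to_csv(path="transactions.csv"):
    # Flatten the extracted records into a CSV staging file for downstream loading.
    rows = list(extract())
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    to_csv()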
Environment: PySpark, Spark Streaming, Apache Kafka, Apache NiFi, Hive, Tez, AWS, ETL, UNIX, Linux, Tableau, Teradata, Pig, Sqoop, Oozie, Scala, Python, GIT.
Confidential
Python Developer
Responsibilities:
- Worked with the Hortonworks distribution. Installed, configured, and maintained a Hadoop cluster based on the business requirements.
- Involved in end-to-end implementation of ETL pipelines using Python and SQL for high-volume analytics; also reviewed use cases before onboarding to HDFS.
- Responsible for loading, managing, and reviewing terabytes of log files.
- Involved in writing rack topology scripts and Java MapReduce programs to parse raw data.
- Migrated from JMS (Solace) to Apache Kafka and used Zookeeper to manage synchronization, serialization, and coordination across the cluster.
- Used Sqoop to migrate data between traditional RDBMS and HDFS. Ingested data from MS SQL, Teradata, and Cassandra databases.
- Identified required tables, views and exported them into Hive. Performed ad-hoc queries using Hive joins, partitioning, bucketing techniques for faster data access.
- Used Nifi to automate the data flow between disparate systems. Designed dataflow models and complicated target tables to obtain relevant metrics from various sources.
- Developed Bash scripts to get log files from FTP server and executed Hive jobs to parse them.
- Implemented various Hive queries for analytics. Created External tables, optimized Hive queries and improved the cluster performance by 30%.
- Performed data analysis using HiveQL, Pig Latin and custom MapReduce programs in Java.
- Enhanced scripts of existing Python modules. Worked on writing APIs to load the processed data to HBase tables (a minimal sketch follows this list).
- Migrated ETL jobs to Pig scripts to apply joins, aggregations, and transformations.
- Worked wif Jenkins build and continuous integration tools. Involved in writing Groovy scripts to automate the Jenkins pipeline's integration and delivery service.
- Used Power BI as a front-end BI tool and MS SQL Server as a back-end database to design and develop dashboards, workbooks, and complex aggregate calculations.
- Troubleshot defects by identifying root causes and fixed them during the QA phase.
- Used Jenkins for CI/CD and SVN for version control.
- Played a key role in gathering business requirements, system and design requirements, gap analysis, use case diagrams and flow charts.
- Performed ETL operations with Informatica PowerCenter - data extraction, staging, applying transformations, and storing the results in the target data center.
- Parsed complex files using Informatica data transformations (Normalizer, Lookup, Source Qualifier, Expression, Aggregator, Sorter, Rank, and Joiner) and loaded them into databases.
- Responsible for collecting, scrubbing, and extracting data; generated compliance reports using SSRS; analyzed and identified market trends to improve product sales. Performed data profiling and answered complex business questions by providing data to business users.
- Generated DDL and created the tables and views in the corresponding architectural layers.
- Extract, transform and analyze measures/indicators from multiple sources to generate reports, dashboards, and analytical solutions.
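A minimal sketch of loading processed records into HBase from Python; this assumes the happybase client over an HBase Thrift gateway, and the host, table, and column-family names are hypothetical placeholders:

import happybase

# Connect to a hypothetical HBase Thrift gateway.
connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("web_logs")

def load_rows(records):
    """Write parsed log records into HBase, one row per record."""
    with table.batch(batch_size=1000) as batch:
        for rec in records:
            row_key = rec["event_id"].encode("utf-8")
            batch.put(row_key, {
                b"cf:timestamp": rec["timestamp"].encode("utf-8"),
                b"cf:status": str(rec["status"]).encode("utf-8"),
            })

# Example usage with a single parsed record.
load_rows([{"event_id": "e1", "timestamp": "2016-01-01T00:00:00Z", "status": 200}])
connection.close()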
Environment: Hortonworks 2.0, Hadoop, Hive v1.0.0, HBase, Sqoop v1.4.4, Pig v0.12.0, Zookeeper, Kafka v0.8.1, Nifi, Ambari, Python, SQL, Java, Teradata, MS SQL, Cassandra, Power BI, Jenkins, SVN, Jira, Python, Informatica v9.x, MS SQL SERVER, T-SQL, SSIS, SSRS, SQL Server Management Studio, Oracle, Excel.
Confidential
Java Developer
Responsibilities:
- Developed an IAM application using Spring, Java EE, Oracle, Okta, Redis, and Postman with a microservices architecture and REST services.
- Involved in all the phases of Software Development Life Cycle (requirements gathering, analysis, design, development, testing, and maintenance).
- Incorporated UML diagrams (Class diagrams, Activity diagrams, Sequence diagrams) as part of design documentation and system documentation.
- Involved in the CI/CD process using Jenkins and Git. Migrated from SiteMinder (single sign-on) to Okta and OAuth 2.0. Used JUnit and Mockito frameworks for writing unit tests.
- Implemented features like logging, auditing, and session validation using Spring AOP and IoC modules. Performed data operations by wiring Spring ORM with Hibernate.
- Used Log4J for logging and notification tracing mechanisms, Splunk for analyzing application performance, JFrog to store the artifacts.
- Worked on front-end change requests where I tweaked code and added functionality using JavaScript, HTML, and CSS.
- Coordinated tasks wif other developers on the team to meet deadlines. Responsible to complete and update the status of Jira tickets assigned to me on time.
- Involved in production deployment procedures such as silencing monitors, deploying releases, testing all the login flows.
Environment: Java/J2EE, Spring, Oracle, PL/SQL, Redis, Linux, JDBC, maven, Git, Jira, Okta, OAuth 2.0, Hibernate, Jenkins, HTML, CSS, JavaScript, XML, Junit, Mockito, Postman, Tomcat, Splunk, Log4j, JFrog.