AWS Data Engineer Resume
Seattle, WA
SUMMARY
- 7+ years of experience in Data Warehousing, with exposure to design, modeling, development, testing, maintenance, and customer support across multiple domains.
- Extensive experience in Azure Cloud, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), Azure HDInsight, big data technologies (Hadoop and Apache Spark), and Databricks.
- Experience in designing and implementing cloud architecture on Microsoft Azure.
- Excellent knowledge of integrating Azure Data Factory V1/V2 with a variety of data sources and processing the data using pipelines, pipeline parameters, activities, activity parameters, and manual/window-based/event-based job scheduling.
- Hands-on experience in developing Logic App workflows for event-based data movement, file operations on Data Lake, Blob Storage, and SFTP/FTP servers, and retrieving/manipulating data in Azure SQL Server.
- Implemented Azure Active Directory Service for authentication of Azure Data Factory.
- Extensively worked on AWS services such as EC2, S3, EMR, RDS (Aurora), Athena, Lambda, Step Functions, Glue Data Catalog, SNS, and Redshift.
- Worked on Data Warehouse design, implementation, and support (SQL Server, Azure SQL DB, Azure SQL Data Warehouse).
- Experience in implementing ETL and ELT solutions using large data sets.
- Experience in creating database objects such as tables, constraints, indexes, views, indexed views, stored procedures, UDFs, and triggers on Microsoft SQL Server.
- Strong experience in writing and tuning complex SQL queries, including joins, correlated subqueries, and scalar subqueries.
- Identified, designed, and implemented process improvements by automating manual processes, optimizing data delivery, and re-designing infrastructure for greater scalability.
- Experienced and involved in all phases of the SDLC: requirement gathering, analysis, design, coding, code reviews, configuration control, QA, and deployment.
- Experience in Agile/SCRUM methodology.
- Designed and developed an automation framework using Python and shell scripting.
- Proficient in converting Hive/SQL queries into Spark transformations using DataFrames and Datasets (see the sketch after this list).
- Experience working with Git and Bitbucket version control systems.
- Used Flume sinks to drain data from Flume channels and deposit it into NoSQL databases such as MongoDB.
- Knowledge of integrated development environments such as Eclipse, NetBeans, IntelliJ, and STS.
- Involved in loading data from UNIX file systems and FTP into HDFS.
- Hands-on experience with visualization tools such as Tableau and Power BI.
- Good knowledge of data modeling and data analytics tools, with exposure to different big data platforms.
- Wrote complex HiveQL queries for data extraction from Hive tables and developed Hive User Defined Functions (UDFs) as required.
- Knowledge of job workflow scheduling and coordination tools/services such as Oozie, ZooKeeper, Airflow, and Apache NiFi.
- Strong experience in and knowledge of real-time data analytics using Spark Streaming and Flume.
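As an illustration of the Hive/SQL-to-DataFrame conversion mentioned above, the following is a minimal PySpark sketch; the database, table, and column names (sales.orders, order_date, amount, region) are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hive-to-dataframe").enableHiveSupport().getOrCreate()

# HiveQL form of the query.
sql_df = spark.sql("""
    SELECT region, COUNT(*) AS order_count, SUM(amount) AS revenue
    FROM sales.orders
    WHERE order_date >= '2021-01-01'
    GROUP BY region
""")

# The same logic expressed with the DataFrame API.
df = (
    spark.table("sales.orders")
    .filter(F.col("order_date") >= "2021-01-01")
    .groupBy("region")
    .agg(F.count("*").alias("order_count"), F.sum("amount").alias("revenue"))
)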
TECHNICAL SKILLS
Hadoop/Big Data Ecosystem: Apache Spark, HDFS, MapReduce, Hive, Kafka, Sqoop
Programming & Scripting: Python, PySpark, SQL, Scala
NoSQL Databases: MongoDB, DynamoDB
SQL Databases: MS SQL Server, MySQL, Oracle, PostgreSQL
Cloud Computing: AWS, Azure
Operating Systems: Ubuntu (Linux), macOS, Windows 10/8
Reporting: Power BI, Tableau
Version Control: Git, GitHub, SVN
Methodologies: Agile/ Scrum, Rational Unified Process and Waterfall
PROFESSIONAL EXPERIENCE
Confidential, Seattle, WA
AWS Data Engineer
Responsibilities:
- Created an end-to-end data pipeline covering data ingestion, curation, and provisioning using AWS cloud services.
- Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from both batch and streaming sources.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop.
- Ingested data into S3 buckets from different sources, including MySQL, Oracle, MongoDB, and SFTP.
- Proficient in working with AWS services such as S3, EC2, EMR, Redshift, Athena, Glue, DynamoDB, RDS, and IAM.
- Worked with big data services and concepts such as Spark RDDs, the DataFrame, Dataset, and Data Source APIs, Spark SQL, and Spark Streaming.
- Created EMR clusters on EC2 instances, developed PySpark applications to perform data transformations on them, and stored the results in Redshift.
- Configured AWS Identity and Access Management (IAM) Groups and Users for improved login authentication
- Used AWS Lambda with Python scripts to automate loading data in multiple formats (Parquet, JSON, Avro, CSV) from AWS S3 into AWS Redshift (a Lambda sketch follows this list).
- Created an event-driven AWS Glue ETL pipeline using a Lambda function, reading the data from the S3 bucket and storing it in Redshift on a daily basis.
- Developed Python scripts using the Boto3 library to configure AWS services such as Glue, EC2, S3, and DynamoDB.
- Tuned Spark applications to set the batch interval time, the correct level of parallelism, and memory usage.
- Used Spark Streaming APIs to perform transformations and actions on data arriving from Kafka in real time and persisted it to AWS S3.
- Developed a Kafka consumer in Python for consuming data from Kafka topics (a consumer sketch also follows this list).
- Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering.
- Extracted data from HDFS using Hive, performed data analysis using PySpark and Redshift for feature selection, and created nonparametric models in Spark.
- Used SQL DDL, DQL, and DML commands for creating, selecting, and modifying tables.
- Used AWS Redshift, S3, and Athena to query large amounts of data stored on S3 and create a virtual data lake without having to go through the ETL process.
- Wrote several MapReduce jobs using PySpark and NumPy, and used Jenkins for continuous integration.
- Involved in writing custom MapReduce programs using the Java API for data processing.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources with file formats such as ORC, Parquet, and text files into AWS S3.
- Built import and export jobs to copy data to and from HDFS using Sqoop, and developed Spark SQL code for faster testing and processing of data.
- Developed Sqoop and Kafka jobs to load data from RDBMS into HDFS and Hive.
- Developed applications in the Linux environment and am familiar with its commands; worked with the Jenkins continuous integration tool and deployed the project through Jenkins using the Git version control system.
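A minimal sketch of the Lambda-driven S3-to-Redshift load referenced above, using the Boto3 Redshift Data API; the target table, IAM role, and environment variable names are hypothetical placeholders.

import os
import boto3

redshift = boto3.client("redshift-data")

def handler(event, context):
    # Triggered by an S3 ObjectCreated event; issue a Redshift COPY for the new object.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    copy_sql = (
        f"COPY staging.events FROM 's3://{bucket}/{key}' "   # hypothetical target table
        f"IAM_ROLE '{os.environ['REDSHIFT_COPY_ROLE_ARN']}' "
        f"FORMAT AS PARQUET"
    )

    redshift.execute_statement(
        ClusterIdentifier=os.environ["REDSHIFT_CLUSTER_ID"],
        Database=os.environ["REDSHIFT_DATABASE"],
        DbUser=os.environ["REDSHIFT_DB_USER"],
        Sql=copy_sql,
    )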
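And a minimal sketch of a Python Kafka consumer (using the kafka-python package); the topic, group id, and broker address are hypothetical placeholders.

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "web-logs",                                   # hypothetical topic
    bootstrap_servers=["localhost:9092"],
    group_id="log-aggregator",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Each record is a decoded web-log event; downstream handling
    # (e.g. batching records to S3) would go here.
    print(message.topic, message.partition, message.offset, message.value)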
Environment: Spark, Spark-Streaming, PySpark, Spark SQL, AWS EMR, S3, EC2, Redshift, Athena, Lambda, Glue, DynamoDB, MapReduce, Java, HDFS, Hive, Pig, Apache Kafka, Python, Shell scripting, Linux, MySQL, NoSQL, SOLR, Jenkins, Oracle, Git, Airflow, Tableau, Power BI.
Confidential, Plano, TX
Azure Data Engineer
Responsibilities:
- Implemented Azure Data Factory (ADF) extensively for ingesting data from different source systems, both relational and unstructured, to meet business functional requirements.
- Designed and developed batch and real-time processing solutions using ADF and Databricks clusters.
- Created numerous pipelines in Azure Data Factory v2 to get data from disparate source systems using activities such as Copy, ForEach, Databricks, and transformation activities.
- Maintained and supported optimal pipelines and complex data transformations and manipulations using ADF and PySpark with Databricks (a Databricks sketch follows this list).
- Automated jobs using different ADF triggers such as event, schedule, and tumbling window triggers.
- Created and provisioned Databricks clusters, notebooks, and jobs, and configured autoscaling.
- Implemented Azure self-hosted integration runtime to access data on private networks.
- Used Azure Logic Apps to develop workflows which can send alerts/notifications on different jobs in Azure.
- Experienced in developing an audit, balance, and control framework using SQL DB audit tables to control the ingestion, transformation, and load process in Azure.
- Created linked services to connect external resources to ADF.
- Worked with complex SQL views and stored procedures in large databases across various servers.
- Ensured that developed solutions were formally documented and signed off by the business.
- Worked with team members on technical issue resolution, troubleshooting, and project risk and issue identification and management.
- Worked on cost estimation, billing, and implementation of services on the cloud.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL; also worked with Cosmos DB (SQL API and Mongo API).
- Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest data from Snowflake, MS SQL, and MongoDB into HDFS for analysis.
- Loaded data from web servers and Teradata using Sqoop, Flume, and the Spark Streaming API.
- Used Flume sinks to write directly to indexers deployed on the cluster, allowing indexing during ingestion.
- Migrated from Oozie to Apache Airflow; involved in developing Oozie and Airflow workflows for daily incremental loads, getting data from source databases (MongoDB, MS SQL).
- Developed MapReduce jobs in Scala, compiling the program code into JVM bytecode for data processing.
- Proficient in building interactive Power BI dashboards and reports based on business requirements.
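A minimal Databricks PySpark sketch of the kind of batch transformation described above, assuming the notebook-provided spark session and access to ADLS Gen2; the storage account, container, and column names are hypothetical placeholders.

from pyspark.sql import functions as F

# Hypothetical ADLS Gen2 paths (raw and curated zones).
raw_path = "abfss://raw@examplestorage.dfs.core.windows.net/sales/orders/"
curated_path = "abfss://curated@examplestorage.dfs.core.windows.net/sales/orders_daily/"

orders = (
    spark.read.format("csv")        # `spark` is provided by the Databricks notebook
    .option("header", "true")
    .option("inferSchema", "true")
    .load(raw_path)
)

daily = (
    orders.withColumn("order_date", F.to_date("order_ts"))
    .groupBy("order_date", "region")
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("order_count"))
)

daily.write.mode("overwrite").partitionBy("order_date").format("delta").save(curated_path)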
Environment: Azure HDInsight, Databricks, Data Lake, Cosmos DB, MySQL, Snowflake, MongoDB, Teradata, Ambari, Flume, VSTS, Tableau, Power BI, Azure DevOps, Ranger, Azure AD, Git, Blob Storage, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, YARN), Spark v2.0.2, Airflow, Hive, Sqoop, HBase.
Confidential, Raleigh, NC
Data Engineer
Responsibilities:
- Experienced working with big data, data visualization, Python development, SQL, and UNIX.
- Expertise in quantitative analysis, data mining, and the presentation of data to see beyond the numbers and understand trends and insights.
- Handled a high volume of day-to-day Informatica workflow migrations.
- Reviewed Informatica ETL design documents and worked closely with development to ensure correct standards were followed.
- Wrote Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows using Airflow and Apache NiFi; tasks are distributed to Celery workers to manage communication between multiple services (an Airflow sketch follows this list).
- Monitored the Spark cluster using Log Analytics and the Ambari Web UI; transitioned log storage from Cassandra to Azure SQL Data Warehouse and improved query performance.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL; also worked with Cosmos DB (SQL API and Mongo API).
- Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest data from Snowflake, MS SQL, and MongoDB into HDFS for analysis.
- Loaded data from web servers and Teradata using Sqoop, Flume, and the Spark Streaming API.
- Used Flume sinks to write directly to indexers deployed on the cluster, allowing indexing during ingestion.
- Migrated from Oozie to Apache Airflow; involved in developing Oozie and Airflow workflows for daily incremental loads, getting data from source databases (MongoDB, MS SQL).
- Identified and documented data quality limitations that jeopardize the work of internal and external data analysts; wrote standard SQL queries to perform data validation, created Excel summary reports (pivot tables and charts), and gathered analytical data to develop functional requirements using data modeling and ETL tools.
- Performed ETL data cleansing, integration, and transformation using Hive and PySpark; responsible for managing data from disparate sources.
- Used Spark optimization techniques such as caching/refreshing tables, broadcast variables, coalesce/repartition, increasing memory overhead limits, handling parallelism, and modifying the Spark default configuration variables for performance tuning.
- Read data from sources such as CSV files, Excel, HTML pages, and SQL databases, performed data analysis, and wrote results back to targets such as CSV files, Excel, and databases.
- Developed and handled business logic through backend Python code.
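A minimal Airflow DAG sketch of the daily ETL workflow described above; the DAG id, task names, and the extract/load callables are hypothetical placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    # Pull the day's records from the source system (placeholder logic).
    print("extracting for", context["ds"])


def load(**context):
    # Load the curated records into the warehouse (placeholder logic).
    print("loading for", context["ds"])


with DAG(
    dag_id="daily_incremental_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task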
Environment: Python, UNIX, SQL, ETL, Informatica, Spark, HTML, Azure HDInsight, Databricks, Data Lake, Cosmos DB, MySQL, Snowflake, MongoDB, Teradata, Ambari, Flume, VSTS, Tableau, Power BI, Data Factory, Data Storage Explorer, Scala, Hadoop 2.x (HDFS, MapReduce, YARN), Spark v2.0.2, Airflow, Hive, Sqoop, HBase.
Confidential
Big Data Engineer
Responsibilities:
- Transformed data using Merge Join, Derived Column, Conditional Split, Lookup, Union All, Sort, and Slowly Changing Dimension transformations.
- Created conditional data flows using Script Components, expressions, and variables.
- Developed packages configurable across application environments through project parameters, environment variables, and package-level variables.
- Created database objects such as SQL views, synonyms, functions, and stored procedures for use by the application and reporting teams.
- Involved in debugging complex SSIS packages, SQL objects, and SQL job workflows.
- Utilized TFS for source control and for task and bug management.
- Involved in gathering and analyzing business requirements from end users and internal business analysts to develop strategies for ETL processes.
- Performed data analysis for complex business issues to provide recommendations for resolving business problems.
- Experience in performance tuning: identified and fixed bottlenecks and tuned complex mappings for better performance.
- Involved in requirements analysis and legacy system data analysis to design and implement ETL jobs using Microsoft SQL Server Integration Services (SSIS)
- Analyzed the data sources from Oracle and SQL Server for design, development, testing, and production rollover of reporting and analysis projects within x Desktop.
- Partnered with Development and Quality Assurance teams to ensure product quality was always intact.
- Resolved production support issues using SQL and PL/SQL to prevent any interference with the business.
- Led a team of 3 interns to develop prototypes of predictive models in MS Excel to forecast the data team's future infrastructure costs (software, hardware, development, and operational) for the growing data, based on historical data (an illustrative sketch follows this list).
- Estimated a 5% improvement in operational efficiency from implementing predictive analytics to estimate future infrastructure costs (software, hardware, development, and operational).
- Analyzed supply chain management revenue reports to understand the business profits generated from existing suppliers and sub-suppliers for circuit breaker components.
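An illustrative Python analogue of the cost-forecast prototype (the original model was built in MS Excel); the workbook name, sheet, and column names are hypothetical placeholders.

import numpy as np
import pandas as pd

# Historical costs exported from the Excel workbook (hypothetical file/columns:
# "year" and "total_cost" covering software, hardware, development, operational).
history = pd.read_excel("infrastructure_costs.xlsx", sheet_name="history")

# Fit a simple linear trend to the yearly totals.
slope, intercept = np.polyfit(history["year"], history["total_cost"], deg=1)

next_year = int(history["year"].max()) + 1
forecast = slope * next_year + intercept
print(f"Projected infrastructure cost for {next_year}: {forecast:,.0f}")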
Environment: Python, SQL, ETL, Oracle, SSIS, PL/SQL, MS Excel