
AWS Data Engineer Resume


Irving, Texas

SUMMARY

  • 8+ years of IT experience in Big Data Analytics, HDFS, YARN, Hadoop Ecosystem, MapReduce, and Shell Scripting.
  • Real-time experience with Hadoop ecosystem components such as HDFS (storage), MapReduce (processing), YARN (resource management), Pig, Sqoop, Hive, Oozie, HBase, Zookeeper, Scala, Git, Avro, JSON, and Spark for data storage and analysis.
  • Expertise in AWS databases such as RDS and Aurora.
  • Profound experience using AWS products such as S3, EC2, Redshift, EMR, and Elasticsearch.
  • Proficient in working with Amazon EC2 to provide a complete solution for query processing, computing, and storage across a wide range of applications.
  • Expertise in Apache Spark clusters and in processing streaming data with Spark Streaming.
  • Expertise in developing MapReduce jobs in Python for data cleaning and preprocessing.
  • Proficient in data manipulation using Python scripts and experienced in writing Python scripts for system management.
  • Highly capable of creating and monitoring Hadoop clusters on Hortonworks Data Platform, MapR, and CDH5 with Cloudera Manager on Ubuntu.
  • Proficient in transferring data between a Hadoop ecosystem and structured data storage in an RDBMS such as MySQL, Teradata, Oracle, and DB2 using Sqoop.
  • Highly capable of writing Python scripts to design ETL pipelines and Directed Acyclic Graph (DAG) workflows using Apache NiFi and Airflow.
  • Expertise in Python scripting with Pandas and NumPy for structuring data and Matplotlib for visualization.
  • Proficient in working with ETL methods for data extraction, transformation, and loading in corporate-wide ETL solutions and data warehouse tools for reporting and data analysis.
  • Designed Airflow workflows for efficiently collecting, aggregating, and moving large amounts of log data.
  • Profound experience implementing Airflow DAGs in Python to configure and manage data flows (see the sketch after this list).
  • Real-time experience with ad-hoc queries, aggregation, replication, indexing, and load balancing in DynamoDB.
  • Expertise in writing Pig Latin and Hive scripts, extending their functionality with User Defined Functions (UDFs), and maintaining data layouts in Hive using partitioning and bucketing.
  • Proficient in developing data visualizations from diverse sources using Tableau.
  • Good experience in writing queries for retrieving data from and ingesting data into the Redshift warehouse.
  • Expertise in performance optimization techniques such as using the distributed cache for small datasets, map-side joins, and bucketing and partitioning in Hive.
  • Multiple years of experience with Azure Cloud, Azure Data Lake Storage, Azure Data Factory, Azure Synapse Analytics, big data technologies (Apache Spark), Azure analytics services, and Databricks.
  • Profound experience in Azure Data Factory V2 with a range of data sources, processing data using pipelines, pipeline parameters, activities, activity parameters, and manual, window-based, and event-based task scheduling.
  • Developed connectivity from Azure to an on-premises data center using Azure ExpressRoute for single- and multi-subscription setups.
  • Expertise in Hadoop cluster architecture and its key concepts: parallel processing, distributed file systems, high availability, scalability, and fault tolerance.
  • Extensive knowledge of data transformation, mapping, cleansing, monitoring, debugging, performance tuning, and troubleshooting of Hadoop clusters.
  • Real-time experience with visualization tools such as Power BI and Tableau.
  • Excellent interpersonal, communication, and problem-solving skills; a team player able to quickly adapt to new technologies and environments.
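
As context for the Airflow bullets above, a minimal sketch of a two-task DAG is shown below; the DAG id, task names, and the extract/load callables are hypothetical placeholders rather than code from any specific engagement.

  # Minimal Airflow DAG sketch; the callables and names below are hypothetical.
  from datetime import datetime, timedelta

  from airflow import DAG
  from airflow.operators.python import PythonOperator

  def extract_source_data(**context):
      # Placeholder: pull one day's worth of records from the source system.
      print(f"extracting data for {context['ds']}")

  def load_to_warehouse(**context):
      # Placeholder: load the extracted records into the warehouse.
      print(f"loading data for {context['ds']}")

  with DAG(
      dag_id="daily_etl_pipeline",
      start_date=datetime(2023, 1, 1),
      schedule_interval="@daily",
      catchup=False,
      default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
  ) as dag:
      extract = PythonOperator(task_id="extract", python_callable=extract_source_data)
      load = PythonOperator(task_id="load", python_callable=load_to_warehouse)

      extract >> load  # run the load only after the extract succeeds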

TECHNICAL SKILLS

Technologies: Hadoop, MapReduce, HDFS, Hive, Pig, Spark, Impala, Nifi, Ambari, Sentry, Sqoop, Ranger, Oozie, Zookeeper, Flume, Cloudera, Hortonworks, & Snowflake

RDBMS/Databases: HBase, MySQL, Oracle, SQL Server, MongoDB, DB2, Teradata

AWS: EC2, S3, EMR, VPC, Elastic Load Balancing, CloudFront, CloudWatch, SQS, Lambda, Kinesis Streams, Route 53, CloudTrail, and IAM

Azure: Azure Synapse Analytics, Azure Data Lake Storage, Azure Data Factory, Azure Stream Analytics, Azure Databricks, Azure Log Analytics, and Azure Blob Storage

Programming Languages: Python, PySpark, Scala, SQL

Methodologies: Agile, Waterfall

Data Warehousing: Redshift, Azure SQL Data Warehouse

Version Control: Git, SVN.

Reporting Tools: Crystal Reports, Tableau, QuickSight, Power BI

PROFESSIONAL EXPERIENCE

Confidential | Irving, Texas

AWS Data Engineer

Responsibilities:

  • Developed AWS pipelines by extracting customer data from various data sources into Hadoop HDFS, including data from Excel, Oracle, flat files, SQL Server, server log data, and Teradata.
  • Created data pipelines for gathering, cleaning, and transforming data using Spark and Hive, and used Spark Streaming APIs to build the common learner data model, which receives data from AWS Kinesis in real time and persists it.
  • Wrote and executed Spark code using Spark SQL and Scala for faster testing and processing of data, transforming it with Spark Context, pair RDDs, Spark SQL, and Spark on YARN.
  • Worked on Elasticsearch, Logstash, and Kibana (ELK stack) for centralized logging and analytics in the continuous delivery pipeline, storing logs and metrics in an S3 bucket using a Lambda function.
  • Developed a Lambda function that is triggered whenever the monitored log files change.
  • Executed Python scripts to automate AWS services including CloudFront, ELB, Lambda, database security, and application configuration, and developed scripts to back up EBS volumes using CloudWatch and AWS Lambda (see the sketch after this list).
  • Built a data pipeline in AWS using AWS Glue to pull data from weblogs and store it in HDFS.
  • Worked with distributed frameworks such as Apache Spark and Presto on Amazon EMR alongside Redshift.
  • Identified query duplication, dependencies, and complexity to reduce migration effort. Technology stack: AWS Cloud, Oracle, and DynamoDB.
  • Handled troubleshooting and optimization, integrated test cases into the CI/CD pipeline using Docker images, and implemented continuous integration and deployment (CI/CD) for Hadoop jobs through Jenkins.
  • Built a Spark Streaming application to extract data from the cloud into Hive tables.
  • Used Spark SQL to process the vast amount of structured data and executed programs in Python using Spark.
  • Analyzed and optimized RDDs by controlling partitions for the given data, and executed business analytics scripts using Hive SQL.
  • Integrated AWS Kinesis streaming with the on-premises cluster and wrote automation scripts in Python to manage and deploy applications.
  • Built Hadoop jobs for analyzing data using Pig and Hive, accessing sequence files, text files, and Parquet files.
  • Converted Hive/SQL queries into Spark transformations with the help of Python, Spark RDDs, and Scala.
  • Worked with Spark Streaming through the core Spark API in Scala to transform raw data into baseline data.
  • Executed Python programs with a variety of packages such as Matplotlib, NumPy, & Pandas.
  • Executed SQL scripts and tuned them for better performance using PySpark SQL.
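
A minimal sketch of the EBS backup automation referenced above, written as a Python Lambda handler meant to run on a CloudWatch/EventBridge schedule; the Backup=true tag filter is an assumed convention, not a detail from the original project.

  # Scheduled Lambda sketch that snapshots tagged EBS volumes.
  # The Backup=true tag filter is an assumed convention for illustration.
  import datetime

  import boto3

  ec2 = boto3.client("ec2")

  def lambda_handler(event, context):
      # Find volumes explicitly marked for backup.
      volumes = ec2.describe_volumes(
          Filters=[{"Name": "tag:Backup", "Values": ["true"]}]
      )["Volumes"]

      for volume in volumes:
          description = "automated backup {} {}".format(
              volume["VolumeId"], datetime.datetime.utcnow().isoformat()
          )
          # Create the snapshot; the schedule rule invokes this handler daily.
          ec2.create_snapshot(VolumeId=volume["VolumeId"], Description=description)

      return {"snapshots_created": len(volumes)}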

Environment: Hadoop, YARN, Spark, Pig, Hive, SQL, PySpark, Python, Chef, AWS Lambda, AWS S3, Snowflake Database, AWS EMR, DynamoDB, Redshift, Kinesis, HBase, NoSQL, Sqoop, MySQL, Docker, Data Warehouse, and ETL.

Confidential | Dallas, Texas

Azure Data Engineer

Responsibilities:

  • Performed data transfer from on-premises SQL servers to cloud databases (Azure SQL DB and Azure Synapse Analytics (DW)).
  • Constructed pipelines in Azure Data Factory using datasets, linked services, and pipelines to extract, transform, and load data from various sources, including Azure SQL, Blob Storage, and Azure SQL Data Warehouse, as well as writing data back in the reverse direction.
  • Built CI/CD pipelines using Azure DevOps.
  • Worked with Azure Data Factory, Spark SQL, T-SQL, and U-SQL (Azure Data Lake Analytics) to gather, transform, and load data from source systems into Azure data storage services.
  • Loaded structured and semi-structured data into Spark clusters using Spark SQL and the DataFrames API.
  • Built Spark applications with Azure Data Factory and Spark SQL for data extraction, transformation, and aggregation across multiple file formats to analyze the data and reveal insights into consumer usage patterns.
  • Performed ingestion of data into one or more Azure Services (Azure Storage, Azure Data Lake, Azure DW, Azure SQL) and processing of data in Azure Databricks.
  • Executed the SQL scripts using PySpark SQL.
  • Worked on Logic Apps to take decision-based actions within workflows.
  • Worked on integration of data storage options with Spark, notably with Blob storage and Azure Data Lake Storage.
  • Used Kubernetes to handle the scaling, deployment, and management of Docker containers.
  • Executed a proof of concept for Azure implementation, with a larger aim of transferring on-premises servers and data to the cloud.
  • Designed and built reusable data extraction, transformation, and loading processes by developing Azure Synapse pipelines.
  • Analyzed data quality issues using SnowSQL by developing analytical warehouses on Snowflake.
  • Built UDFs in PySpark to meet specific business requirements (see the sketch after this list).
  • Executed Hive queries to analyze large data sets of unstructured, structured, and semi-structured data.
  • Worked with structured data in Hive to improve performance using techniques including bucketing, partitioning, and optimized self-joins.
  • Worked on Azure copy operations to load data from an on-premises SQL Server into Azure SQL Data Warehouse.
  • Implemented data validation and aggregation on Azure HDInsight using Spark scripts written in Python.
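
A minimal sketch of a PySpark UDF of the kind described above; the masking rule and the column names are illustrative assumptions, not the actual business logic.

  # PySpark UDF sketch; masking rule and column names are illustrative only.
  from pyspark.sql import SparkSession
  from pyspark.sql.functions import udf
  from pyspark.sql.types import StringType

  spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

  @udf(returnType=StringType())
  def mask_email(email):
      # Keep the domain, hide most of the local part (e.g. "a***@example.com").
      if email is None or "@" not in email:
          return None
      local, domain = email.split("@", 1)
      return f"{local[0]}***@{domain}"

  df = spark.createDataFrame(
      [("alice@example.com",), ("bob@example.org",)], ["email"]
  )
  df.withColumn("masked_email", mask_email("email")).show(truncate=False)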

Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, SQL Server, Windows Remote Desktop, Azure PowerShell, Databricks, Python, Kubernetes, Azure SQL Server, Azure Data Warehouse.

Confidential | Atlanta, Georgia

Sr. Hadoop Developer

Responsibilities:

  • Developed MapReduce jobs to copy data on HDFS, maintain job flows on EC2 servers, and load and transform large sets of structured, semi-structured, and unstructured data.
  • Built a process for Sqooping data from sources such as Oracle, SQL Server, and Teradata, and was responsible for developing source-to-destination field mapping documents.
  • Executed shell scripts to create staging and landing tables with the same schema as the source and to generate the properties used by Oozie jobs.
  • Maintained Oozie workflows for executing Sqoop and Hive actions and worked with NoSQL databases such as HBase, developing HBase tables to load large sets of semi-structured data coming from multiple sources.
  • Performed performance optimization of Spark/Scala jobs.
  • Executed Python wrapper scripts that extract a specific date range using Sqoop by passing properties generated from the workflow requirements (see the sketch after this list).
  • Wrote scripts to run Oozie workflows so that they capture the logs of all jobs running on the cluster and build a metadata table containing the execution time of each job.
  • Pulled feeds from social media sites such as Facebook and Twitter using Python scripts, and designed Spark Streaming jobs to receive real-time data from AWS Kinesis and store the stream data in HDFS.
  • Created a data lake by extracting customer big data from multiple data sources into Hadoop HDFS, including data from Excel, Oracle, flat files, SQL Server, HBase, MongoDB, Teradata, and server log data.
  • Worked on MapReduce (YARN) jobs for accessing, cleaning, and conforming the data.
  • Wrote Python scripts around user-defined functions using both RDD/MapReduce and DataFrames/SQL in Spark for data aggregation and querying, writing data back into the RDBMS through Sqoop.
  • Performed data synchronization between EC2 and S3, data profiling, and Hive stand-up.
  • Implemented a log producer in Scala that monitors application logs, transforms incremental logs, and sends them to an AWS Kinesis and Zookeeper based log collection platform.
  • Wrote Hive scripts to perform transformation logic and load data from the staging zone to the final landing zone.
  • Evaluated the Parquet file format for better performance and storage when publishing tables, and was involved in loading transactional data into the Hadoop distributed file system using Flume for fraud analytics.
  • Developed a Python utility to validate Hadoop distributed file system tables against source tables.
  • Designed and developed UDFs to extend functionality in both Pig and Hive.
  • Built various AWS Kinesis producers and consumers from scratch as per the software requirement specifications.
  • Ran MapReduce jobs for data cleanup, conforming, and ETL, and wrote Hive/Impala queries for ad-hoc reporting, summarization, and ETL.
  • Developed Spark code using Spark SQL and Scala for faster testing and processing of data, tuning it with Spark SQL, pair RDDs, Spark Context, and Spark on YARN.
  • Migrated data from Oracle and MySQL into HDFS using Sqoop and imported various formats of flat files into HDFS.
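
A minimal sketch of the kind of Python wrapper around a date-range Sqoop import mentioned above; the JDBC connection string, table, and column names are hypothetical, and credentials are read from a password file rather than hard-coded.

  # Python wrapper sketch that builds and runs a Sqoop import for a date range.
  # Connection details, table, and column names are hypothetical placeholders.
  import subprocess
  import sys

  def sqoop_import(table, date_column, start_date, end_date, target_dir):
      command = [
          "sqoop", "import",
          "--connect", "jdbc:oracle:thin:@//dbhost:1521/ORCL",  # hypothetical source
          "--username", "etl_user",
          "--password-file", "/user/etl/.sqoop_password",       # avoid plain-text passwords
          "--table", table,
          "--where", f"{date_column} >= '{start_date}' AND {date_column} < '{end_date}'",
          "--target-dir", target_dir,
          "--num-mappers", "4",
          "--as-parquetfile",
      ]
      # Fail the wrapper if Sqoop exits with a non-zero status.
      subprocess.run(command, check=True)

  if __name__ == "__main__":
      # e.g. python sqoop_wrapper.py 2020-01-01 2020-01-08
      start, end = sys.argv[1], sys.argv[2]
      sqoop_import("POLICY_TXN", "TXN_DATE", start, end, f"/data/staging/policy_txn/{start}")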

Environment: Hadoop, HDFS, MapReduce, Hive, HBase, AWS Kinesis, Zookeeper, Oozie, Impala, Oracle, SQL, Teradata, SQL Server, Python, UNIX Shell Scripting, ETL, Flume, Scala, Spark, Spark Streaming, Sqoop, MySQL, Hortonworks, YARN.

Confidential

Hadoop Developer

Responsibilities:

  • Analyzed large data sets and derived customer usage patterns by developing new MapReduce programs.
  • Created Hive tables and loaded data into them incrementally using dynamic partitioning (see the sketch after this list), and worked with Avro files and JSON records.
  • Communicated with application teams to install operating system patches, Hadoop updates, and version upgrades as needed.
  • Created various internal and external tables in Hive and loaded them with data by executing Hive queries.
  • Performed offline analysis on HDFS and sent the results to MongoDB databases to update the information in the existing tables.
  • Migrated data from Hadoop to MongoDB using Hive and MapReduce, connecting through the Mongo-Hadoop connectors.
  • Extracted log files from multiple sources into HDFS using Flume.
  • Worked with Storm, Flume, and Spark; extracted large amounts of structured data from relational databases using Sqoop import, processed it with Spark, and stored the data in HDFS in CSV format.
  • Used TOAD to develop and run SQL queries.
  • Developed dashboards with conversion KPIs for the leadership team, and was involved in building and scheduling dashboards that give middle management a snapshot of the business.
  • Analyzed business requirements and developed a logical data model describing all data and the relationships between them using SQL.
  • Designed workflows in Oozie to automate the loading of data into HDFS and its pre-processing with Hive and Pig.
  • Developed dashboards on Tableau Server and generated reports for Hive tables in different scenarios using Tableau.
  • Monitored ActiveBatch and cron jobs and was involved in JAR builds triggered by commits to GitHub using Jenkins.
  • Studied new data tagging tools such as Tealium (POC report).
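
A minimal sketch of an incremental load into a dynamically partitioned Hive table, issued through PySpark's spark.sql; the database, table, and column names are illustrative assumptions.

  # Incremental load into a dynamically partitioned Hive table via Spark SQL.
  # Database, table, and column names are illustrative placeholders.
  from pyspark.sql import SparkSession

  spark = (
      SparkSession.builder
      .appName("hive-dynamic-partition-load")
      .enableHiveSupport()
      .getOrCreate()
  )

  # Let Hive derive partition values from the data itself.
  spark.sql("SET hive.exec.dynamic.partition=true")
  spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

  spark.sql("""
      CREATE TABLE IF NOT EXISTS analytics.web_events (
          user_id  STRING,
          event    STRING,
          event_ts TIMESTAMP
      )
      PARTITIONED BY (event_date STRING)
      STORED AS PARQUET
  """)

  # Append only the current day's records; the partition column comes last.
  spark.sql("""
      INSERT INTO TABLE analytics.web_events PARTITION (event_date)
      SELECT user_id, event, event_ts, CAST(to_date(event_ts) AS STRING) AS event_date
      FROM staging.web_events_raw
      WHERE to_date(event_ts) = current_date()
  """)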

Environment: Hadoop, HDFS, MapReduce, Pig, HBase, Hive, Zookeeper, Impala, Oozie, Cloudera, Windows NT, Oracle 11g/10g, UNIX, Python, Shell Scripting, Tealium, Tableau, SQL.

Confidential

Data Analyst

Responsibilities:

  • Designed, developed, and tested sessions, mappings, and workflows to transfer data from the policy center to the BIC (Business Intelligence Center).
  • Designed a solution to determine at which stage of the policy life cycle an underwriting issue occurred.
  • Performed root cause analysis for the cancellation of policy using SAS and SQL.
  • Handled requirements gathering, user interviews, and the analysis and prioritization of the product backlog.
  • Built use cases, flow diagrams, and business functional requirements for Scrum.
  • Built an interactive cohort analysis report in Tableau (see the sketch after this list).
  • Created forecasts using trend lines, parameters, and reference lines.
  • Implemented security guidelines by using user filters and row-level security.
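
The cohort report itself was built in Tableau; purely as an illustration, the pandas sketch below prepares the kind of cohort retention table that typically feeds such a report, with hypothetical file and column names.

  # Illustrative pandas sketch of a cohort retention table; the source file
  # and column names are hypothetical, not from the original project.
  import pandas as pd

  orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

  # Cohort = month of a customer's first order; period = months since that cohort.
  orders["order_month"] = orders["order_date"].dt.to_period("M")
  orders["cohort_month"] = orders.groupby("customer_id")["order_month"].transform("min")
  orders["period_number"] = (orders["order_month"] - orders["cohort_month"]).apply(lambda d: d.n)

  # Count distinct active customers per cohort and period, then pivot.
  cohort_counts = (
      orders.groupby(["cohort_month", "period_number"])["customer_id"]
      .nunique()
      .unstack(fill_value=0)
  )

  # Retention rate relative to each cohort's initial size (period 0).
  retention = cohort_counts.divide(cohort_counts[0], axis=0)
  print(retention.round(2))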

Environment: Data Warehousing, Python/R, Snowflake, Redshift, Data Visualization (SAS/Tableau), Data Science Research Methods (Power BI), Statistical Computing Methods, Experimental Design & Analysis, JSON, SQL, PowerShell, Git, and GitHub.
