
AWS Data Engineer Resume


Minneapolis, MN

PROFESSIONAL SUMMARY:

  • Over 9 years of experience as a Big Data Engineer with expertise in the Hadoop ecosystem and AWS cloud services.
  • Hands-on experience building the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources, including NoSQL, SQL, AWS, and Big Data technologies (DynamoDB, Kinesis, S3, Hive/Spark).
  • Strong understanding of AWS (Amazon Web Services), including S3, Amazon RDS, IAM, EC2, Redshift, and Apache Spark RDD concepts, and experience developing logical data architecture with adherence to enterprise architecture.
  • Responsible for developing data pipelines on AWS to extract data from weblogs and store it in Amazon EMR.
  • Integrated the Snowflake cloud data warehouse with AWS S3 buckets from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
  • Experience with AWS EC2, configuring servers for Auto Scaling and Elastic Load Balancing.
  • Migrated an existing on-premises application to AWS, using services such as EC2 and S3 for processing and storage of small data sets; experienced in maintaining Hadoop clusters on AWS EMR.
  • Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline that writes to the Glue Data Catalog and can be queried from Athena.
  • Proven expertise in deploying major software solutions for various high-end clients to meet business requirements such as big data processing, ingestion, analytics, and on-premises-to-AWS cloud migration using AWS EMR, S3, and DynamoDB.
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
  • Demonstrated understanding of the Fact/Dimension data warehouse design model, including star and snowflake design methods.
  • Designed and developed logical and physical data models that utilize concepts such as Star Schema, Snowflake Schema, and Slowly Changing Dimensions.
  • Hands-on experience across the Hadoop ecosystem, including extensive experience with Big Data technologies such as HDFS, MapReduce, YARN, Apache Cassandra, HBase, Hive, Oozie, Impala, Pig, ZooKeeper, Flume, Kafka, Sqoop, and Spark.
  • Built real-time data pipelines by developing Kafka producers and Spark Streaming applications for consumption.
  • Experienced with Spark in improving the performance and optimizing existing algorithms in Hadoop using SparkContext, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs; worked extensively with PySpark.
  • Familiar with data processing performance optimization techniques such as dynamic partitioning, bucketing, file compression, and cache management in Hive, Impala, and Spark.
  • Experience in dimensional data modeling concepts such as star join schema modeling, snowflake modeling, fact and dimension tables, and physical and logical data modeling.
  • Created partitioned and bucketed Hive tables in Parquet file format with Snappy compression, and loaded data into the Parquet Hive tables from Avro Hive tables (a minimal sketch follows this list).
  • Experience in designing and implementing RDBMS tables, views, user-defined data types, indexes, stored procedures, cursors, triggers, and transactions.
  • Strong team player with the ability to work independently as well as in a team, the capacity to adapt to a quickly changing environment, and a dedication to learning; good communication, project management, documentation, and interpersonal skills.
  • Experience in creating and maintaining reporting and analytics infrastructure for internal business clients using AWS services such as Athena, Redshift, Redshift Spectrum, EMR, and QuickSight.
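
To illustrate the partitioned and bucketed Parquet Hive tables mentioned above, the following is a minimal PySpark sketch. It assumes a Hive-enabled Spark session and the spark-avro package on the classpath; the database, table, and column names are placeholders rather than actual project objects.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("avro-to-parquet-hive")
             .enableHiveSupport()   # needed for saveAsTable against the Hive metastore
             .getOrCreate())

    # Read the staging Avro-backed Hive table (hypothetical name).
    events = spark.table("staging_db.events_avro")

    # Write a partitioned, bucketed Parquet table with Snappy compression.
    (events.write
        .partitionBy("event_date")        # Hive-style partition column
        .bucketBy(16, "customer_id")      # bucketing to speed up joins and filters
        .sortBy("customer_id")
        .format("parquet")
        .option("compression", "snappy")
        .mode("overwrite")
        .saveAsTable("analytics_db.events_parquet"))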

TECHNICAL SKILLS:

Big Data Ecosystems: Hadoop, MapReduce, HDFS, HBase, Hive, Sqoop, Cloudera, Yarn, Oozie, Storm, Flume.

Cloud Platform: AWS, EC2, S3, SQS, Lambda, Docker, EMR, Redshift.

Streaming Technologies: Storm, Spark Streaming.

Languages: Python, PySpark, Shell scripting (Bash), PL/SQL, R, Java, SQL, JavaScript, HTML, CSS.

Databases and Data Warehouses: Cassandra, NoSQL (MongoDB), Oracle, HBase, Snowflake, MySQL, PostgreSQL.

Tools: PyCharm, Jupyter Notebook, MS Visual Studio, Microsoft Azure HDInsight, Microsoft Hadoop cluster, JIRA, NetBeans, Eclipse.

Operating Systems: Unix/Linux, Windows.

PROFESSIONAL EXPERIENCE:

Confidential, Minneapolis, MN.

AWS Data Engineer

Responsibilities:

  • Evaluated business requirements and prepared detailed specifications, following project guidelines, for the programs to be developed.
  • Created the automated build and deployment process for the application and improved the application setup for a better user experience, leading up to building a continuous integration system.
  • Developed a PySpark script to encrypt raw data by applying hashing algorithms to client-specified columns (a minimal sketch appears after this list).
  • Used AWS Glue for data transformation, validation, and cleansing.
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Worked on migrating MapReduce programs into Spark transformations using Python.
  • Worked on Spark data sources (Hive, JSON files), Spark DataFrames, Spark SQL, and Spark Streaming using Python.
  • Used Python Boto3 to configure AWS services such as Glue, EC2, and S3.
  • Developed Spark scripts by writing custom RDD transformations in Python and performing actions on the RDDs.
  • Involved in designing and optimizing Spark SQL queries and DataFrames, importing data from data sources, performing transformations, and storing the results to an output directory in AWS S3.
  • Created AWS Glue jobs and catalog tables, both via crawlers and through manual database/table creation.
  • Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS (a streaming sketch follows the Environment line for this role).
  • Created Spark jobs to apply data cleansing and validation rules to new source files in the inbound bucket and route rejected records to a reject-data S3 bucket.
  • Involved in converting Hive/SQL queries into transformations using Python and performed complex joins on Hive tables with various optimization techniques.
  • Designed and implemented data loading and aggregation frameworks and jobs capable of handling hundreds of GB of JSON files, using Spark, Airflow, and Snowflake.
  • Wrote PySpark jobs in AWS Glue to merge data from multiple tables and utilized crawlers to populate the AWS Glue Data Catalog with metadata table definitions.
  • Used Kubernetes to orchestrate the deployment, scaling and management of Docker Containers.
  • Developed Kafka producers and consumers, HBase clients, and Spark jobs using Python, along with components on HDFS and Hive.
  • Created a Kafka producer API to send live-stream data to various Kafka topics and developed Spark Streaming applications to consume the data from those topics and insert the processed streams into HBase.
  • Extracted, transformed and loaded data from various heterogeneous data sources and destinations using AWS.
  • Hands-on experience architecting the ETL transformation layers and writing Spark jobs to perform the processing.
  • Worked in a production environment, building a CI/CD pipeline with Jenkins whose stages range from code checkout from GitHub to deploying code in the target environment.
  • Developed AWS CloudFormation templates, set up Auto Scaling for EC2 instances, and was involved in automated provisioning of the AWS cloud environment using Jenkins.
  • Created automated pipelines in AWS CodePipeline to deploy Docker containers to AWS ECS using S3.
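
The column-hashing work above could look roughly like the following PySpark sketch. The bucket paths and column list are hypothetical, and SHA-256 via the built-in sha2 function is shown as one possible hashing choice rather than the project's exact algorithm.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, sha2

    spark = SparkSession.builder.appName("hash-sensitive-columns").getOrCreate()

    # Columns the client asked to protect (illustrative list).
    SENSITIVE_COLUMNS = ["ssn", "email", "phone_number"]

    raw = spark.read.parquet("s3://example-raw-bucket/input/")   # placeholder path

    hashed = raw
    for c in SENSITIVE_COLUMNS:
        # Replace each sensitive column with its SHA-256 digest.
        hashed = hashed.withColumn(c, sha2(col(c).cast("string"), 256))

    hashed.write.mode("overwrite").parquet("s3://example-curated-bucket/hashed/")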

Environment: Spark, Hive, HBase, Sqoop, Flume, ADF, Blob, Cosmos DB, AWS Glue, MapReduce, HDFS, Cloudera, SQL, Apache Kafka, AWS, S3, Kubernetes, Python, Unix.
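
For the Kafka-to-HDFS streaming described in this role, a Structured Streaming version might be sketched as follows. It assumes the spark-sql-kafka connector is on the classpath; the broker address, topic name, and HDFS paths are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Subscribe to a Kafka topic (hypothetical broker and topic).
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")
              .option("subscribe", "weblogs")
              .option("startingOffsets", "latest")
              .load()
              .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp"))

    # Land the stream on HDFS as Parquet, checkpointing for fault tolerance.
    query = (stream.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/weblogs/")
             .option("checkpointLocation", "hdfs:///checkpoints/weblogs/")
             .trigger(processingTime="1 minute")
             .start())
    query.awaitTermination()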

Confidential, Atlanta, GA.

AWS Data Engineer

Responsibilities:

  • Used AWS Athena extensively to ingest structured data from S3 into other systems such as RedShift or to produce reports.
  • Used the Spark Streaming APIs to perform on-the-fly transformations and actions for building the common learner data model, which receives data from Kinesis in near real time.
  • Designed and developed an entire module called CDC in Python and deployed it in AWS Glue using the PySpark library.
  • Performed end-to-end architecture and implementation assessment of various AWS services such as Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.
  • Used Hive as the primary query engine on EMR and built external table schemas for the data being processed.
  • Created an AWS RDS (Relational Database Service) instance to serve as the Hive metastore, making it possible to integrate the metadata from 20 EMR clusters into a single RDS instance and avoid data loss even if an EMR cluster was terminated.
  • Wrote and executed several complex SQL queries in AWS Glue for ETL operations on Spark data using Spark SQL.
  • Involved in the development of a shell script that collects and stores user-created logs in AWS S3 (Simple Storage Service) buckets; this provides a record of all user actions and is a useful security indicator for detecting cluster termination and safeguarding data integrity.
  • Implemented partitioning and bucketing in the Apache Hive database, which improves query retrieval performance.
  • Using AWS Glue, designed and deployed ETL pipelines on S3 Parquet files in a data lake.
  • Created a CloudFormation template in JSON format to leverage content delivery with cross-region replication through Amazon Virtual Private Cloud.
  • Used an AWS CodeCommit repository to store programming logic and scripts and then replicate them to new clusters.
  • Transformed the data using AWS Glue dynamic frames with PySpark, cataloged the transformed data using crawlers, and scheduled the job and crawler using the Glue workflow feature (a skeleton Glue job appears after this list).
  • Used multi-node Redshift technology to implement columnar data storage, advanced compression, and massively parallel processing.
  • Worked with the Snowflake cloud data warehouse and AWS S3 buckets to integrate data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
  • Worked on transferring the code of a quality monitoring program from AWS EC2 to AWS Lambda, as well as creating logical datasets to administer quality monitoring on Snowflake warehouses.
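
A skeleton of the kind of Glue PySpark job described above (a dynamic-frame read from the Data Catalog, a mapping transform, and partitioned Parquet written back to S3) is shown below. It only runs inside an AWS Glue job environment, and the database, table, column, and bucket names are hypothetical.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source table registered in the Glue Data Catalog by a crawler (placeholder names).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="orders_raw")

    # Select and rename columns on the way through.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("order_id", "string", "order_id", "string"),
                  ("order_ts", "string", "order_date", "string"),
                  ("amount", "double", "amount", "double")])

    # Write partitioned Parquet back to S3 (placeholder bucket).
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-curated/orders/",
                            "partitionKeys": ["order_date"]},
        format="glueparquet")
    job.commit()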

Environment: Amazon Web Services, Elastic MapReduce (EMR) cluster, EC2, AWS Glue, CloudFormation, Amazon S3, Amazon Redshift, DynamoDB, CloudWatch, Hive, Scala, Python, HBase, Apache Spark, Spark SQL, Shell Scripting, Tableau, Cloudera.

Confidential

Data Engineer

Responsibilities:

  • Responsible for the design, implementation, and architecture of very large-scale data intelligence solutions around big data platforms.
  • Analyzed large and critical datasets using HDFS, HBase, Hive, HQL, Pig, Sqoop and Zookeeper.
  • Developed multiple POCs using Spark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
  • Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as the storage mechanism.
  • Used AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
  • Worked on SQL queries in dimensional data warehouses and relational data warehouses. Performed Data Analysis and Data Profiling using Complex SQL queries on various systems.
  • Troubleshot and resolved data processing issues and proactively engaged in data modeling discussions.
  • Worked on RDD architecture, implementing Spark operations on RDDs and optimizing transformations and actions in Spark.
  • Wrote programs in Spark using Python (PySpark) packages for performance tuning, optimization, and data quality validations.
  • Worked on Google Cloud Platform (GCP) services such as Compute Engine, Cloud Load Balancing, Cloud Storage, Cloud SQL, Stackdriver Monitoring, and Cloud Deployment Manager.
  • Developed Kafka producers and consumers for streaming millions of events per second of streaming data (a minimal producer sketch follows this list).
  • Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka.
  • Hands-on experience fetching live stream data from UDB into HBase tables using PySpark streaming and Apache Kafka.
  • Built servers on GCP within the defined virtual private cloud, using auto-scaling and load balancers.
  • Evaluated Snowflake design considerations for any change in the application.
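
As a rough illustration of the Kafka producer work above, here is a minimal sketch using the kafka-python client; the broker address, topic name, and payload are placeholders, and the project may have used a different client library.

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],      # hypothetical broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        acks="all",                              # wait for full acknowledgement
        linger_ms=5)                             # small batching window for throughput

    event = {"user_id": 42, "action": "click", "ts": "2021-06-01T00:00:00Z"}
    producer.send("clickstream-events", value=event)   # hypothetical topic
    producer.flush()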

Environment: HDFS, Python, SQL, Spark, Scala, Kafka, Hive, Yarn, Sqoop, Snowflake, Tableau, AWS Cloud, GitHub, Shell Scripting.

Confidential

Data Engineer

Responsibilities:

  • Involved in developing a roadmap for migrating enterprise data from multiple data sources, such as SQL Server and provider databases, into S3, which serves as a centralized data hub across the organization.
  • Loaded and transformed large sets of structured and semi structured data from various downstream systems.
  • Developed ETL pipelines using Spark and Hive for performing various business specific transformations.
  • Built applications and automated pipelines in Spark for bulk loads as well as incremental loads of various datasets (a simplified incremental-load sketch appears after the Environment line below).
  • Automated the data pipeline to ETL all the datasets, covering both full and incremental loads.
  • Utilized AWS services such as EMR, S3, the Glue metastore, and Athena extensively for building the data applications.
  • Worked on building input adapters for data dumps from FTP servers using Apache Spark.
  • Wrote Spark applications to perform operations such as data inspection, cleansing, loading, and transformation of large sets of structured and semi-structured data.
  • Developed Spark with Scala and Spark-SQL for testing and processing of data.
  • Made Spark job statistics reporting, monitoring, and data quality checks available for each dataset.

Environment: AWS Cloud Services, Apache Spark, Spark SQL, Snowflake, Unix, Kafka, Scala, SQL Server.
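
The bulk/incremental load pattern mentioned above might be sketched as follows. It assumes the SQL Server JDBC driver is on the Spark classpath; the connection details, watermark handling, table names, and paths are simplified placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("incremental-load").getOrCreate()

    # In practice the watermark would come from a control/audit table; hard-coded here.
    last_watermark = "2021-01-01 00:00:00"

    # Pull the source table over JDBC (placeholder connection details).
    source = (spark.read.format("jdbc")
              .option("url", "jdbc:sqlserver://example-host:1433;databaseName=provider_db")
              .option("dbtable", "dbo.claims")
              .option("user", "etl_user")
              .option("password", "REPLACE_ME")
              .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
              .load())

    # Keep only rows changed since the last successful run, then append to the S3 hub.
    incremental = source.filter(col("modified_ts") > last_watermark)
    incremental.write.mode("append").parquet("s3://example-data-hub/claims/")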
