
Sr AWS Big Data Engineer Resume

VA

SUMMARY

  • 8+ years of experience as a Big Data Engineer with expertise in designing data-intensive applications using the Hadoop ecosystem, Big Data analytics, cloud data engineering, data warehouse, and data quality solutions.
  • Good experience in developing data pipelines involving Big Data Technologies, AWS Services, and Object-oriented analysis along with several DevOps tools like Docker, Kubernetes, Artifactory, Jenkins, and GitHub.
  • Experienced in installation, configuration, administration, troubleshooting, tuning, security, backup, recovery and upgrades of operating systems in a large environment.
  • Managed applications in AWS and familiar with its SDKs and core services, including the AWS SDK for Java, Python (Boto3), AWS Glue with PySpark, EC2, IAM, S3, Lambda, Kinesis, EMR, Step Functions, and EventBridge.
  • Practical experience in data generation, data masking, data subsetting, data archiving, data virtualization, data modeling, and database development.
  • Implemented infrastructure-as-code architecture in AWS environments using Terraform and AWS CloudFormation templates.
  • Designed an analytical layer on top of the AWS S3 data lake by creating Glue Catalog tables and worked with data scientists to query the tables using AWS Athena (a query sketch follows this summary).
  • Implemented Spark applications in a Spark-on-Kubernetes environment and deployed them to a Kubernetes cluster dedicated to Spark jobs using Skaffold.
  • Worked with the Java Spring Batch framework for the ETL process in a serverless batch application.
  • Participated in Proofs of Concept (POCs) to evaluate new technologies for integration into the data pipelines and documented the findings.
  • Worked extensively with the Java Hadoop MapReduce framework to integrate, transform, and cleanse data and store it in HDFS.
  • Experience in developing applications with serverless architecture using Lambda, Glue, and various AWS services for orchestration.
  • Worked on developing ETL processes to load data from multiple data sources into HDFS using Sqoop, Pig, and Oozie.
  • Performed root-cause analysis and resolution of application defects.
  • Used Maven for building and deploying application source code to various servers and S3 buckets; used JUnit for unit testing.
  • Processed unstructured data into a form suitable for analysis.
  • Very good understanding of star schema, snowflake schema, normalization (1NF, 2NF, and 3NF), and dimensional modeling.
  • Experience in implementing data pipelines for moving large volumes of data.
  • In-depth knowledge of Hadoop architecture and its components, such as YARN, HDFS, NameNode, DataNode, JobTracker, ApplicationMaster, TaskTracker, and the MapReduce programming paradigm.
  • Performed structural modifications using MapReduce and Hive and analyzed data using visualization/reporting tools.
  • Extensive experience in Hadoop-led development of enterprise-level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, ZooKeeper, and YARN.
  • Ensured defect-free programming by testing and debugging with appropriate tools and participated in peer code reviews.
  • Profound experience in performing data ingestion and data processing (transformations, enrichment, and aggregations).
  • Assembled large, complex data sets that meet functional and non-functional business requirements.
  • Strong knowledge of the architecture of distributed systems and parallel processing; in-depth understanding of the MapReduce programming paradigm and the Spark execution framework.
  • Experienced in using Spark to improve the performance and optimization of existing algorithms in Hadoop via SparkContext, Spark SQL, the DataFrame API, and pair RDDs, and worked extensively with PySpark.
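
A minimal sketch of the Glue Catalog / Athena pattern referenced above: submitting a query against a catalog table over the S3 data lake with Boto3. The database, table, and result-bucket names are illustrative placeholders, not values from this resume.

    # Query a Glue Catalog table over the S3 data lake through Athena via Boto3.
    import time
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    execution = athena.start_query_execution(
        QueryString="SELECT trip_date, COUNT(*) AS trips FROM analytics_db.trips GROUP BY trip_date",
        QueryExecutionContext={"Database": "analytics_db"},  # placeholder database
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query reaches a terminal state, then print the result rows.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    if state == "SUCCEEDED":
        for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
            print([col.get("VarCharValue") for col in row["Data"]])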

TECHNICAL SKILLS

Languages: Java, Python, Perl, and UNIX shell scripting

Frameworks: Hadoop MapReduce, PySpark, Spark Core, Spark SQL, JUnit, pytest, and Java Spring Batch

Databases: Oracle, SQL Server, MongoDB, Redshift, Snowflake, and DynamoDB

Cloud Environment (AWS): Glue, Lambda, Athena, EC2, EMR, S3, IAM Roles, VPC, Subnets, Security Groups, CloudWatch, CloudTrail, Redshift, Auto Scaling Groups, Route 53, ELB, CloudFormation, Kinesis, and EventBridge

Big Data Ecosystem: HDFS, Hive, Sqoop, Pig, ZooKeeper, Flume, and Hue

DevOps Tools: Docker, Kubernetes, Artifactory, Jenkins, Terraform, CloudFormation, and Skaffold

Logging & Monitoring: ELK, Splunk, Kibana, PagerDuty, and VictorOps

Tools & Others: Airflow, SnapLogic, Git, Maven, Tomcat 6.0, JUnit 4.0, Toad, JIRA, and Confluence

PROFESSIONAL EXPERIENCE

Confidential, VA

Sr AWS Big Data Engineer

Responsibilities:

  • Worked on AWS Glue to extract data from external APIs and relational databases into S3 and load it into Redshift (a Glue job sketch follows this list).
  • Built Glue jobs in PySpark, used various PySpark libraries to transform the data per business requirements, and developed Lambda scripts.
  • Worked on an application that extracts data from an external vendor, transforms it with AWS Glue in PySpark, and loads it into the EDW and the EDL, which end users query with Tableau and Athena.
  • Deployed Glue code and other AWS resources using CloudFormation and used the Boto3 libraries to connect to various AWS components.
  • Worked on an application to extract reservation data from DynamoDB and load it into AWS S3 using AWS Glue with PySpark, and performed extensive performance tuning with dynamic frames.
  • Developed Lambda scripts that perform ETL logic against databases and implemented triggering of those scripts from Kinesis message streams (see the Lambda sketch after this list).
  • Worked on loading hierarchical JSON data into AWS OpenSearch through AWS Glue jobs that are consumed by other applications.
  • Set up AWS Database Migration Service (DMS) tasks for full-load and CDC replication to S3.
  • Worked on other AWS services such as Step Functions, EventBridge rules, Athena, and AppFlow.
  • Developed an AWS Lambda in Java, triggered by S3 event files, to load the data into the EDW.
  • Worked on an application to improve the customer return rate by identifying loyal customers with advanced analytics and offering them upgrades.
  • Worked on an application in Scala on AWS Glue to extract ticket data from Oracle and load it as Parquet files on S3.
  • Evaluated EMR and Glue for rewriting an existing legacy ETL job and produced cost and performance benchmarks for the different approaches.
  • Created, debugged, scheduled, and monitored Airflow jobs for ETL batch processing that load data into Snowflake for analytical processes.

Environment: AWS services (Glue, Redshift, Lambda, OpenSearch, CloudFormation, Kinesis), Python, SQL, and PySpark.
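
A minimal sketch of the kind of Glue PySpark job described in the bullets above: read a Glue Catalog table backed by S3, apply a column mapping, and load the result into Redshift. The database, table, connection, and bucket names are illustrative placeholders.

    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source: a Glue Catalog table over raw S3 data (placeholder names).
    source = glue_context.create_dynamic_frame.from_catalog(
        database="raw_db", table_name="vendor_feed"
    )

    # Rename/cast columns to match the target Redshift schema.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("id", "string", "id", "string"),
            ("amount", "double", "amount", "double"),
            ("created_at", "string", "created_at", "timestamp"),
        ],
    )

    # Target: Redshift through a Glue connection, staging via S3.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=mapped,
        catalog_connection="redshift-conn",
        connection_options={"dbtable": "edw.vendor_feed", "database": "edw"},
        redshift_tmp_dir="s3://example-glue-temp/redshift/",
    )
    job.commit()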
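
A minimal sketch of a Kinesis-triggered Lambda of the kind mentioned above, performing lightweight ETL on the incoming records. The bucket name and record layout are assumptions for illustration.

    import base64
    import json

    import boto3

    s3 = boto3.client("s3")


    def handler(event, context):
        """Decode Kinesis records, apply a simple transformation, and land them in S3."""
        rows = []
        for record in event["Records"]:
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            # Example transformation: normalize field names and drop empty values.
            rows.append({k.lower(): v for k, v in payload.items() if v not in (None, "")})

        if rows:
            s3.put_object(
                Bucket="example-etl-landing",  # placeholder bucket
                Key="landing/{}.json".format(context.aws_request_id),
                Body="\n".join(json.dumps(r) for r in rows).encode("utf-8"),
            )
        return {"records_processed": len(rows)}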

Confidential

AWS Data Engineer

Responsibilities:

  • Involved in the complete big data flow of the application, from ingesting upstream data into the AWS S3 data lake to processing and analyzing the data in the lake.
  • Followed Agile & Scrum principles in developing the project.
  • Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
  • Imported large data sets from DB2 into Hive tables using Sqoop.
  • Created partitioned tables in Parquet format with Snappy compression and loaded the data into Glue Catalog tables that can be queried with Athena (a sketch follows this list).
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them. Developed Spark scripts using the Scala shell as per the requirements.
  • Used Spark SQL to load JSON data and create schema RDDs, loaded them into Hive tables, and handled structured data using Spark SQL.
  • Developed Spark code in PySpark and Spark SQL for faster testing and processing of data, loading the data into Spark RDDs and performing in-memory computation to generate the output with lower memory usage.
  • Utilized the Spark Core and Spark SQL APIs for faster processing of data instead of using MapReduce in Java.
  • Responsible for data extraction and integration from different data sources into the AWS data lake by creating ETL pipelines in PySpark implemented on AWS Glue and EMR.
  • Loaded the data into Spark RDDs and performed in-memory computation to generate the output response.
  • Wrote Pig scripts to clean up the ingested data and created partitions for the daily data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
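
A minimal sketch of the partitioned Parquet-with-Snappy pattern described above: read JSON from S3 with Spark, derive a partition column, and write Snappy-compressed Parquet back to the data lake, where a Glue crawler or DDL can expose it to Athena. Paths and column names are illustrative placeholders.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder.appName("json-to-partitioned-parquet")
        .config("spark.sql.parquet.compression.codec", "snappy")
        .getOrCreate()
    )

    # Load raw JSON events from the S3 data lake (placeholder path).
    events = spark.read.json("s3://example-datalake/raw/events/")

    # Derive a partition column and keep only the fields downstream tables need.
    curated = (
        events.withColumn("event_date", F.to_date("event_timestamp"))
        .select("event_id", "customer_id", "event_type", "event_timestamp", "event_date")
    )

    # Write partitioned, Snappy-compressed Parquet for the curated zone.
    (
        curated.write.mode("overwrite")
        .partitionBy("event_date")
        .parquet("s3://example-datalake/curated/events/")
    )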

Confidential

AWS Data Engineer

Responsibilities:

  • Analyzed the SQL scripts and designed the solution to be implemented using PySpark.
  • Involved in converting the MapReduce programs into Spark transformations using Spark RDDs in PySpark.
  • Used Oozie workflows to coordinate Pig and Hive scripts.
  • Worked with Airflow 1.8 and 1.9 for orchestration; familiar with building custom Airflow operators and orchestrating workflows with dependencies spanning multiple clouds.
  • Created DAGs involving various operators, including Python and PySpark operators, and integrated Glue jobs so they could be called from within the DAG (see the Airflow sketch after this list).
  • Implemented Airflow for authoring, scheduling, and monitoring data pipelines.
  • Wrote AWS Lambda functions in Python that invoke Python scripts to perform various transformations and analytics on large data sets in EMR clusters (a Lambda/EMR sketch also follows this list).
  • Migrated existing on-premises applications to AWS; used AWS services such as EC2 and S3 for processing and storing small data sets, and maintained a Hadoop cluster on AWS EMR.
  • Developed PySpark code for both AWS Glue and EMR.
  • Created Step Functions to orchestrate Glue jobs to run sequentially and designed error-handling scenarios with exception handling and retry mechanisms.
  • Configured EventBridge rules to trigger the Step Function and controlled the data processing schedules.
  • Scheduled Spark applications/steps on the AWS EMR cluster.
  • Wrote Lambda code using the AWS SDK for Java to handle SNS and S3 bucket events as triggers, and wrote scripts that create the Lambda automatically on the first deployment and simply update its code on subsequent deployments.
  • Created a generic extraction framework for SQL Server in Python, along with a generic script to dynamically install the SQL Server driver, which can be reused across many applications and run on AWS.
  • Worked with Mongo Template (Spring Batch API) on EC2 to extract data from MongoDB hosted on AWS, performed various aggregations, and stored the results in a list.
  • Worked on loading datasets from SQL Server to AWS S3 in a serverless application and created a Cerebro layer on top of it; spun up an EMR cluster to test and query the loaded tables.
  • Worked on automating EC2 stack creation from Jenkins and running the required extract process.
  • Worked as part of the SnapLogic platform team, responsible for installing the SnapLogic tool on AWS and configuring the DevOps process.
  • Created code migration scripts, configured Jenkins jobs, and automated scripts for CI/CD integration.
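
A minimal sketch of an Airflow 1.x DAG of the kind described above: one PythonOperator starts a Glue job through Boto3 and a second waits for it to finish. The DAG id, Glue job name, and schedule are illustrative placeholders.

    import time
    from datetime import datetime

    import boto3
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    GLUE_JOB_NAME = "curate-vendor-feed"  # placeholder Glue job name


    def start_glue_job(**context):
        glue = boto3.client("glue")
        # The return value is pushed to XCom for the downstream task.
        return glue.start_job_run(JobName=GLUE_JOB_NAME)["JobRunId"]


    def wait_for_glue_job(**context):
        run_id = context["ti"].xcom_pull(task_ids="start_glue_job")
        glue = boto3.client("glue")
        while True:
            state = glue.get_job_run(JobName=GLUE_JOB_NAME, RunId=run_id)["JobRun"]["JobRunState"]
            if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
                break
            time.sleep(60)
        if state != "SUCCEEDED":
            raise RuntimeError("Glue job ended in state %s" % state)


    dag = DAG(
        dag_id="glue_curation_pipeline",
        default_args={"owner": "data-eng", "start_date": datetime(2019, 1, 1)},
        schedule_interval="@daily",
    )

    start = PythonOperator(
        task_id="start_glue_job",
        python_callable=start_glue_job,
        provide_context=True,
        dag=dag,
    )
    wait = PythonOperator(
        task_id="wait_for_glue_job",
        python_callable=wait_for_glue_job,
        provide_context=True,
        dag=dag,
    )
    start >> wait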
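
A minimal sketch of a Python Lambda that submits a Spark step to an existing EMR cluster, as in the Lambda/EMR bullet above. The cluster id, script path, and arguments are assumptions for illustration.

    import boto3

    emr = boto3.client("emr")

    CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder EMR cluster id


    def handler(event, context):
        """Add a spark-submit step that runs a PySpark transformation script from S3."""
        response = emr.add_job_flow_steps(
            JobFlowId=CLUSTER_ID,
            Steps=[
                {
                    "Name": "transform-large-dataset",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": [
                            "spark-submit",
                            "--deploy-mode", "cluster",
                            "s3://example-scripts/transform_large_dataset.py",  # placeholder script
                            "--input", "s3://example-datalake/raw/",
                            "--output", "s3://example-datalake/curated/",
                        ],
                    },
                }
            ],
        )
        return {"step_ids": response["StepIds"]}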

Confidential

Hadoop Developer

Responsibilities:

  • Worked on loading disparate data sets coming from different sources into the Hadoop environment using Sqoop.
  • Loaded datasets from two different sources, Oracle and MySQL, into HDFS and Hive respectively.
  • Used Control-M to schedule Hadoop jobs on the edge node, with a retry mechanism included for any errors encountered.
  • Implemented bucketing and partitioning on the Hive tables on top of Hadoop, created different views for sensitive data, and controlled access to the views based on the roles assigned to the users (a sketch follows this list).
  • Developed UNIX scripts to create batch loads that bring large volumes of data from relational databases into the big data platform.
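
A minimal sketch of the bucketed, partitioned Hive table and restricted-view pattern described above. For illustration the HiveQL is issued through a SparkSession; the original work used Hive directly, and all database, table, and column names are placeholders.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.appName("hive-bucketing-partitioning")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Partition by load date and bucket by customer id for faster joins and sampling.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS edw.customers (
            customer_id BIGINT,
            full_name   STRING,
            ssn         STRING,
            email       STRING
        )
        PARTITIONED BY (load_date STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
    """)

    # Expose only non-sensitive columns through a view; access to the view rather than
    # the base table is then granted to the appropriate user roles by the Hive admin.
    spark.sql("""
        CREATE VIEW IF NOT EXISTS edw.customers_public AS
        SELECT customer_id, full_name, load_date
        FROM edw.customers
    """)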
