
AWS Data Engineer Resume


New York, NY

SUMMARY

  • 8+ years of IT experience in analysis, design, and development of Big Data solutions in Scala, Spark, Hadoop, Pig, and HDFS environments, with additional experience in Python and Java.
  • Excellent technical and analytical skills with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.
  • Strong experience in building fully automated Continuous Integration and Continuous Delivery pipelines and DevOps processes for agile store-based applications in the Retail and Transportation domains.
  • Firm understanding of Hadoop architecture and various components including HDFS, Job Tracker, Task Tracker, Name Node, Data Node and MapReduce programming.
  • Experience in data analytics, designing reports with visualization solutions using Tableau Desktop, and publishing them onto Tableau Server.
  • Good knowledge of Amazon Web Services (AWS) concepts such as EC2, S3, EMR, ElastiCache, DynamoDB, Redshift, and Aurora.
  • Proven expertise in deploying major software solutions for high-end clients to meet business requirements such as big data processing, ingestion, analytics, and cloud migration from on-premises to AWS using EMR, S3, and DynamoDB.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
  • Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse, as well as managing and granting database access and migrating on-premises databases to Azure Data Lake stores using Azure Data Factory.
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
  • Demonstrated understanding of the Fact/Dimension data warehouse design model, including star and snowflake design methods.
  • Experienced in building Snowpipe, with in-depth knowledge of Data Sharing in Snowflake and of Snowflake database, schema, and table structures.
  • Designed and developed logical and physical data models that utilize concepts such as Star Schema, Snowflake Schema and Slowly Changing Dimensions.
  • Hands-on experience across the Hadoop ecosystem, with extensive experience in Big Data technologies such as HDFS, MapReduce, YARN, Apache Cassandra, HBase, Hive, Oozie, Impala, Pig, Zookeeper, Flume, Kafka, Sqoop, and Spark.
  • Built real-time data pipelines by developing Kafka producers and Spark Streaming applications for consuming the data (a minimal consumer sketch follows this list).
  • Experienced with Spark in improving the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs, and worked explicitly with PySpark.
  • Familiar with data processing performance optimization techniques such as dynamic partitioning, bucketing, file compression, and cache management in Hive, Impala, and Spark.
  • Excellent understanding and knowledge of handling database issues and connections with SQL and NoSQL databases such as MongoDB, Cassandra, HBase, and SQL Server.
  • Experience in dimensional data modeling concepts such as star join schema modeling, snowflake modeling, fact and dimension tables, and physical and logical data modeling.
  • Created and configured a SQL Server Analysis Services database that introduced the company to multidimensional tracking of subscribers, applying statistical techniques using SQL and Excel.
  • Experienced in requirement analysis, application development, application migration and maintenance using Software Development Lifecycle (SDLC) and Python/Java technologies.
  • Adequate knowledge and working experience in Agile and Waterfall Methodologies.
  • Defined user stories and drove the agile board in JIRA during project execution; participated in sprint demos and retrospectives.
  • Good interpersonal and communication skills, strong problem-solving skills, ability to explore and adopt new technologies with ease, and a good team player.
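
The real-time pipeline bullet above refers to the sketch below: a minimal PySpark Structured Streaming consumer that reads JSON events from a Kafka topic and lands them as Parquet. The broker address, topic name, event schema, and S3 paths are illustrative assumptions, and the job also assumes the spark-sql-kafka connector package is available on the cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    # Assumed event schema, used only for illustration.
    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("user_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    spark = SparkSession.builder.appName("kafka-events-consumer").getOrCreate()

    # Read the raw Kafka stream (broker and topic are placeholders).
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "events")
           .load())

    # Parse the JSON payload into typed columns.
    events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
                 .select("e.*"))

    # Land the parsed events as Parquet, with checkpointing for recovery.
    query = (events.writeStream
             .format("parquet")
             .option("path", "s3a://example-bucket/events/")
             .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
             .start())
    query.awaitTermination()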

PROFESSIONAL EXPERIENCE

AWS Data Engineer

Confidential, New York, NY

Responsibilities:

  • Used AWS Athena extensively to ingest structured data from S3 into other systems such as Redshift or to produce reports (a minimal query sketch follows this list).
  • Used the Spark Streaming APIs to perform on-the-fly transformations and actions for building the common learner data model, which receives data from Kinesis in near real time.
  • Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.
  • Used Hive as the primary query engine on EMR and built external table schemas for the data being processed.
  • Created AWS RDS (Relational Database Service) to serve as the Hive metastore, making it possible to consolidate the metadata from 20 EMR clusters into a single RDS instance and avoid data loss even if an EMR cluster was terminated.
  • Involved in developing a shell script that collects and stores user-generated logs in AWS S3 (Simple Storage Service) buckets; this record of all user actions serves as a security indicator for detecting cluster termination and safeguarding data integrity.
  • Implemented partitioning and bucketing concepts in the Apache Hive database, which improves query retrieval performance.
  • Designed and deployed ETL pipelines on S3 Parquet files in a data lake using AWS Glue.
  • Created a CloudFormation template in JSON format to leverage content delivery with cross-region replication through Amazon Virtual Private Cloud.
  • Used AWS CodeCommit repositories to store programming logic and scripts and then replicate them to new clusters.
  • Used multi-node Redshift technology to implement columnar data storage, advanced compression, and massively parallel processing.
  • Worked with the Snowflake cloud data warehouse and AWS S3 buckets to integrate data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
  • Worked on moving a quality monitoring program from AWS EC2 to AWS Lambda, as well as creating logical datasets to administer quality monitoring on Snowflake warehouses.
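
Below is a minimal boto3 sketch of the kind of Athena usage described in the first bullet above: submit a query against S3-resident data and read the results. The region, database, query text, and result location are placeholder assumptions, not values from the original project.

    import time
    import boto3

    # Placeholder region, database, query, and output location.
    athena = boto3.client("athena", region_name="us-east-1")

    execution = athena.start_query_execution(
        QueryString="SELECT order_id, total FROM orders WHERE ds = '2021-01-01'",
        QueryExecutionContext={"Database": "sales_db"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/queries/"},
    )
    query_id = execution["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)

    # Read the result rows; the first row holds the column headers.
    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows[1:]:
            print([field.get("VarCharValue") for field in row["Data"]])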

Environment: Amazon Web Services, Elastic MapReduce (EMR) clusters, EC2, CloudFormation, Amazon S3, Amazon Redshift, DynamoDB, CloudWatch, Hive, Scala, Python, HBase, Apache Spark, Spark SQL, Shell Scripting, Tableau, Cloudera.

Data Engineer

Confidential, Rochester, MN

Responsibilities:

  • Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks to automate the ingestion, processing, and delivery of both structured and unstructured data, in batch and real-time streaming modes, using Python.
  • Worked on building data warehouse structures, creating fact, dimension, and aggregate tables through dimensional modeling with Star and Snowflake schemas.
  • Applied transformations on data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
  • Identified issues and developed a procedure to correct them, which improved the quality of critical tables by eliminating the possibility of duplicate data entering the Data Warehouse.
  • Created scripts in Python that integrated with Amazon APIs to control instance operations.
  • Developed a framework in Scala for processing spreadsheets and joining them with other sources.
  • Good knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
  • Scheduled jobs using Airflow and used Airflow hooks to connect to traditional databases such as DB2, Oracle, and Teradata (a minimal DAG sketch follows this list).
  • Built S3 buckets, managed their policies, and used S3 and Glacier for storage and backup on AWS.
  • Managed and administered AWS services including the CLI, EC2, S3, and Trusted Advisor.
  • Used Python with SQLAlchemy to connect to databases and query the sources to fetch data.
  • Hands-on experience developing UDFs, DataFrames, and SQL queries in Spark SQL.
  • Created new and modified existing data ingestion pipelines using Kafka and Sqoop to ingest database tables and streaming data into HDFS for analysis.
  • Finalized naming standards for data elements and ETL jobs and created a data dictionary for metadata management.
  • Planned and coordinated the analysis, design, and extraction of encounter data from multiple source systems into the data warehouse relational database (Oracle) while ensuring data integrity.
  • Involved in designing and developing Amazon EC2, Amazon S3, Amazon RDS, Amazon Elastic Load Balancing, Amazon SWF, Amazon SQS, and other services of the AWS infrastructure.
  • Worked on developing ETL workflows in Python to process data ingested via Flume into HDFS and HBase.
  • Hands-on experience working with Continuous Integration and Deployment (CI/CD) using Jenkins and Docker.
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake SnowSQL, writing SQL queries against Snowflake.
  • Conducted systems design, feasibility, and cost studies and recommended cost-effective cloud solutions such as Amazon Web Services (AWS).
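
The Airflow bullet above refers to the sketch below: a minimal Airflow 2.x DAG with a single PythonOperator task that pulls the previous day's rows from a source database through SQLAlchemy. The DAG id, connection string, and table/column names are illustrative assumptions only.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from sqlalchemy import create_engine, text

    # Placeholder connection string; real credentials would come from an Airflow connection.
    SOURCE_DB_URI = "oracle+cx_oracle://user:password@source-host:1521/?service_name=ORCL"

    def extract_encounters(**context):
        """Pull the previous day's encounter rows from the source database."""
        engine = create_engine(SOURCE_DB_URI)
        with engine.connect() as conn:
            rows = conn.execute(
                text("SELECT * FROM encounters WHERE load_date = :ds"),
                {"ds": context["ds"]},
            ).fetchall()
        # Downstream tasks would stage these rows to S3 or the warehouse.
        return len(rows)

    with DAG(
        dag_id="daily_encounter_ingest",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(
            task_id="extract_encounters",
            python_callable=extract_encounters,
        )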

Environment: Python, HDFS, Spark, Kafka, Hive, Yarn, Cassandra, HBase, Jenkins, Docker, Tableau, Splunk, BO Reports, Netezza, UDB, MySQL, Snowflake, IBM DataStage.

Big Data Engineer/Hadoop Developer

Confidential, Pataskala, Ohio

Responsibilities:

  • Responsible for the design, implementation, and architecture of very large-scale data intelligence solutions around big data platforms.
  • Analyzed large and critical datasets using HDFS, HBase, Hive, HQL, Pig, Sqoop and Zookeeper.
  • Developed multiple POCs using Spark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
  • Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as the storage mechanism.
  • Capable of using AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
  • Developed ETL procedures to ensure conformity and compliance with minimal redundancy, translating business rules and functional requirements into ETL procedures.
  • Maintained AWS Data Pipeline as a web service to process and move data between Amazon S3, Amazon EMR, and Amazon RDS resources.
  • Worked on SQL queries in dimensional data warehouses and relational data warehouses. Performed Data Analysis and Data Profiling using Complex SQL queries on various systems.
  • Troubleshot and resolved data processing issues and proactively engaged in data modeling discussions.
  • Worked on RDD architecture, implementing Spark operations on RDDs and optimizing transformations and actions in Spark.
  • Wrote Spark programs in Python using the PySpark and Pandas packages for performance tuning, optimization, and data quality validations.
  • Worked on developing Kafka producers and Kafka consumers for streaming millions of events per second (a minimal producer sketch follows this list).
  • Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka.
  • Hands-on experience fetching live stream data from UDB into HBase tables using PySpark Streaming and Apache Kafka.
  • Worked on Tableau to build customized interactive reports, worksheets, and dashboards.
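
The Kafka producer bullet above points to the sketch below, a minimal kafka-python producer that serializes events as JSON before publishing them; the broker list, topic name, and event fields are assumptions made for illustration.

    import json

    from kafka import KafkaProducer

    # Placeholder broker address and topic name.
    producer = KafkaProducer(
        bootstrap_servers=["broker1:9092"],
        value_serializer=lambda record: json.dumps(record).encode("utf-8"),
    )

    def publish_event(event: dict) -> None:
        """Publish one event to the stream; batching and retries are left to the client."""
        producer.send("clickstream-events", value=event)

    publish_event({"user_id": "u123", "action": "page_view"})
    producer.flush()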

Environment: HDFS, Python, SQL, Web Services, MapReduce, Spark, Kafka, Hive, Yarn, Pig, Flume, Zookeeper, Sqoop, UDB, Tableau, AWS, GitHub, Shell Scripting.

Big Data Engineer

Confidential

Responsibilities:

  • Responsible for building scalable distributed data solutions in a Hadoop cluster environment using the Hortonworks distribution.
  • Converted raw data to serialized formats such as Avro and Parquet to reduce data processing time and increase data transfer efficiency across the network (a minimal conversion sketch follows this list).
  • Worked on building end to end data pipelines on Hadoop Data Platforms.
  • Worked on normalization and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Designed, developed, and tested Extract, Transform, Load (ETL) applications with different types of sources.
  • Created files and tuned SQL queries in Hive utilizing HUE; implemented MapReduce jobs in Hive by querying the available data.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
  • Experience with PySpark, using Spark libraries through Python scripting for data analysis.
  • Involved in converting HiveQL into Spark transformations using Spark RDDs and Scala programming.
  • Created User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) in Pig and Hive.
  • Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
  • Implemented custom Kafka encoders for custom input formats to load data into Kafka partitions.
  • Supported the cluster and topics through Kafka Manager; handled CloudFormation scripting, security, and resource automation.
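
The format-conversion bullet above refers to this sketch: a minimal PySpark job that converts a raw delimited feed into partitioned Parquet. The paths, header/inferSchema options, and the txn_date partition column are assumptions; reading Avro in the same way would additionally require the spark-avro package.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-to-parquet").getOrCreate()

    # Read the raw CSV feed (placeholder path and options).
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("hdfs:///data/raw/transactions/"))

    # Write columnar Parquet, partitioned by date for faster downstream reads.
    (raw.write
        .mode("overwrite")
        .partitionBy("txn_date")
        .parquet("hdfs:///data/curated/transactions/"))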

Environment: Python, HDFS, MapReduce, Flume, Kafka, Zookeeper, Pig, Hive, HQL, HBase, Spark, ETL, Web Services, Red Hat Linux, Unix.

Hadoop Engineer/Developer

Confidential

Responsibilities:

  • Designed and developed applications on the data lake to transform data according to business users' requirements for analytics.
  • Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and unstructured data.
  • Worked with different file formats such as CSV, TXT, and fixed-width to load data from various sources into raw tables (a minimal parsing sketch follows this list).
  • Conducted data model reviews with team members and captured technical metadata through modelling tools.
  • Implemented ETL processes; wrote and optimized SQL queries to perform data extraction and merging from the SQL Server database.
  • Experience in loading logs from multiple sources into HDFS using Flume.
  • Worked with NoSQL databases like HBase in creating HBase tables to store large sets of semi-structured data coming from various data sources.
  • Involved in designing and developing tables in HBase and storing aggregated data from Hive tables.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD's and Scala.
  • Data cleaning, pre-processing, and modelling using Spark and Python.
  • Strong experience in writing SQL queries.
  • Responsible for triggering the jobs using Control-M.
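
The fixed-width file bullet above refers to the sketch below, a minimal PySpark approach that slices each line by position into named columns; the field positions, column names, and paths are illustrative assumptions rather than the actual file layout.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, trim

    spark = SparkSession.builder.appName("fixed-width-load").getOrCreate()

    # Each record arrives as one line in the "value" column.
    lines = spark.read.text("hdfs:///landing/members_fixed_width.txt")

    # Slice by position (1-based start, length) into named columns.
    members = lines.select(
        trim(col("value").substr(1, 10)).alias("member_id"),
        trim(col("value").substr(11, 30)).alias("member_name"),
        col("value").substr(41, 8).alias("enroll_date"),
    )

    # Land the parsed records into the raw zone as Parquet.
    members.write.mode("overwrite").parquet("hdfs:///raw/members/")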

Environment: Python, SQL, ETL, Hadoop, HDFS, Spark, Scala, Kafka, HBase, MySQL, Netezza, Web Services, Shell Script, Control-M.
