AWS Data Engineer Resume

Englewood, CO

SUMMARY

  • Around 8 years of IT experience as a data engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem and delivering Big Data analytics, cloud data engineering (AWS, Azure), data visualization, data warehouse, reporting, and data quality solutions.
  • Hands-on expertise with the Hadoop ecosystem, including strong knowledge of Big Data technologies such as HDFS, Spark, YARN, Kafka, MapReduce, Apache Cassandra, HBase, Zookeeper, Hive, Oozie, Impala, Pig, and Flume.
  • Worked extensively with PySpark, using Spark Context, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs, to improve the efficiency and optimization of existing Hadoop approaches (a brief sketch follows this summary).
  • Good understanding of Spark Architecture including Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors, and Tasks.
  • In-depth understanding and experience with real-time data streaming technologies such as Kafka and Spark Streaming.
  • Hands-on experience with AWS components such as EMR, EC2, S3, RDS, IAM, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, Redshift, and DynamoDB, used to build a secure environment for an organization in the AWS public cloud.
  • Proven experience delivering solutions for a wide range of high-end clients, including big data processing, ingestion, analytics, and cloud migration from on-premises systems to the AWS cloud.
  • Expertise in Azure infrastructure management (Azure Web Roles, Worker Roles, SQL Azure, Azure Storage).
  • Experience migrating SQL databases to Azure Data Lake, Azure Synapse, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Strong experience with Informatica ETL, including the Informatica PowerCenter Designer, Workflow Manager, Workflow Monitor, Informatica Server, and Repository Manager components.
  • Good understanding of Spark architecture with Databricks and Structured Streaming; set up Databricks on AWS and Microsoft Azure, built Databricks workspaces for business analytics, managed Databricks clusters, and managed the machine learning lifecycle.
  • Demonstrated understanding of the Fact/Dimension data warehouse design model, including star and snowflake design methods.
  • Experienced in building Snowpipe, with in-depth knowledge of data sharing in Snowflake and of Snowflake database, schema, and table structures.
  • Designed and developed logical and physical data models that utilize concepts such as Star Schema, Snowflake Schema and Slowly Changing Dimensions.
  • Expertise in using Airflow and Oozie to create, debug, schedule, and monitor ETL jobs.
  • Experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
  • Experience with file formats such as Avro, Parquet, ORC, JSON, and XML, and compression codecs such as Snappy and bzip2.
  • Experienced in configuring and administering the Hadoop Cluster using major Hadoop Distributions like Apache Hadoop and Cloudera.
  • Hands-on experience handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, and SQL Server; created Java applications to manage data in MongoDB and HBase.
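The PySpark/DataFrame work referenced above typically follows a pattern like this minimal sketch (illustrative only; the paths, dataset, and column names are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("events-aggregation").getOrCreate()

    # read raw events from a (placeholder) S3 location
    events = spark.read.parquet("s3a://example-bucket/raw/events/")

    # de-duplicate and aggregate with the DataFrame API
    daily_counts = (
        events
        .dropDuplicates(["event_id"])
        .withColumn("event_date", F.to_date("event_ts"))
        .groupBy("event_date", "event_type")
        .count()
    )

    # write partitioned Parquet output for downstream consumers
    (daily_counts.write
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet("s3a://example-bucket/curated/daily_event_counts/"))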

PROFESSIONAL EXPERIENCE

AWS Data Engineer

Confidential, Englewood, CO

Responsibilities:

  • Implemented solutions using advanced AWS components (EMR, EC2, etc.) integrated with Big Data/Hadoop frameworks such as Hadoop YARN, MapReduce, Spark, and Hive.
  • Used AWS Athena extensively to query structured data in S3 for multiple downstream systems, including Redshift, and to generate reports.
  • Created on-demand tables over S3 files with Lambda functions and AWS Glue, using Python and PySpark.
  • Performed end-to-end Architecture and implementation assessment of various AWS services like Amazon EMR, Redshift, S3, Athena, Glue, and Kinesis.
  • Created an AWS RDS (Relational Database Service) instance to serve as the Hive metastore and consolidated the metadata of multiple EMR clusters into that single RDS database, which prevents metadata loss even when an EMR cluster is terminated.
  • Involved in migrating a quality-monitoring tool from AWS EC2 to AWS Lambda and built logical datasets to administer quality monitoring on Snowflake warehouses.
  • Installed and configured Apache Airflow to work with S3 buckets and the Snowflake data warehouse and created DAGs to orchestrate the loads (a DAG sketch follows this list).
  • Loaded the data into Spark RDD and performed in-memory data computation to generate the output response.
  • Created ETL jobs in AWS Glue to load vendor data from different sources, with transformations for data cleaning, imputation, and mapping, and stored the results in S3 buckets for later querying with AWS Athena (a Glue job sketch follows this list).
  • Designed and developed ETL processes using Informatica 10.4 to load data from a wide range of sources such as Oracle, flat files, Salesforce, and the AWS cloud.
  • Extracted and uploaded data into AWS S3 buckets using the Informatica AWS plugin.
  • Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.
  • Queried both Managed and External tables created by Hive using Impala.
  • Monitored and controlled Local disk storage and Log files using Amazon CloudWatch.
  • Played a key role in dynamic partitioning and Bucketing of the data stored in Hive Metadata.
  • Involved in extracting large volumes of data and analyzing complex business logic to derive business-oriented insights, and recommended/proposed new solutions to the business in Excel reports.
  • Experienced in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and tuning memory.
  • Encoded and decoded JSON objects using PySpark to create and modify DataFrames in Apache Spark.
  • Created build and release definitions for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS).
  • Developed ETL jobs to automate real-time data retrieval from Salesforce.com and suggested best methods for data replication from Salesforce.com.
  • Used AWS Data Pipeline for data extraction, transformation, and loading from homogeneous and heterogeneous data sources, and built various graphs for business decision-making using the Python Matplotlib library.
  • Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
  • Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR to perform the necessary transformations based on the STMs developed
  • Designed, developed, and managed Power BI, Tableau, QlikView, and Qlik Sense apps, including dashboards, reports, and storytelling.
  • Created a new 13-page Power BI reporting dashboard to the design spec in two weeks, beating a tight timeline. Deployed an automation to production that updates the company holiday schedule, which must be refreshed yearly per the company's holiday policy.
  • Used Informatica Power Center for extraction, transformation, and loading (ETL) of data in the data warehouse.
  • Loaded data into Snowflake tables from internal stages using SnowSQL (a Python-based load sketch follows this list).
  • Prepared the data warehouse in Snowflake using star/snowflake schema concepts and SnowSQL.
  • Prepared Tableau reports and dashboards with calculated fields, parameters, sets, groups, and bins, and published them on the server.
  • Designed and implemented ETL pipelines over Parquet files in the S3 data lake using AWS Glue.
  • Designed the AWS architecture, cloud migration, DynamoDB usage, and event processing with Lambda functions.
  • Experience managing and securing custom AMIs and AWS account access using IAM.
  • Managed storage in AWS using Elastic Block Store and S3; created volumes and configured snapshots.
  • Experience configuring AWS S3 buckets and their lifecycle policies to back up and archive files in Amazon Glacier.
  • Experience in creating and maintaining the databases in AWS using RDS.
  • Created monitors, alarms, notifications, and logs for Lambda functions, Glue jobs, and EC2 hosts using CloudWatch, and used Airflow for data transformation, validation, and cleansing.
  • Experience in Building and Managing Hadoop EMR clusters on AWS.
  • Used AWS Elastic Beanstalk for deploying and scaling web applications and services developed with Java.
  • Developed Scripts for AWS Orchestration
  • Designed tool API and MapReduce job workflow using AWS EMR and S3.
  • Used Spark-Streaming APIs to perform necessary transformations and actions on the fly for building the common learner data model which gets the data from Kinesis in near real-time.
  • Worked with the Snowflake cloud data warehouse and AWS S3 buckets to integrate data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
  • Used the AWS Glue Data Catalog with crawlers to discover data in S3 and run SQL query operations, and used a JSON schema to define the table and column mapping from S3 data to Redshift.
  • Worked on EMR security configurations to store self-signed certificates as well as KMS keys, making it possible to spin up a cluster easily without modifying permissions afterward.
  • Worked with Cloudera 5.12.x and its different components.
  • Installation and setup of multi node Cloudera cluster on AWS cloud.
  • Created Redshift clusters on AWS for quick accessibility for reporting needs. Designed and deployed a Spark cluster and different Big Data analytic tools including Spark, Kafka streaming, AWS and HBase with Cloudera Distribution.
  • Involved in importing the real-time data using Kafka and implemented Oozie jobs for daily imports.
  • In the Tableau development environment, supported customer service by designing ETL jobs and dashboards utilizing data from Redshift.
  • Optimized and tuned the Redshift environment, enabling queries to run up to 100x faster for Tableau and SAS Visual Analytics.
  • Applied partitioning and bucketing to the data stored in the Apache Hive database, which improves retrieval speed at query time.
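A minimal sketch of the kind of AWS Glue ETL job described above (the database, table, and bucket names are hypothetical, not the client's actual names):

    import sys
    from awsglue.transforms import DropNullFields
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    # standard Glue job boilerplate: resolve arguments and build the Glue/Spark contexts
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # read the vendor feed registered in the Glue Data Catalog (names are placeholders)
    vendors = glue_context.create_dynamic_frame.from_catalog(
        database="vendor_raw", table_name="vendor_feed")

    # simple cleaning step; real jobs would also impute and remap columns
    cleaned = DropNullFields.apply(frame=vendors)

    # write Parquet to S3 so the result can be queried with Athena
    glue_context.write_dynamic_frame.from_options(
        frame=cleaned,
        connection_type="s3",
        connection_options={"path": "s3://example-curated-bucket/vendors/"},
        format="parquet",
    )

    job.commit()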
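The Airflow orchestration mentioned above could be sketched as a DAG like the following (connection IDs, bucket, stage, and table names are assumptions, and the Amazon and Snowflake provider packages are required):

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
    from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

    with DAG(
        dag_id="s3_to_snowflake_daily",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        # wait until the day's file lands in the S3 bucket
        wait_for_file = S3KeySensor(
            task_id="wait_for_vendor_file",
            bucket_name="example-landing-bucket",
            bucket_key="incoming/vendors/{{ ds }}/*.csv",
            wildcard_match=True,
            aws_conn_id="aws_default",
        )

        # copy the staged data into the target Snowflake table
        load_to_snowflake = SnowflakeOperator(
            task_id="copy_into_vendors",
            snowflake_conn_id="snowflake_default",
            sql="COPY INTO analytics.vendors FROM @vendor_stage PATTERN = '.*{{ ds }}.*'",
        )

        wait_for_file >> load_to_snowflake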
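Loading from an internal stage with SnowSQL, as noted above, follows the PUT / COPY INTO pattern; a rough Python equivalent using the snowflake-connector-python package might look like this (account, warehouse, stage, and table names are placeholders):

    import os
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="xy12345.us-east-1",                # placeholder account locator
        user="ETL_USER",
        password=os.environ["SNOWFLAKE_PASSWORD"],  # keep credentials out of code
        warehouse="LOAD_WH",
        database="ANALYTICS",
        schema="PUBLIC",
    )

    try:
        cur = conn.cursor()
        # stage the local file in the table's internal stage, then copy it in
        cur.execute("PUT file:///data/exports/orders.csv @%ORDERS AUTO_COMPRESS=TRUE")
        cur.execute("""
            COPY INTO ORDERS
            FROM @%ORDERS
            FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
            ON_ERROR = 'ABORT_STATEMENT'
        """)
    finally:
        conn.close()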

Environment: Spark 1.6/2.0 (PySpark, Spark RDD, Spark Streaming, MLlib), AWS Cloud (EMR, EC2, S3, Amazon RDS, Glue, Athena, Redshift), Apache Kafka, ETL, Python (NumPy, SciPy, pandas, Scikit-learn, Seaborn, NLTK), Data Lake, Cloudera Stack, HDFS, HBase, Hive, Impala, Pig, NiFi, Elasticsearch, Logstash, Kibana, ELK/Splunk, JAX-RS, Spring, Hibernate, Apache Airflow, Oozie, RESTful API, JSON, JAXB, XML, WSDL, SQL, MySQL, Cassandra, MongoDB, Azure, Tableau, Scala, Snowflake, SnowSQL, Java, Jenkins.

Sr. AWS Data Engineer

Confidential, Indianapolis, IN

Responsibilities:

  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop, using Spark Context, Spark SQL, DataFrames, and Spark on YARN.
  • Involved in file movements between HDFS and AWS S3, worked extensively with S3 buckets in AWS, and converted all Hadoop jobs to run on EMR by configuring the cluster according to the data size.
  • Wrote Spark applications for data validation, cleansing, transformations, and custom aggregations; imported data from different sources into Spark RDDs for processing; and developed custom aggregate functions using Spark SQL for interactive querying.
  • Worked on data pipeline creation to convert incoming data to a common format, prepare data for analysis and visualization, migrate between databases, share data-processing logic across web apps, batch jobs, and APIs, and consume large XML, CSV, and fixed-width files; created data pipelines in Kafka to replace batch jobs with real-time data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala, and used Sqoop to import and export data between RDBMS and HDFS.
  • Collected data from AWS S3 buckets in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS (a streaming sketch follows this list).
  • Created AWS Glue job for archiving data from Redshift tables to S3 (online to cold storage) as per data retention requirements and involved in managing S3 data layers and databases including Redshift and Postgres.
  • Processed web server logs by developing multi-hop Flume agents with an Avro sink, loaded the logs into MongoDB for further analysis, and worked on MongoDB NoSQL data modeling, tuning, disaster recovery, and backup.
  • Developed a Python script to load CSV files into S3 buckets; created the AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket (a minimal loader sketch follows this list).
  • Worked with file formats such as JSON, Avro, and Parquet and compression techniques such as Snappy, and developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job, for workflow management and automation using Airflow.
  • Developed shell scripts for dynamic partitions adding to hive stage table, verifying JSON schema change of source files, and verifying duplicate files in source location.
  • Worked with importing metadata into Hive using Python and migrated existing tables and applications to work on AWS cloud (S3).
  • Integrated Hadoop into traditional ETL, accelerating the extraction, transformation, and loading of massive structured and unstructured data.
  • Wrote scripts against Oracle, SQL Server, and Netezza databases to extract data for reporting and analysis, and imported and cleansed high-volume data from sources such as DB2, Oracle, and flat files into SQL Server.
  • Managed containers with Docker by writing Dockerfiles, set up automated builds on Docker Hub, and installed and configured Kubernetes.
  • Worked extensively on importing metadata into Hive, migrated existing tables and applications to Hive and the AWS cloud, and made the data available in Athena and Snowflake.
  • Extensively used Stash (Bitbucket) for code control and worked on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
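A minimal sketch of the CSV-to-S3 loader mentioned above (the bucket name, prefix, and local paths are placeholders):

    import glob
    import os
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-landing-bucket"

    # push every exported CSV into the landing bucket under a common prefix
    for path in glob.glob("/data/exports/*.csv"):
        key = f"incoming/{os.path.basename(path)}"
        s3.upload_file(path, BUCKET, key)
        print(f"uploaded {path} -> s3://{BUCKET}/{key}")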
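The near-real-time collection from S3 described above can be sketched with Spark Structured Streaming's file source (the paths, schema, and window sizes are illustrative assumptions):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StringType, TimestampType

    spark = SparkSession.builder.appName("s3-learner-stream").getOrCreate()

    # schema of the incoming learner events (hypothetical)
    schema = (StructType()
              .add("learner_id", StringType())
              .add("course_id", StringType())
              .add("event_ts", TimestampType()))

    # treat new JSON files landing in the S3 prefix as a stream
    events = spark.readStream.schema(schema).json("s3a://example-bucket/incoming/")

    # windowed aggregation with a watermark so finalized windows can be appended
    counts = (events
              .withWatermark("event_ts", "10 minutes")
              .groupBy(F.window("event_ts", "5 minutes"), "course_id")
              .count())

    # persist the aggregated results to HDFS with checkpointing
    (counts.writeStream
        .outputMode("append")
        .format("parquet")
        .option("path", "hdfs:///data/learner_counts/")
        .option("checkpointLocation", "hdfs:///checkpoints/learner_counts/")
        .start()
        .awaitTermination())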

Environment: Spark, Spark SQL, Spark Streaming, AWS (EC2, EMR, S3, Glue), Hive, SQL Workbench, Tableau, Kibana, Sqoop, Scala, Python, Hadoop (Cloudera Stack), Informatica, Jenkins, Docker, Hue, Netezza, Kafka, HBase, HDFS, Pig, Oracle, ETL, Git, Grafana.

Big Data Engineer

Confidential, Boyertown, PA

Responsibilities:

  • Created Spark jobs by writing RDDs in Python and created DataFrames in Spark SQL to perform data analysis, storing the results in Azure Data Lake.
  • Configured Spark Streaming to receive real-time data from Apache Kafka and store the streaming data in HDFS using Scala (a PySpark sketch of this flow follows this list).
  • Developed Spark applications using Kafka and implemented an Apache Spark data-processing project to handle data from various RDBMS and streaming sources.
  • Created various data pipelines using Spark, Scala and SparkSQL for faster processing of data.
  • Designed batch processing jobs using Apache Spark to increase speed compared to that of MapReduce jobs.
  • Wrote Spark SQL and embedded the SQL in Scala files to generate JAR files for submission to the Hadoop cluster.
  • Developed data pipeline using Flume to ingest data and customer histories into HDFS for analysis.
  • Executing Spark SQL operations on JSON, transforming the data into a tabular structure using data frames, and storing and writing the data to Hive and HDFS.
  • Worked with the Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HQL queries.
  • Created Hive tables as per requirements, either internal or external, defined with appropriate static or dynamic partitions and bucketing for efficiency (a DDL sketch follows this list).
  • Used Hive as an ETL tool for event joins, filters, transformations, and pre-aggregations.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Kafka.
  • Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, and processing and persisting the results.
  • Used the Spark SQL Scala interface, which automatically converts RDDs of case classes to schema RDDs.
  • Extracted source data from Sequential files, XML files, CSV files, transformed and loaded it into the target Data warehouse.
  • Solid understanding of NoSQL databases (MongoDB and Cassandra).
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL, and Scala, and used Sqoop to move large datasets between Cassandra/Oracle servers and HDFS in both directions.
  • Involved in Migrating the platform from Cloudera to EMR platform.
  • Developed analytical component using Scala, Spark and Spark Streaming.
  • Worked on developing ETL processes to load data from multiple data sources to HDFS using FLUME, and performed structural modifications using HIVE.
  • Provided technical solutions on MS Azure HDInsight, Hive, HBase, MongoDB, Telerik, Power BI, Spot Fire, Tableau, Azure SQL Data Warehouse Data Migration Techniques using BCP, Azure Data Factory, and Fraud prediction using Azure Machine Learning.
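A PySpark sketch of the Kafka-to-HDFS streaming flow described above, shown with the Structured Streaming API (the production jobs were written in Scala; the broker, topic, and paths are placeholders, and the spark-sql-kafka package is assumed to be on the classpath):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # subscribe to the Kafka topic as a streaming source
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")
           .option("subscribe", "clickstream")
           .option("startingOffsets", "latest")
           .load())

    # Kafka delivers key/value as binary, so cast the payload to string first
    messages = raw.selectExpr("CAST(value AS STRING) AS payload",
                              "timestamp AS kafka_ts")

    # land the raw messages in HDFS as Parquet, with a checkpoint for fault tolerance
    (messages.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/clickstream/raw/")
        .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
        .start()
        .awaitTermination())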
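Illustrative DDL for the kind of partitioned, bucketed external Hive table described above, issued here through spark.sql with Hive support enabled (the database, table, column names, and location are hypothetical; the actual loads ran in Hive):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-ddl-example")
             .enableHiveSupport()
             .getOrCreate())

    # external table partitioned by date and bucketed by customer for query efficiency
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS sales.orders (
            order_id    STRING,
            customer_id STRING,
            amount      DECIMAL(10,2)
        )
        PARTITIONED BY (order_date STRING)
        CLUSTERED BY (customer_id) INTO 32 BUCKETS
        STORED AS ORC
        LOCATION '/warehouse/sales/orders'
    """)

    # register any partitions already present at the external location
    spark.sql("MSCK REPAIR TABLE sales.orders")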

Environment: Hadoop, Hive, Kafka, Snowflake, Spark, Scala, HBase, Cassandra, JSON, XML, UNIX shell scripting, Cloudera, MapReduce, Power BI, ETL, MySQL, NoSQL

Big Data Engineer

Confidential

Responsibilities:

  • Collaborated with business users, product owners, and developers to contribute to the analysis of functional requirements.
  • Implemented Spark SQL queries that combine Hive queries with Python programmatic data manipulations supported by RDDs and data frames.
  • Configured Spark Streaming to consume Kafka streams and store the data in HDFS.
  • Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the results in HDFS.
  • Developed Spark scripts and UDFs using Spark SQL queries for data aggregation and querying, and wrote data back into the RDBMS through Sqoop (a UDF sketch follows this list).
  • Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
  • Installed and configured Pig and wrote Pig Latin scripts.
  • Wrote MapReduce jobs using Pig Latin.
  • Worked on analyzing Hadoop clusters using different big data analytic tools including HBase database and Sqoop.
  • Worked on importing and exporting data from Oracle, and DB2 into HDFS and HIVE using Sqoop for analysis, visualization, and generating reports.
  • Created Hive tables and dynamically inserted data into them using partitioning and bucketing, for EDW tables and historical metrics.
  • Experienced in handling large datasets using Partitions, Spark in Memory capabilities, Broadcasts in Spark, Effective & efficient Joins, Transformations, and others during the ingestion process itself.
  • Created ETL packages with different data sources (SQL Server, Oracle, Flat files, Excel, DB2, and Teradata) and loaded the data into target tables by performing different kinds of transformations using SSIS.
  • Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
  • Created partitions and buckets on state in Hive to handle structured data, used alongside Elasticsearch.
  • Performed Sqoop transfers through HBase tables to move data into several NoSQL databases: Cassandra and MongoDB.
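A minimal sketch of a Spark SQL UDF used inside an aggregation query, as mentioned above (the function, table, and column names are hypothetical; the export back to the RDBMS via Sqoop is not shown):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType

    spark = (SparkSession.builder
             .appName("udf-aggregation")
             .enableHiveSupport()
             .getOrCreate())

    # normalize free-text region codes before grouping
    def normalize_region(code):
        return code.strip().upper() if code else "UNKNOWN"

    spark.udf.register("normalize_region", normalize_region, StringType())

    # aggregate with the registered UDF and persist the summary table
    summary = spark.sql("""
        SELECT normalize_region(region) AS region, SUM(amount) AS total_amount
        FROM staging.transactions
        GROUP BY normalize_region(region)
    """)

    summary.write.mode("overwrite").saveAsTable("analytics.region_totals")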

Environment: Hadoop, MapReduce, HDFS, Hive, Python, Kafka, HBase, Sqoop, NoSQL, Spark 1.9, PL/SQL, Oracle, Cassandra, MongoDB, ETL, MySQL

Data Analyst

Confidential

Responsibilities:

  • Involved in designing physical and logical data models using the ERwin data modeling tool.
  • Designed the relational data model for the operational data store and staging areas, and designed dimension and fact tables for data marts.
  • Extensively used ERwin Data Modeler to design logical/physical data models and relational database designs.
  • Created Stored Procedures, Database Triggers, Functions and Packages to manipulate the database and to apply the business logic according to the user's specifications.
  • Created Triggers, Views, Synonyms and Roles to maintain integrity plan and database security.
  • Created database links to connect to other servers and access the required information.
  • Integrity constraints, database triggers and indexes were planned and created to maintain data integrity and to facilitate better performance.
  • Used Oracle Advanced Queuing for exchanging messages and communicating between different modules.
  • Performed system analysis and design for enhancements, and tested forms, reports, and user interaction.

Environment: Oracle 9i, SQL*Plus, PL/SQL, ERwin, TOAD, stored procedures.
