
AWS Data Engineer Resume


St Louis, MI

SUMMARY

  • Over 6 years of experience in Big Data environments and the Hadoop ecosystem, including 3 years on AWS and Azure.
  • Extensive knowledge of Spark Streaming, Spark SQL, and other Spark features such as accumulators, broadcast variables, the various caching levels, and optimization techniques for Spark jobs.
  • Prior experience with the AWS Big Data/Hadoop ecosystem and Data Lake implementations.
  • Extensive hands-on experience with AWS services such as Security Groups, EMR, S3, EC2, Route 53, RDS, ELB, DynamoDB, SQS, SNS, Glue, QuickSight, SageMaker, and CloudFormation, with solid experience using Redshift Spectrum and AWS Athena query services to read data directly from S3.
  • Good knowledge of Apache Spark components, including Spark Core, Spark SQL, Spark Streaming, and Spark MLlib.
  • Developed ETL scripts for data acquisition and transformation using Informatica and Talend.
  • Deployed and tested code through CI/CD pipelines using Visual Studio Team Services (VSTS).
  • Wrote Spark applications for data validation, cleansing, transformation, and aggregation.
  • Created pipelines in Azure Data Factory (ADF) using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL Database, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
  • Developed Cassandra tables to store a variety of data formats from various sources
  • In-depth understanding of and experience with real-time data streaming technologies such as Kafka and Spark Streaming.
  • Set up Spark Streaming to receive real-time data from Kafka and persist the stream to HDFS (a minimal sketch follows this summary).
  • Worked with MongoDB database concepts like locking, transactions, indexes, and replication
  • Developed batch and streaming workflows with the built-in Stonebranch scheduler and bash scripts to automate Data Lake systems.
  • Programmed in C#, U-SQL, and Hive, writing SCOPE scripts against Data Lake and HDFS to structure petabytes of unstructured data stored in the Azure Data Lake (Cosmos) big data system.
  • Worked on a Hortonworks-based Hadoop platform deployed on a 120-node cluster to build the Data Lake, using Spark, Hive, and NoSQL stores for data processing.
  • Reduced manual work for business users by developing Python scripts against LDA sourcing, One Lake, SDP, Databricks, Data bench, and Snowflake to collect cloud metrics.
  • Familiar with data processing performance optimization techniques such as dynamic partitioning, bucketing, file compression, and cache management in Hive, Impala, and Spark.
  • Proficient with container systems such as Docker and container orchestration tools such as EC2 Container Service, Kubernetes, and Terraform.
  • Used Kubernetes to orchestrate Docker container deployment, scaling, and management.
  • Created JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF), which uses the Cosmos Activity to process the data.
  • Worked on Azure Data Factory and Azure Databricks as part of the EDS transition.
  • Worked on Kafka and Spark streaming to implement real-time streaming
  • Processed data with SAS Enterprise Guide and generated new variables to use as predictors. Created Spark applications that can handle data from various RDBMS (MySQL, OLE DB, etc.)
  • Developed a monthly report using Python to code the payment results of customers and make suggestions to the Manager
  • Migrated an on-premises application to Amazon Web Services, using services such as EC2 and S3 to process and store small data sets, and worked with Hadoop clusters on AWS EMR.
  • In-depth knowledge of Dimensional Modeling, Data Migration, High-Volume Data Loading, Data Cleansing, and other ETL processes.
  • A thorough understanding of data engineering, including data pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning, and advanced data processing.
  • Working experience with NoSQL databases such as Cassandra and HBase, as well as developing real-time read/write access to very large datasets using HBase.
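The Kafka-to-HDFS streaming pattern mentioned above can be outlined roughly as follows. This is a minimal PySpark sketch using Structured Streaming; the broker address, topic name, and HDFS paths are placeholders rather than details from any engagement.

```python
# Minimal PySpark sketch: read a Kafka topic and persist the stream to HDFS.
# Requires the spark-sql-kafka connector package on the classpath.
# Broker address, topic name, and HDFS paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to the Kafka topic; each record arrives as a key/value byte pair.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "clickstream-events")
          .option("startingOffsets", "latest")
          .load()
          .select(col("value").cast("string").alias("payload"),
                  col("timestamp")))

# Write the raw payloads to HDFS as Parquet, checkpointing for fault tolerance.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/clickstream")
         .option("checkpointLocation", "hdfs:///checkpoints/clickstream")
         .outputMode("append")
         .trigger(processingTime="1 minute")
         .start())

query.awaitTermination()
```

The checkpoint location is what allows the job to recover its Kafka offsets and resume cleanly after a restart.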

PROFESSIONAL EXPERIENCE

AWS Data Engineer

Confidential, St Louis, MI

Responsibilities:

  • Created and maintained an optimal data pipeline architecture.
  • Responsible for loading data from the internal server and the Snowflake data warehouse into S3 buckets.
  • Created the infrastructure needed for optimal data extraction, transformation, and loading from a wide range of data sources.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in a Linux-based Hadoop/Hive big data environment.
  • Extracted, transformed, and loaded data from source systems to generate CSV data files using Python programs and SQL queries.
  • Involved in developing Pig Scripts for change data capture and delta record processing between newly arrived data and already existing data in HDFS.
  • Designed and developed Security Framework to provide fine grained access to objects in AWS S3 using AWS Lambda, DynamoDB.
  • Created and wrote aggregation logic on Snowflake data warehouse tables.
  • Built and architected multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in AWS, coordinating tasks among the team.
  • Recreated and maintained existing Access Database artifacts in Snowflake.
  • Used AWS Athena extensively to query structured data in S3, feed downstream systems such as Redshift, and generate reports (see the sketch after this list).
  • Consumed Kafka messages, curated them with Python, and sent the data to multiple targets including Redshift, Athena, and S3 buckets.
  • Used AWS QuickSight for visualization.
  • Used Python libraries such as NumPy, Pandas, Matplotlib, Seaborn, and scikit-learn.
  • Applied data science models including linear regression, logistic regression, k-nearest neighbors, random forest, dummy classifiers, ARIMA, and SARIMA, and used the resulting insights to support business decisions.
  • Worked with the Data Science, Marketing, and Sales teams to develop data pipelines per their needs.
  • Used AWS SageMaker to clean data and run data science models such as classification, ARIMA, and SARIMA.
  • Proficient in SQLite, MySQL and SQL databases with Python.
  • Experienced in working with various Python IDEs, including PyCharm, PyScripter, Spyder, PyStudio, PyDev, IDLE, NetBeans, and Sublime Text.
  • Designed, built, and maintained data integration programs in Hadoop and RDBMS environments, working with both traditional and non-traditional source systems as well as RDBMS and NoSQL data stores for data access and analysis; extensive Python experience, including the creation of a custom ingest framework.
  • Used the Spark API to analyze Hive data on an EMR cluster running Hadoop YARN.
  • Designed AWS CloudFormation templates to create VPCs, subnets, and NAT, ensuring successful deployment of web applications and database templates.
  • Created S3 buckets and managed S3 bucket policies, using S3 and Glacier for storage and backup on AWS.
  • Migrated Hive and MapReduce jobs from an on-premises MapR cluster to the AWS cloud using EMR and Qubole.
  • Expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB, and ElastiCache (Memcached and Redis).
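As an illustration of the Athena usage called out above, here is a minimal boto3 sketch that runs a query over data in S3 and polls for the result; the region, database, table, query, and result bucket are hypothetical.

```python
# Minimal boto3 sketch: run an Athena query over data in S3 and poll for completion.
# Region, database, table, and bucket names are placeholders.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT customer_id, SUM(amount) AS total FROM payments GROUP BY customer_id",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/queries/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```

For larger result sets, the CSV output Athena writes to the result bucket can be loaded into Redshift or read back with Pandas instead of paging through get_query_results.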

Environment: Hadoop YARN, AWS SageMaker, AWS Glue, AWS Athena, Spark, Python, Spark Streaming, Spark SQL, Kubernetes, Oracle, SQL Server, MySQL, HBase, MongoDB, Redshift, DynamoDB

AZURE Engineer

Confidential, Los Angeles, CA

Responsibilities:

  • Analyzed, designed, and developed modern data solutions that enable data visualization using Azure PaaS services.
  • Extracted, transformed, and loaded data from source systems to Azure Data Lake Analytics using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL against Azure Data Lake Storage services.
  • Experienced in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Owned the project lifecycle from analysis to production implementation, with emphasis on data validation, developing logic and transformations per requirements, and creating notebooks to load the data into Delta Lake.
  • Created a Databricks Delta Lake process for real-time data loads from various sources (databases, Adobe, and SAP) to an AWS S3 data lake using Python/PySpark code (see the sketch after this list).
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from a variety of sources, including Azure SQL Database, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
  • Experienced in writing Hive queries to analyze massive sets of structured, unstructured, and semi-structured data.
  • Developed and deployed outcomes using Spark and Scala code on a Hadoop cluster running on GCP.
  • Used advanced Hive techniques such as bucketing, partitioning, and optimizing self-joins to boost performance on structured data.
  • Designed, tested, and deployed the CI/CD framework using Kubernetes and Docker as the runtime environment.
  • Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Owned several end-to-end transformations of customer business analytics problems, breaking them down into a mix of appropriate hardware (IaaS/PaaS/Hybrid) and software (MapReduce) paradigms, and then applying machine learning algorithms to extract useful information from data lakes.
  • On both Cloud and On-Prem hardware, sized and engineered scalable Big Data landscapes with central Hadoop processing platforms and associated technologies including ETL tools and NoSQL databases to support end-to-end business use cases.
  • Conducted numerous Big Data training and demonstration sessions for various government and private-sector customers to ramp them up on Azure Big Data solutions.
  • Developed a number of technology demonstrators using the Confidential Edison Arduino shield, Azure EventHub, and Stream Analytics, and integrated them with PowerBI and Azure ML to demonstrate the capabilities of Azure Stream Analytics.
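A rough PySpark outline of the Databricks Delta Lake load referenced above is shown below; the source and target paths and the table name are placeholders, and the sketch assumes a Databricks runtime where the delta format is available.

```python
# Minimal PySpark sketch of a Delta Lake load (Databricks runtime assumed,
# where the "delta" format is available). Paths and table names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.appName("delta-lake-load").getOrCreate()

# Read raw source extracts (e.g. landed as JSON in S3) and stamp the load time.
raw = (spark.read
       .json("s3://example-landing-zone/adobe/")
       .withColumn("load_ts", current_timestamp()))

# Append into a Delta table location for incremental loads.
(raw.write
    .format("delta")
    .mode("append")
    .save("s3://example-datalake/curated/adobe_events"))

# Register the Delta location so it can be queried with Spark SQL.
spark.sql(
    "CREATE TABLE IF NOT EXISTS adobe_events "
    "USING DELTA LOCATION 's3://example-datalake/curated/adobe_events'"
)
```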

Environment: Azure Data Factory (V2), Azure Databricks, Python 2.0, SSIS, Azure SQL, Azure Data Lake, Azure Blob Storage, Spark 2.0, Hive.

Data Engineer

Confidential

Responsibilities:

  • Ran Spark SQL operations on JSON data, converting it into a tabular structure with DataFrames, and stored and wrote the results to Hive and HDFS (see the sketch after this list).
  • Developed shell scripts for data ingestion and validation with different parameters, as well as custom shell scripts to invoke Spark jobs.
  • Tuned performance of Informatica mappings and sessions for improving the process and making it efficient after eliminating bottlenecks.
  • Worked on complex SQL queries and PL/SQL procedures and converted them to ETL tasks.
  • Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
  • Created risk-based machine learning models (logistic regression, random forest, SVM, etc.) to predict which customers are more likely to become delinquent based on historical performance data, and rank-ordered them.
  • Evaluated model output using a confusion matrix (precision, recall) and worked with Teradata resources and utilities (BTEQ, FastLoad, MultiLoad, FastExport, and TPump).
  • Developed a monthly report using Python to code the payment results of customers and make suggestions to the manager.
  • Ingested and processed Comcast set-top box clickstream events in real time with Spark 2.x, Spark Streaming, Databricks, Apache Storm, Kafka, and Apache Ignite's in-memory data grid (distributed cache).
  • Used various DML and DDL commands for data retrieval and manipulation, such as SELECT, INSERT, UPDATE, subqueries, inner joins, outer joins, UNION, and advanced SQL.
  • Used Informatica PowerCenter 9.6.1 to extract, transform, and load data into a Netezza data warehouse from various sources such as Oracle and flat files.
  • Participated in migrating mappings from IDQ to PowerCenter.
  • Ingested data from a variety of sources, including Kafka, Flume, and TCP sockets.
  • Processed data using advanced algorithms expressed through high-level functions such as map, reduce, join, and window.
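The JSON-to-Hive flow described in the first bullet of this list can be sketched as follows in PySpark; the file paths, database, and column names are illustrative only.

```python
# Minimal PySpark sketch: flatten JSON records into a DataFrame, run Spark SQL,
# and persist the result to a Hive table and HDFS. Paths and names are placeholders,
# and the "analytics" Hive database is assumed to exist.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Read semi-structured JSON; Spark infers a tabular schema from the records.
payments = spark.read.json("hdfs:///data/incoming/payments/*.json")
payments.createOrReplaceTempView("payments_raw")

# Shape the data with Spark SQL before writing it out.
monthly = spark.sql("""
    SELECT customer_id,
           date_format(payment_date, 'yyyy-MM') AS month,
           SUM(amount) AS total_paid
    FROM payments_raw
    GROUP BY customer_id, date_format(payment_date, 'yyyy-MM')
""")

# Persist to a Hive table and also keep a Parquet copy on HDFS.
monthly.write.mode("overwrite").saveAsTable("analytics.monthly_payments")
monthly.write.mode("overwrite").parquet("hdfs:///data/curated/monthly_payments")
```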

Environment: Scala 2.12.8, Python 3.7.2, PySpark, Spark 2.4, Spark MLlib, Spark SQL, Spark Streaming, TensorFlow 1.9, NumPy 1.15.2, Keras 2.2.4, PowerBI, Hive, Kafka, ORC, Avro, Parquet, HBase, HDFS.

Big Data Developer

Confidential

Responsibilities:

  • Developed, improved, and scaled processes, structures, workflows, and best practices for data management and analytics.
  • Experienced in data ingestion, storage, processing, and analysis of big data.
  • Collaborated with product owners to design experiments and methods for measuring the efficacy of product changes.
  • Collaborated with Project Management to provide accurate forecasts, reports, and status.
  • Worked in a fast-paced agile development environment to analyze, create, and evaluate potential business use cases.
  • Hands-on experience with tools such as Pig and Hive for data processing, Sqoop for data ingestion, Oozie for scheduling, and ZooKeeper for cluster resource coordination.
  • Worked on an Apache Spark Scala code base, performing actions and transformations on RDDs, DataFrames, and Datasets using Spark SQL and Spark Streaming contexts (a minimal sketch follows this list).
  • Transferred data between HDFS and relational database systems using Sqoop, including upkeep and troubleshooting.
  • Used the Spring MVC framework to enable interactions between the JSP/view layer, and implemented various design patterns using J2EE and XML technologies.
  • Investigated the use of Spark and Spark-based algorithms to improve the efficiency and optimization of existing Hadoop algorithms.
  • Worked on analyzing Hadoop clusters with various big data analytic tools such as Pig, HBase database, and Sqoop.
  • Worked on NoSQL enterprise development and data loading into HBase with Impala and Sqoop.
  • Executed several MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
  • Built Hadoop solutions for big data problems using MR1 and MR2 (YARN).
  • Evaluated the suitability of Hadoop and its ecosystem for the project, and implemented and validated various proof-of-concept (POC) applications in order to adopt them and benefit from the Big Data Hadoop initiative.
  • Worked closely with malware research and data science teams to enhance malicious-site detection and the machine learning/data mining based big data system.
  • Participated in the entire development life cycle, including requirements review, design, development, implementation, and operations support.
  • Collaborated with engineering team members to investigate and develop novel ideas while sharing expertise.
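As a companion to the Spark bullet above, here is a minimal PySpark outline of the transformation-and-action pattern on RDDs and DataFrames; the original code base was Scala, and the file path, schema, and column names here are placeholders.

```python
# Minimal PySpark sketch of the RDD / DataFrame transformation-and-action pattern.
# The original work was in Scala; this Python outline uses placeholder paths/columns.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rdd-df-transforms").getOrCreate()
sc = spark.sparkContext

# RDD side: lazy transformations (filter, map) followed by an action (count).
lines = sc.textFile("hdfs:///data/logs/access.log")
errors = lines.filter(lambda line: "ERROR" in line).map(lambda line: line.split("\t"))
print("error lines:", errors.count())

# DataFrame side: the same data shaped with Spark SQL functions.
logs = spark.read.option("sep", "\t").csv(
    "hdfs:///data/logs/access.log",
    schema="ts STRING, level STRING, message STRING",
)
summary = (logs.groupBy("level")
           .agg(F.count("*").alias("events"))
           .orderBy(F.desc("events")))
summary.show()
```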

Environment: Hadoop 3.0, Hive 2.1, J2EE, JDBC, Pig 0.16, HBase 1.1, Sqoop, NoSQL, Impala, Java, Spring, MVC, XML, Spark 1.9, PL/SQL, HDFS, JSON, Hibernate, Bootstrap, jQuery, JSP, JavaScript, AJAX, Oracle 10g/11g, MySQL, SQL Server, Teradata, Cassandra
