
Aws Data Engineer Resume


Ohio

SUMMARY

  • 8+ years of experience in the IT industry with a major focus on Big Data and cloud-related technologies.
  • Integration experience using SQL and Big Data technologies as well as Java/J2EE technologies with AWS and Azure.
  • Expertise in Hadoop (HDFS, MapReduce, YARN, Hive, Pig, HBase, Zookeeper, Oozie, and Sqoop), Spark (Spark Core, Spark SQL, Spark Streaming), and AWS services (Redshift, EMR, EC2, S3, CloudWatch, Lambda, Step Functions, Glue, and Athena).
  • Experience in developing complex frameworks for data ingestion, data processing, data cleansing, and analytics using Spark.
  • Experience in designing, building, and maintaining ETL jobs and data pipelines to integrate data from different sources like Kafka, S3, SFTP servers, and RDBMS, and multiple data stores like HBase, Hive, Athena, DynamoDB, etc.
  • Good experience in designing and developing data marts and data warehouses for the end users.
  • Experience in building streaming data pipelines using Kafka to read real-time data (see the sketch after this list), and in data curation using Spark.
  • Hands-on experience in implementing, automating, and integrating Big Data infrastructure resources like S3, Redshift, Aurora, Kinesis, Kafka, EMR, Lambda, SNS, Azure Blob Storage accounts, Azure SQL Data Warehouse, Azure Event Hubs, HDInsight, Azure Databricks, Azure Functions, Event Grid, and Data Lake Analytics in an ephemeral/transient and elastic manner.
  • Hands-on experience with core operating systems like Linux (RHEL, Ubuntu) and system administration tasks, including shell scripting.
  • Experience with container technologies like Docker and Kubernetes.
  • Designing, architecting, and developing solutions leveraging cloud Big Data technology to ingest, process, and analyze large, disparate data sets to exceed business requirements.
  • End-to-end data warehouse development and implementation experience in on-premises and/or cloud environments (Azure, AWS) using MSBI or IBM tools.
  • Experience in preparing technical/solution architecture, presenting it to the CIO/CTO of the organization, and getting buy-in from the client teams.
  • Hands-on experience with the design and implementation of enterprise-scale data warehouses, analytics, and reporting solutions.
  • Experience in Hadoop Ecosystem components like Hive, HDFS, Sqoop, Spark, Kafka, Pig.
  • Experience in architecting, designing, installing, configuring, and managing Apache Hadoop clusters and MapReduce on the Hortonworks and Cloudera Hadoop distributions.
  • Good understanding of Hadoop architecture and Hands on experience with Hadoop components such as Resource Manager, Node Manager, Name Node, Data Node and Map Reduce concepts and HDFS Framework.
  • Expertise in Data Migration, Data Profiling, Data Ingestion, Data Cleansing, Transformation, Data Import, and Data Export with multiple ETL tools such as Informatica PowerCenter.
  • Working knowledge of the Spark RDD, DataFrame, Dataset, and Data Source APIs.
  • Experience in Azure Data Factory, Integration Runtime (IR), file data ingestion, and relational data ingestion.
  • Developed Spark code using Scala, Python and Spark-SQL/Streaming for faster processing of data.
  • Implemented Spark Streaming jobs in Scala by developing RDD's (Resilient Distributed Datasets) and used Pyspark and spark-shell accordingly.
  • Profound experience in creating real time data streaming solutions using Apache Spark /Spark Streaming, Kafka and Flume.
  • Good knowledge of using Apache NiFi to automate the data movement between different Hadoop Systems.
  • Worked with various file formats like delimited text files, JSON files, and XML files. Mastered the use of different columnar file formats like RC, ORC, and Parquet, with a good understanding of various compression techniques used in Hadoop processing like Gzip, Snappy, LZO, etc.
  • Hands on experience building enterprise applications utilizing Java, J2EE, Spring, Hibernate, JSF, JMS, XML, EJB, JSP, Servlets, JSON, JNDI, HTML, CSS and JavaScript, SQL, PL/SQL.
  • Experienced in Software Development Lifecycle (SDLC) using SCRUM, Agile methodologies.
  • Knowledge in Data mining and Data warehousing using ETL Tools and Proficient in Building reports and dashboards in Tableau (BI Tool).
  • Excellent knowledge of job workflow scheduling and locking tools/services like Oozie and Zookeeper.
  • Good understanding and knowledge of NoSQL databases like HBase and Cassandra.
  • Good understanding of Amazon Web Services (AWS) like EC2 for computing and S3 as storage mechanism and EMR, Step functions, Lambda, Redshift, DynamoDB.
  • Good understanding and knowledge of Microsoft Azure services like HDInsight Clusters, BLOB, ADLS, Data Factory and Logic Apps.
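
A minimal sketch of the Kafka streaming ingestion pattern referenced above, assuming a hypothetical topic (claims-events), broker address, event schema, and S3 landing paths; it is illustrative only, not a production configuration, and requires the spark-sql-kafka connector on the classpath.

# Minimal Structured Streaming sketch: Kafka -> parsed JSON -> Parquet on S3.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-to-s3-stream").getOrCreate()

# Illustrative schema of the incoming JSON events (hypothetical fields).
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

# Read the raw Kafka stream; the value column arrives as bytes.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "claims-events")
       .load())

# Parse the JSON payload into typed columns.
events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), event_schema).alias("e"))
          .select("e.*"))

# Land the curated stream on S3 as Parquet, with checkpointing for recovery.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-datalake/curated/claims_events/")
         .option("checkpointLocation", "s3a://example-datalake/checkpoints/claims_events/")
         .outputMode("append")
         .start())

query.awaitTermination()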

TECHNICAL SKILLS

Hadoop Components / Big Data: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala, Zookeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos, PySpark

Reporting and ETL Tools: Tableau, Power BI, Talend, AWS Glue.

Databases: Oracle, SQL Server, MySQL, MS Access, NoSQL databases (HBase, Cassandra, MongoDB)

Languages: Scala, SQL, Python, Hive QL, KSQL, Java

IDE Tools: Eclipse, IntelliJ, PyCharm.

Cloud platform: AWS, Azure

Containerization: Docker, Kubernetes

CI/CD Tools: Jenkins, Bamboo, GitLab

Big Data Technologies: Hadoop, HDFS, Hive, Pig, Oozie, Sqoop, Spark, Machine Learning, Pandas, NumPy, Seaborn, Impala, Zookeeper, Flume, Airflow, Informatica, Snowflake, Databricks, Kafka, Cloudera

Operating Systems: UNIX, LINUX, Ubuntu, CentOS.

Software Methodologies: Agile, Scrum, Waterfall

PROFESSIONAL EXPERIENCE

Confidential, Ohio

AWS Data Engineer

Responsibilities:

  • Designed and developed scalable and cost-effective architecture in AWS Big Data services for data life cycle of collection, ingestion, storage, processing, and visualization.
  • Developed a PySpark data ingestion framework to ingest source claims data into Hive tables by performing data cleansing and aggregations and applying de-dup logic to identify updated and latest records (see the sketch after this list).
  • Involved in creating an end-to-end data pipeline within a distributed environment using big data tools, the Spark framework, and Tableau for data visualization.
  • Worked on developing CloudFormation templates (CFTs) for migrating infrastructure from lower environments to higher environments.
  • Leveraged Spark features such as in-memory processing, distributed cache, broadcast variables, accumulators, and map-side joins to implement data preprocessing pipelines with minimal latency.
  • Experience in creating python topology script to generate cloud formation template for creating the EMR cluster in AWS.
  • Experience in using the AWS services Athena, Redshift and Glue ETL jobs.
  • Involved in loading data from AWS S3 to Snowflake and processed data for further analysis.
  • Developed analytical dashboards in Snowflake and shared data with downstream consumers.
  • Worked on building data-centric queries for cost optimization in Snowflake.
  • Good knowledge on AWS Services like EC2, EMR, S3, Service Catalog, and Cloud Watch.
  • Experience in using Spark SQL to handle structured data from Hive on the AWS EMR platform.
  • Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
  • Experienced in handling large datasets using partitions, Spark in-memory capabilities, broadcasts in Spark, effective and efficient joins, transformations, and other optimizations during the ingestion process itself.
  • Wrote unit test cases for Spark code as part of the CI/CD process.
  • Good knowledge of configuration management tools like Bitbucket/GitHub and Bamboo (CI/CD).
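
A hedged sketch of the de-dup ingestion step called out above: cleanse, keep the latest record per business key, and write to a partitioned Hive table. The S3 path, column names (claim_id, last_updated_ts), and table name are placeholders, not the client's actual schema.

# PySpark ingestion sketch: cleanse, de-dup on the latest record, load to Hive.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, current_date, row_number, trim

spark = (SparkSession.builder
         .appName("claims-ingestion")
         .enableHiveSupport()
         .getOrCreate())

raw_claims = spark.read.parquet("s3a://example-bucket/raw/claims/")

# Basic cleansing: trim the business key and drop rows that are missing it.
cleansed = (raw_claims
            .withColumn("claim_id", trim(col("claim_id")))
            .na.drop(subset=["claim_id"]))

# De-dup: keep only the latest record per claim_id by update timestamp.
latest_first = Window.partitionBy("claim_id").orderBy(col("last_updated_ts").desc())
deduped = (cleansed
           .withColumn("rn", row_number().over(latest_first))
           .filter(col("rn") == 1)
           .drop("rn")
           .withColumn("load_date", current_date()))

# Write the curated records to a partitioned Hive table.
spark.sql("CREATE DATABASE IF NOT EXISTS curated")
(deduped.write
 .mode("overwrite")
 .format("parquet")
 .partitionBy("load_date")
 .saveAsTable("curated.claims"))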

Environment: AWS EMR, S3, Lambda, Step Functions, CFTs, SNS, SQS, Snowflake, Kinesis, CloudWatch, EMRFS, Linux, Hive, PySpark, Python, Tableau.

Confidential, Tampa, Florida

Azure Data Engineer

Responsibilities:

  • Data lake ingestion for Global Data Analytics Platform (GDAP) Merchandise Domain.
  • Built ETL pipelines that bring in and transform huge volumes of data from different source systems using Python data ingestion pipelines.
  • Understand the data model document and analyze to identify the source systems and transformations required to ingest the data into Data Lake.
  • Design and develop data pipeline using the YAML scripts and AORTA framework.
  • Code development to extract data from different sources like Oracle, Teradata, DB2, and SQL Server using Sqoop and AORTA.
  • Hands on with Git/GitHub for code check-ins/checkouts and branching etc.
  • Utilized Azure HDInsight to monitor and manage the Hadoop Cluster.
  • Developed Spark scripts using Python on Azure HDInsight for Data Aggregation, Validation and verified its performance over MR jobs.
  • Designed and built a Data Discovery Platform for a large system integrator using Azure HDInsight components. Used Azure Data Factory and Data Catalog to ingest and maintain data sources. Security on HDInsight was enabled using Azure Active Directory.
  • Development of Python Data Ingestion Pipeline to integrate the ETL components.
  • Development of workflows using the Automic automation tool and its shell scripting capabilities.
  • Re-designed and developed a critical ingestion pipeline to process over 5TB of data/day.
  • Optimized data sets by partitioning and converting files from text to ORC, applying map-side joins wherever necessary (see the sketch after this list).
  • Explored the PySpark framework on Azure Databricks to improve the performance and optimization of the existing algorithms in Hadoop using the PySpark Core, Spark SQL, and Spark Streaming APIs.
  • Design/Automate data synchronize process from Prod Hadoop cluster to Dev using Shell Scripting.
  • Development of validation queries to certify data loaded successfully into Data Lake comparing source and target systems.
  • Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data between different sources like Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back.
  • Performed performance tuning to optimize the run time of jobs dealing with huge data, terabytes in size.
  • Performed data analysis on existing ETL DataStage jobs and re-implemented them in a distributed computing framework like Hadoop.
  • Developed automation scripts using Python to transfer data from on-premises clusters to Google Cloud Platform (GCP).
  • Loaded file data from the ADLS server to Google Cloud Platform buckets and created Hive tables for the end users.
  • Implemented massive transformations and scheduling on Azure Databricks for advanced data analytics and provided data to downstream applications.
  • Created from scratch a new continuous integration stack based on Docker and Jenkins, allowing an easy and seamless transition from dev stations to test servers.
  • Reduced build & deployment times by designing and implementing Docker workflow.
  • Used the Maven dependency management system to deploy snapshot and release artifacts to Nexus to share artifacts across projects. Wrote container code for Docker, Docker Swarm, and Kubernetes, and worked with other departments to requisition and configure new Azure and/or on-premises infrastructure.
  • Broad understanding of tools and technologies with a preference for Azure, Jenkins, Bitbucket/Git, and Kubernetes/Helm.
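
A hedged sketch of the text-to-ORC optimization and map-side (broadcast) join mentioned above, as it might look on Azure Databricks; the ADLS path, the reference.store_dim dimension table, and the column names are assumed placeholders.

# PySpark sketch: delimited text -> broadcast join -> partitioned ORC Hive table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = (SparkSession.builder
         .appName("text-to-orc")
         .enableHiveSupport()
         .getOrCreate())

# Source files are assumed to be pipe-delimited text with a header row.
sales = (spark.read
         .option("sep", "|")
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("abfss://raw@exampleadls.dfs.core.windows.net/sales/"))

# Small dimension table (assumed to already exist) used for a map-side join.
stores = spark.table("reference.store_dim")
enriched = sales.join(broadcast(stores), on="store_id", how="left")

# Persist as ORC, partitioned by business date, so downstream queries prune partitions.
spark.sql("CREATE DATABASE IF NOT EXISTS curated")
(enriched.write
 .mode("overwrite")
 .format("orc")
 .partitionBy("business_date")
 .saveAsTable("curated.sales_orc"))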

Environment: Oracle, SQL Server, Python, ETL, HDFS, Sqoop, Hive, Spark, Scala, Google Cloud Platform, Java, Automic, Hadoop, DB2, AORTA, Shell Scripting

Confidential, Greenwood Village, CO

Data Engineer

Responsibilities:

  • Developed PySpark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
  • Developed a data set process for data mining and data modeling and recommended ways to improve data quality, efficiency, and reliability.
  • Built data pipelines for reporting, alerting, and data mining. Experienced with table design and data management using HDFS, Hive, Impala, Sqoop, MySQL, and Kafka.
  • Worked on Apache NiFi to automate the data movement between RDBMS and HDFS.
  • Created shell scripts to handle various jobs like Map Reduce, Hive, Pig, Spark etc., based on the requirement.
  • Experience with Spark SQL for processing large amounts of structured data.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Handled importing of data from various data sources, performed data control checks using PySpark and loaded data into HDFS.
  • Involved in converting Hive/SQL queries into PySpark transformations using Spark RDD, python.
  • Used PySpark SQL to load JSON data, create schema RDDs, and load them into Hive tables, and handled structured data using Spark SQL.
  • Developed PySpark Programs using python and performed transformations and actions on RDD's.
  • Imported data from AWS S3 into Spark RDD, Performed transformations and actions on RDD's.
  • Developed JSON scripts for deploying pipelines in Azure Data Factory that process the data using the SQL activity.
  • Used PySpark and Spark SQL to read the Parquet data and create the tables in Hive using the Python API (see the sketch after this list).
  • Implemented PySpark using python and utilizing Data frames and PySpark SQL API for faster processing of data.
  • Developed python scripts, UDFs using both Data frames/SQL/Data sets and RDD/Map Reduce in Spark 1.6 for data Aggregation, queries and writing data back into OLTP system through Sqoop.
  • Experienced in handling large datasets using partitions, PySpark in-memory capabilities, broadcasts in PySpark, effective and efficient joins, transformations, and other optimizations during the ingestion process itself.
  • Processed schema-oriented and non-schema-oriented data using Python and Spark.
  • Involved in writing live Real-time Processing and core jobs using Spark Streaming with Kafka as a data pipe-line system.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in HDFS.
  • Worked on streaming pipeline that uses PySpark to read data from Kafka, transform it and write it to HDFS.
  • Designed Data Marts by following Star Schema and Snowflake Schema Methodology, using industry leading Data Modeling tools.
  • Worked on Snowflake database queries and wrote stored procedures for normalization.
  • Worked with Snowflake stored procedures, used procedures with corresponding DDL statements, and used the JavaScript API to easily wrap and execute numerous SQL queries.
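
A minimal sketch of the JSON/Parquet loading pattern described in the bullets above, registering the data as Hive tables through the Python API and Spark SQL; paths, databases, and table names are illustrative placeholders.

# PySpark sketch: load JSON and Parquet sources and expose them as Hive tables.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-parquet-to-hive")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

# JSON source: schema is inferred, then the DataFrame is saved as a Hive table.
events = spark.read.json("hdfs:///data/raw/events/")
events.write.mode("overwrite").saveAsTable("analytics.events")

# Parquet source: read it, register a temp view, and create a table via Spark SQL.
orders = spark.read.parquet("s3a://example-bucket/raw/orders/")
orders.createOrReplaceTempView("orders_stg")

spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics.orders
    USING PARQUET
    AS SELECT * FROM orders_stg
""")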

Environment: Cloudera (CDH3), AWS, Snowflake, HDFS, Pig 0.15.0, Hive 2.2.0, Kafka, Sqoop, Shell Scripting, Spark 1.8, Linux (CentOS), MapReduce, Python 2, Eclipse 4.6.

Confidential, New York, NY

Big Data/Hadoop Developer

Responsibilities:

  • Designed cutting edge technical solutions for client needs, after evaluating options, including cloud-based solutions
  • Designed a cost-effective archival platform for storing big data using Hadoop and its related technologies.
  • Provided technical architecture services to client, typically in the context of solutions that have been defined
  • Created a Data Lake by extracting customer data from various data sources into HDFS.
  • This includes data from Teradata, Mainframes, RDBMS, CSV, and Excel.
  • Worked on all aspects of data mining, data collection, data cleaning, model development, data validation, and data visualization.
  • Worked on the DAX programming foundation within Power BI and Python for ETL.
  • Worked with Azure Blob and Data Lake storage and loaded data into Azure Synapse Analytics (SQL DW).
  • When working with the open-source Apache distribution, Hadoop admins must manually set up all the configuration files (core-site, hdfs-site, yarn-site, and mapred-site). However, when working with a popular Hadoop distribution like Hortonworks, Cloudera, or MapR, the configuration files are set up on startup and the Hadoop admin need not configure them manually.
  • Used Sqoop to import data from Relational Databases like MySQL, Oracle.
  • Involved in importing structured and unstructured data into HDFS.
  • Responsible for fetching real-time data using Kafka and processing using Spark and Scala.
  • Worked on Kafka to import real-time weblogs and ingested the data to Spark Streaming.
  • Developed business logic using Kafka Direct Stream in Spark Streaming and implemented business transformations.
  • Worked on Building and implementing real-time streaming ETL pipeline using Kafka Streams API.
  • Implemented Hive partitioning and bucketing on the collected data in HDFS (see the sketch after this list).
  • Implemented Sqoop jobs for large data exchanges between RDBMS and Hive clusters.
  • Extensively used Zookeeper as a backup server and job scheduler for Spark jobs.
  • Developed Spark scripts using Scala shell commands as per the business requirement.
  • Worked on Cloudera distribution and deployed on AWS EC2 Instances.
  • Experienced in loading the real-time data to a NoSQL database like Cassandra.
  • Experience in retrieving the data present in Cassandra cluster by running queries in CQL (Cassandra Query Language).
  • Worked on connecting Cassandra database to the Amazon EMR File System for storing the database in S3.
  • Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Deployed the project on Amazon EMR with S3 connectivity for setting a backup storage.
  • Well versed in using Elastic Load Balancer for Auto scaling in EC2 servers.
  • Coordinated with the SCRUM team in delivering agreed user stories on time for every sprint.
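
A hedged sketch of laying out the collected weblog data with partitioning and bucketing, written with Spark's DataFrame writer (which produces Spark-style bucketing rather than Hive DDL); the HDFS path, database, and column names are hypothetical.

# PySpark sketch: partition weblogs by event date and bucket by user_id.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = (SparkSession.builder
         .appName("weblog-layout")
         .enableHiveSupport()
         .getOrCreate())

# Raw weblogs landed on HDFS as JSON (assumed to contain user_id and event_ts).
weblogs = spark.read.json("hdfs:///data/raw/weblogs/")

spark.sql("CREATE DATABASE IF NOT EXISTS weblogs")

# Partitioning prunes whole days; bucketing speeds up joins and point lookups.
(weblogs
 .withColumn("event_date", to_date(col("event_ts")))
 .write
 .mode("overwrite")
 .partitionBy("event_date")
 .bucketBy(32, "user_id")
 .sortBy("user_id")
 .saveAsTable("weblogs.page_views"))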

Environment: Hadoop, Map Reduce, Hive, Spark, Oracle, GitHub, Tableau, UNIX, Cloudera, Kafka, Sqoop, Scala, NIFI, HBase, Amazon EC2, S3.

Confidential

ETL Developer

Responsibilities:

  • Performed data cleaning, filtering, and transformation to develop new data insights.
  • Practiced database query languages and technologies (Oracle, SQL, Python) to retrieve data.
  • Gathered business needs for data insights and analysis, created supportive visualizations, and prepared data together with data engineers and architects.
  • Provided platform and infrastructure support, including cluster administration and tuning.
  • Acted as liaison between Treasury lines of business, the technology team, and other Data Management analysts to communicate control status and escalate issues according to the defined process.
  • Created and managed big data pipelines, including Pig/MapReduce/Hive.
  • Installed and configured Hadoop components on multiple clusters.
  • Worked collaboratively with users and application teams to optimize query and cluster performance
  • Analyzed business requirements, transformed data and mapped source data using the Teradata Financial Services Logical Data Model Tool, from the source system to the Teradata Physical Data Model.
  • Implemented a historical purge process for Clickstream, Order Broker, and Link Tracker to Teradata using Ab Initio.
  • Implemented the centralized graphs concept.
  • Extensively used Ab Initio components like Reformat, Rollup, Lookup, Join, and Redefine Format, and developed many subgraphs.

Environment: Ab Initio, Oracle, Database, Clickstream, Reformat, Rollup, Lookup, UNIX, and Extract-Replicate.
