
AWS Data Engineer Resume

Columbus, OH

PROFESSIONAL SUMMARY:

  • Around 8 years of professional experience as a Data Engineer with expertise in Python, Spark, the Hadoop ecosystem, and cloud services.
  • Experienced in the development, implementation, deployment, and maintenance of complete end-to-end Hadoop-based data analytics solutions using Big Data technologies: HDFS, MapReduce, Spark, Scala, YARN, Kafka, Pig, Hive, Sqoop, Flume, Oozie, Impala, HBase, NiFi, and Ambari.
  • Extensive experience in developing applications that perform data processing tasks using Teradata, Oracle, SQL Server, and MySQL databases.
  • Experience in Extraction, Transformation, and Loading (ETL) of data from various sources into data warehouses, as well as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, and Power BI.
  • Profound knowledge of developing production-ready Spark applications utilizing Spark Core, Spark Streaming, Spark SQL, DataFrames, Datasets, and Spark ML.
  • Profound experience in creating real-time data streaming solutions using PySpark/Spark Streaming and Kafka (a minimal sketch appears after this list).
  • Worked on NoSQL databases including HBase, Cassandra, and MongoDB.
  • Strong Hadoop and platform support experience with the entire suite of tools and services in the major Hadoop distributions: Cloudera, Amazon EMR, Azure HDInsight, and Hortonworks.
  • In-depth Knowledge of Hadoop Architecture and its components such as HDFS, Yarn, Resource Manager, Node Manager, Job History Server, Job Tracker, Task Tracker, Name Node, Data Node, and MapReduce.
  • Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.
  • Experienced in loading data into Hive partitions, creating buckets in Hive, and developing MapReduce jobs to automate data transfer from HBase.
  • Expertise in working with AWS cloud services like EMR, S3, Redshift, Lambda, DynamoDB, RDS, SNS, SQS, Glue, Data Pipeline, and Athena for big data development.
  • Worked with various file formats such as CSV, JSON, XML, ORC, Avro, and Parquet.
  • Worked on data processing, transformations, and actions in Spark using Python (PySpark).
  • Experienced in orchestrating, scheduling, and monitoring jobs with tools like Oozie and Airflow.
  • Extensive experience utilizing Sqoop to ingest information from RDBMS - Oracle, MS SQL Server, Teradata, PostgreSQL, and MySQL.
  • Worked with different ingestion services for batch and real-time data handling utilizing Spark Streaming, Kafka, Storm, Flume, and Sqoop.
  • Expertise in writing DDL and DML scripts in SQL and HQL for analytics applications in RDBMSs.
  • Expertise in Python scripting and shell scripting.
  • Experienced in writing Spark scripts in Python, Scala, and SQL for development and data analysis.
  • Proficient in building PySpark and Scala applications for interactive analysis, batch processing, and stream processing.
  • Involved in all the phases of Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and Agile methodologies.
  • Strong understanding of data modeling and ETL processes in data warehouse environments, such as star schema and snowflake schema.
  • Developed mappings in Informatica to load the data including facts and dimensions from various sources into the Data Warehouse using different transformations like Source Qualifier, JAVA, Expression, Lookup, Aggregate, Update Strategy and Joiner.
  • Strong working knowledge across the technology stack including ETL, data analysis, data cleansing, data matching, data quality, audit, and design.
  • Experienced working with continuous integration and build tools such as Jenkins, and with Git and SVN for version control.
  • Utilized Kubernetes and Docker for the runtime environment for the CI/CD system to build, test, and deploy.
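
The real-time streaming bullet above (PySpark/Spark Streaming with Kafka) is the kind of pipeline sketched below. This is a minimal, hedged example: the broker address, topic name, schema, and S3 paths are illustrative placeholders rather than details from the resume, and it assumes the Spark Kafka connector package is available on the cluster.

```python
# Minimal PySpark Structured Streaming sketch: consume a Kafka topic and land
# micro-batches as Parquet. All names/paths below are placeholders.
# Assumes the spark-sql-kafka connector is available on the cluster.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://example-bucket/events/")            # placeholder sink
    .option("checkpointLocation", "s3a://example-bucket/chk/")  # needed for recovery
    .start()
)
query.awaitTermination()
```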

TECHNICAL SKILLS:

Big Data Ecosystem: HDFS, Spark, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Kafka Connect, Impala, StreamSets, Oozie, Airflow, Zookeeper, Amazon Web Services.

Hadoop Distributions: Apache Hadoop 1.x/2.x, Cloudera CDP, Hortonworks HDP

Languages: Python, Scala, Java, Pig Latin, HiveQL, Shell Scripting.

Software Methodologies: Agile, Waterfall (SDLC).

Databases: MySQL, Oracle, DB2, PostgreSQL, DynamoDB, MS SQL SERVER, Snowflake.

NoSQL: HBase, MongoDB, Cassandra.

ETL/BI: Power BI, Tableau, Informatica.

Version control: GIT, SVN, Bitbucket.

Operating Systems: Windows (XP/7/8/10), Linux (Unix, Ubuntu), Mac OS.

Cloud Technologies: Amazon Web Services (EC2, S3, SQS, SNS, Lambda, EMR, Glue, CodeBuild, CloudWatch); Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB, Azure DevOps, Active Directory).

PROFESSIONAL EXPERIENCE:

Confidential, Columbus, OH

AWS Data Engineer

Responsibilities:

  • Migrated terabytes of data from the data warehouse into the cloud environment in an incremental format.
  • Worked on creating data pipelines with Airflow to schedule PySpark jobs for performing incremental loads and used Flume for weblog server data. Created Airflow Scheduling scripts in Python.
  • Developed Spark jobs on Databricks to perform tasks like data cleansing, data validation, and standardization, and then applied transformations as per the use cases.
  • Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
  • Performed data cleansing and applied transformations using Databricks and Spark for data analysis.
  • Developed spark applications in PySpark and Scala to perform cleansing, transformation, and enrichment of the data.
  • Designed and developed ETL processes in AWS Glue to migrate data from external sources like S3, MySQL, and Parquet files into AWS Redshift.
  • Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Created PySpark Glue jobs to implement data transformation logic in AWS and stored the output in a Redshift cluster (see the sketch after this list).
  • Wrote various data normalization jobs for new data ingested into Redshift.
  • Utilized Spark-Scala API to implement batch processing of jobs.
  • Developed Spark-Streaming applications to consume the data from Snowflake and to insert the processed streams to DynamoDB.
  • Using Spark, performed various transformations and actions; the resulting data was saved back to HDFS and from there loaded into the target database, Snowflake.
  • Utilized Spark's in-memory capabilities to handle large datasets, and used broadcast variables, efficient joins, transformations, and other capabilities for data processing.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
  • Fine-tuning spark applications/jobs to improve the efficiency and overall processing time for the pipelines.
  • Created Hive tables and loaded and analyzed data using Hive scripts.
  • Worked with various formats of files like delimited text files, clickstream log files, Apache log files, Avro files, JSON files, and XML files. Mastered using columnar file formats like ORC and Parquet.
  • Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to run streaming analytics in Databricks.
  • Designed custom-built input adapters using Spark, Hive, and Sqoop to ingest and analyze data (Snowflake, MS SQL, MongoDB) into HDFS.
  • Performed data extraction, aggregation, and consolidation of Adobe data within AWS Glue using PySpark.
  • Created Spark JDBC APIs for importing/exporting data from Snowflake to S3 and vice versa.
  • Automated the data pipeline to ETL all the Datasets along with full loads and incremental loads of data.
  • Utilized AWS services like EMR, S3, the Glue metastore, and Athena extensively for building data applications.
  • Involved in creating Hive external tables to perform ETL on data that is produced on a daily basis.
  • Generated reports and visualizations based on the insights mainly using AWS QuickSight and developed dashboards.
  • Developed various Mappings with the collection of all Sources, Targets, and Transformations using Informatica Designer.
  • Used Jenkins pipelines to drive all microservice builds out to the Docker registry and then deployed to Kubernetes; created and managed Pods using Kubernetes.
  • Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of the Git repositories and the access control strategies.
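
The Glue bullet above references PySpark Glue jobs that transform data and write to Redshift; below is a hedged sketch of that shape of job. It assumes the AWS Glue job environment (which provides the awsglue library), and the catalog database, table, connection name, target table, and S3 staging path are illustrative placeholders, not details from the project.

```python
# Hedged AWS Glue (PySpark) job sketch: read a catalogued S3 table, apply a
# simple transformation, and write the result to Redshift. All names are placeholders.
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source: a table registered in the Glue Data Catalog (placeholder names).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders"
).toDF()

# Example transformation: keep completed orders and rename a column.
cleaned = (
    orders.filter(col("status") == "COMPLETED")
          .withColumnRenamed("order_ts", "order_timestamp")
)

# Sink: Redshift through a Glue connection (placeholder connection/table).
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=DynamicFrame.fromDF(cleaned, glue_context, "cleaned"),
    catalog_connection="redshift-conn",
    connection_options={"dbtable": "analytics.orders", "database": "dev"},
    redshift_tmp_dir="s3://example-bucket/tmp/",  # staging area used by the Redshift COPY
)
job.commit()
```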

Environment: Python, Redshift, SQL, Oracle, Hive, Scala, Power BI, Docker, Athena, AWS Glue, MongoDB, Kubernetes, SQS, PySpark, Kafka, Data Warehouse, Big Data, MS SQL, Hadoop, Airflow, Spark, EMR, EC2, S3, Git, Lambda, ETL, Databricks, Snowflake, AWS QuickSight, AWS Data Pipeline.

Confidential, Cary, NC

Data Engineer

Responsibilities:

  • Hands-on experience in installing, configuring, and using Hadoop ecosystem components like Hadoop MapReduce, HDFS, HBase, Hive, Spark, Sqoop, Pig, Zookeeper, and Flume.
  • Designed and developed the data warehouse and business intelligence architecture; designed the ETL process from various sources into Hadoop/HDFS for analysis and further processing of data modules.
  • Designed and created Azure Data Factory (ADF) pipelines extensively for ingesting data from different source systems, both relational and non-relational, to meet business functional requirements.
  • Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and Azure Data Lake Analytics.
  • Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Created and provisioned numerous Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters.
  • Developed ADF pipelines to load data from on-premises systems to Azure cloud storage and databases.
  • Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks (a minimal sketch follows this list).
  • Worked extensively with SparkContext, Spark SQL, RDD transformations, actions, and DataFrames.
  • Developed custom ETL solutions and batch-processing and real-time data ingestion pipelines to move data in and out of Hadoop using PySpark and shell scripting.
  • Ingested gigantic volume and variety of data from disparate source systems into Azure Data Lake Gen2 using Azure Data Factory V2 by using Azure Cluster services.
  • Led the installation, integration, and configuration of Jenkins CI/CD, including installation of Jenkins plugins.
  • Implemented a CI/CD pipeline with Docker, Jenkins, and GitHub, virtualizing the servers with Docker for the Dev and Test environments and meeting requirements by configuring automation through containerization.
  • Installing, configuring, and administering Jenkins CI tool using Chef on AWS EC2 instances.
  • Worked on AWS Data Pipeline to configure data loads from S3 to Redshift.
  • Performed Code Reviews and responsible for Design, Code, and Test signoff.
  • Worked on designing and developing the real-time tax computation engine using Oracle, StreamSets, Kafka, and Spark Structured Streaming.
  • Validated data transformations and performed End-to-End data validations for ETL workflows loading data from XMLs to EDW.
  • Extensively utilized Informatica to create the complete ETL process and load data into the database used by Reporting Services.
  • Created Tidal Job events to schedule the ETL extract workflows and to modify the tier point notifications.
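
The ADF/Databricks bullet above mentions PySpark transformations on data landed by Data Factory; the sketch below illustrates that pattern. It is hedged: the storage account, container, file layout, column names, and target table are placeholders, and it assumes the cluster already has credentials configured for the ADLS Gen2 account (on Databricks the SparkSession is normally provided as spark).

```python
# Hedged Databricks/PySpark sketch: read raw CSVs landed by ADF in ADLS Gen2,
# clean them, and write a curated Delta table. All names/paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("adls-curation-sketch").getOrCreate()

# Placeholder ADLS Gen2 location (assumes access is already configured).
raw_path = "abfss://raw@examplestorage.dfs.core.windows.net/sales/"

raw = spark.read.option("header", "true").csv(raw_path)

curated = (
    raw.withColumn("sale_date", to_date(col("sale_date"), "yyyy-MM-dd"))
       .withColumn("amount", col("amount").cast("double"))
       .dropDuplicates(["sale_id"])
)

(curated.write
    .format("delta")               # Delta Lake is the usual table format on Databricks
    .mode("overwrite")
    .saveAsTable("curated.sales")) # placeholder schema.table
```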

Environment: Python, SQL, Oracle, Hive, Scala, Power BI, Azure Data Factory, Data Lake, Docker, MongoDB, Kubernetes, PySpark, SNS, Kafka, Data Warehouse, Sqoop, Pig, Zookeeper, Flume, Hadoop, Airflow, Spark, EMR, EC2, S3, Git, GCP, Lambda, Glue, ETL, Databricks, Snowflake, AWS Data Pipeline.

Confidential, Atlanta, GA

Big Data Developer

Responsibilities:

  • Developed MapReduce jobs in both Pig and Hive for data cleaning and pre-processing.
  • Imported Legacy data from SQL Server and Teradata into Amazon S3.
  • Created consumption views on top of metrics to reduce the running time for complex queries.
  • Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
  • Compared data at the leaf level across various databases when data transformation or loading took place.
  • Wrote Spark applications for data validation, cleansing, transformation, and custom aggregation; used the Spark engine and Spark SQL for data analysis and provided the results to data scientists for further analysis (see the sketch after this list).
  • Profiled structured, unstructured, and semi-structured data across various sources to identify patterns and implemented data quality metrics using queries or Python scripts appropriate to each source.
  • Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
  • Closely involved in scheduling Daily, Monthly jobs with Precondition/Post condition based on the requirement.
  • Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
  • Worked on analyzing the Hadoop cluster and different big data analytics tools, and with various HDFS file formats like Avro, SequenceFile, and JSON.
  • Working experience with data streaming processes using Kafka, Apache Spark, and Hive.
  • Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
  • Developed Spark scripts using Scala shell commands as per the requirements.
  • Executed Hadoop/Spark jobs on AWS EMR with data stored in S3 buckets.
  • Implemented workflows using the Apache Oozie framework to automate tasks.
  • Implemented Spark RDD transformations and actions to carry out business analysis.
  • Interacted with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
  • Worked on ingesting high volumes of tuning events generated by client set-top boxes, from Elasticsearch in batch mode and from Amazon Kinesis in real time via Kafka brokers, into the enterprise data lake using Python and NiFi.
  • Devised PL/SQL Stored Procedures, Functions, Triggers, Views, and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
  • Developed data pipelines with NiFi that can consume real-time data in any format from a Kafka topic and push it into the enterprise Hadoop environment.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on the data received from Kafka.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
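
The validation/aggregation bullet above describes Spark applications that cleanse and aggregate batch data; the following is a hedged sketch of that pattern. The S3 paths, column names, and validation rules are illustrative placeholders rather than details from the project.

```python
# Hedged PySpark batch sketch: load raw JSON from S3, separate invalid rows,
# and aggregate the clean records with Spark SQL. All names/paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("batch-validation-sketch").getOrCreate()

txns = spark.read.json("s3a://example-bucket/raw/transactions/")  # placeholder path

# Basic validation: non-null key and non-negative amount.
valid = txns.filter(col("txn_id").isNotNull() & (col("amount") >= 0))
rejected = txns.subtract(valid)

valid.createOrReplaceTempView("valid_txns")
daily_totals = spark.sql("""
    SELECT txn_date, COUNT(*) AS txn_count, SUM(amount) AS total_amount
    FROM valid_txns
    GROUP BY txn_date
""")

daily_totals.write.mode("overwrite").parquet("s3a://example-bucket/curated/daily_totals/")
rejected.write.mode("overwrite").json("s3a://example-bucket/errors/transactions/")
```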

Environment: HDFS, MapReduce, Snowflake, Pig, NiFi, Hive, Kafka, Spark, PL/SQL, AWS, S3 Buckets, EMR, Scala, SQL Server, Cassandra, Oozie.

Confidential

Big Data Engineer

Responsibilities:

  • Developed a Spark Streaming model that takes transactional data as input from multiple sources, creates batches, and later processes them against an already trained fraud detection model, separating out error records.
  • Extensive knowledge in Data transformations, Mapping, Cleansing, Monitoring, Debugging, performance tuning and troubleshooting Hadoop clusters.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python, and Scala (see the sketch after this list).
  • Developed DDL and DML scripts in SQL and HQL for creating tables and analyzing data in RDBMSs and Hive.
  • Used Sqoop to import and export data between HDFS and RDBMSs.
  • Created Hive tables and involved in data loading and writing Hive UDFs.
  • Exported the analyzed data to the relational database MySQL using Sqoop for visualization and to generate reports.
  • Loaded the flat files data using Informatica to the staging area.
  • Researched and recommended suitable technology stack for Hadoop migration considering current enterprise architecture.
  • Worked on ETL process to clean and load large data extracted from several websites (JSON/ CSV files) to the SQL server.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
  • Collecting and aggregating large amounts of log data and staging data in HDFS for further analysis. Used Sqoop to transfer data between relational databases and Hadoop.
  • Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
  • Queried both managed and external Hive tables using Impala. Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
  • Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive, and then loading the data into HDFS.
  • Analyzed data stored in S3 buckets using SQL and PySpark, stored the processed data in Redshift, and validated data sets by implementing Spark components.
  • Worked as an ETL developer and Tableau developer, heavily involved in designing, developing, and debugging ETL mappings using the Informatica Designer tool, and created advanced chart types, visualizations, and complex calculations to manipulate the data using Tableau Desktop.
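
The Hive-to-Spark bullet above refers to rewriting Hive/SQL queries as Spark transformations; below is a hedged illustration of that conversion. The database, table, and column names are placeholders, and the example uses the DataFrame API rather than raw RDDs.

```python
# Hedged sketch: a HiveQL aggregation rewritten as PySpark DataFrame transformations.
# Table/column names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count

spark = (SparkSession.builder
         .appName("hive-to-spark-sketch")
         .enableHiveSupport()   # lets Spark read tables from the Hive metastore
         .getOrCreate())

# HiveQL version being converted:
#   SELECT region, COUNT(*) AS orders, AVG(amount) AS avg_amount
#   FROM sales.orders WHERE status = 'SHIPPED' GROUP BY region;
orders = spark.table("sales.orders")

summary = (
    orders.filter(orders.status == "SHIPPED")
          .groupBy("region")
          .agg(count("*").alias("orders"), avg("amount").alias("avg_amount"))
)

summary.show()
```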

Environment: Spark, Hive, Python, HDFS, Sqoop, Tableau, HBase, Scala, MySQL, Impala, AWS, S3, EC2, Redshift, Informatica
