AWS Data Engineer Resume
Brookfield, WI
PROFESSIONAL SUMMARY:
- 7+ years of combined experience as a Data Engineer, Hadoop Developer, and ETL Developer, with expertise in data modeling, analytical processing, and deployment for enterprise-level applications.
- Excellent understanding of Hadoop architecture and ecosystem components such as HDFS, MapReduce, YARN, Hive, Pig, Sqoop, Spark, Flume, Zookeeper, Hue, Kafka, and Impala.
- Worked with Spark Context, Spark SQL, pair RDDs, and Spark on YARN to improve the performance and optimization of existing Hadoop algorithms. Good command of Spark components such as Spark SQL, MLlib, and Spark Streaming.
- Worked on data processing, transformations, and actions using PySpark.
- Experience with Microsoft Azure cloud services such as Azure Data Factory (ADF), Azure Data Lake Storage (ADLS), Azure Synapse Analytics (SQL Data Warehouse), Azure SQL Database, Azure Analysis Services, Cosmos DB, Azure Key Vault, and Azure HDInsight.
- Experience using Azure Data Factory to create data pipelines and schedule data-driven workflows that ingest data from disparate data stores.
- Experience in various AWS services such as EC2, S3, Redshift, DynamoDB, Glue, Lambda, Step Functions, API Gateway, Kinesis, ELB, EBS, RDS, IAM, EFS, CloudFormation, Route 53, CloudWatch, SQS, SNS
- Designed and developed ETL processes in AWS Glue to migrate campaign data from internal sources such as SFTP servers into Amazon Redshift.
- Hands-on experience with Hadoop file formats such as SequenceFile, ORC, and Parquet, as well as plain Text/CSV and JSON files.
- Proficient in SQL for querying, data extraction/transformation, and developing queries for a wide range of applications. Experience in database performance tuning, data modeling, and query optimization.
- Developed PySpark code in Databricks to validate data and load it into a Snowflake database (a sketch follows this summary).
- Created Spark clusters and configured high-concurrency clusters in Azure Databricks to speed up the preparation of high-quality data.
- Expertise in writing MapReduce jobs in Python to process large sets of structured, semi-structured, and unstructured data and store them in HDFS.
- Worked on data cleaning and reshaping, and generated segmented subsets using NumPy and Pandas in Python.
- Experienced with version control tools like Git and Gitlab, project management tools such as JIRA, and various software development methodologies.
- Worked with SQL databases such as MySQL and Oracle to design databases, define their structure, and execute various functions on them.
- Experience in creating ad-hoc queries to move data from HDFS to Hive and analyze it using HiveQL.
- Worked on NoSQL databases such as HBase, Cassandra, and MongoDB.
- Expertise in Data warehousing concepts, Dimensional Modeling, Data Modeling, OLAP and OLTP systems.
- Scheduled Spark and Scala jobs in a Hadoop cluster using Oozie workflows and generated detailed design documentation for source-to-target transformations.
- Experience configuring Zookeeper to maintain data consistency and coordinate servers in clusters.
- Strong knowledge of SDLC methodologies such as Waterfall and Agile.
- Good knowledge of Agile methodology, including user stories, burndown charts, sprints, and the continuous integration process in an Agile project.
- Created interactive dashboards and reports as per requirements using Tableau.
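The following is a minimal PySpark sketch of the Databricks validate-and-load pattern mentioned above. It assumes the Snowflake Spark connector is attached to the cluster; all paths, table names, columns, and connection options are illustrative placeholders.

from pyspark.sql import SparkSession, functions as F

# In Databricks a SparkSession already exists as `spark`; getOrCreate()
# simply reuses it when run there.
spark = SparkSession.builder.appName("validate_and_load").getOrCreate()

# Read raw files landed in cloud storage (placeholder path)
raw_df = spark.read.option("header", "true").csv("/mnt/raw/customers/")

# Basic validation: require a business key and drop duplicates
valid_df = (
    raw_df.filter(F.col("customer_id").isNotNull())
          .dropDuplicates(["customer_id"])
          .withColumn("load_ts", F.current_timestamp())
)

# Load into Snowflake via the Spark connector (placeholder options)
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "LOAD_WH",
    "sfUser": "<user>",
    "sfPassword": "<password>",
}
(valid_df.write.format("snowflake")
         .options(**sf_options)
         .option("dbtable", "CUSTOMERS")
         .mode("append")
         .save())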
TECHNICAL SKILLS:
Hadoop/Big Data Technologies: Hadoop, MapReduce, Sqoop, Hive, Oozie, Spark, Zookeeper, Cloudera Manager, Kafka, Flume
ETL Tools: Informatica
NoSQL Databases: HBase, Cassandra, MongoDB, DynamoDB
Monitoring and Reporting: Tableau, Custom shell scripts
Hadoop Distributions: Hortonworks, Cloudera, MapR
Build Tools: Maven
Programming & Scripting: Python, Scala, SQL, Shell Scripting
Databases: Snowflake, Oracle, MySQL, MS SQL Server, Teradata
Version Control: Git, GitHub, SVN
Operating Systems: Linux, Unix, Mac OS X, Windows 10, Windows 8
Cloud Computing: AWS, Azure
PROFESSIONAL EXPERIENCE:
Confidential, Brookfield, WI
AWS Data Engineer
Responsibilities:
- Migrated multiple applications running as monolithic workloads on the on-prem Hadoop ecosystem to the AWS cloud.
- Supported data science applications used for churn prediction, fraud detection, and stock forecasting.
- Developed and worked on multiple on-prem applications using Hadoop ecosystem frameworks and tools such as HDFS, MapReduce, YARN, Pig, Hive, Sqoop, Storm, HBase, Kafka, Flume, and NiFi.
- Migrated on-prem applications to the AWS cloud using services such as EC2, S3, Lambda, EMR, RDS, CloudWatch, SNS, SES, SQS, VPC, IAM, Elastic Load Balancing, Auto Scaling, and others.
- Imported and exported data from various sources into HDFS using Sqoop and Flume.
- Created multiple MapReduce programs for data extraction, transformation, and aggregation from multiple data formats, including XML, JSON, and other compressed formats.
- Designed and modeled Hive databases using partitioned and bucketed tables, storing data in file formats such as JSON, CSV, XML, and Parquet.
- Worked on SQL Server Integration Services (SSIS) and Data Transformation Services (DTS) packages to import and export databases.
- Installed, configured, and managed RDBMS and NoSQL databases and tools.
- Worked on NoSQL databases such as MongoDB to store customer data.
- Ingested data from multiple RDBMS and streaming sources and developed Spark applications in Python.
- Brought business data into HDFS through Sqoop, performed transformations using Hive and MapReduce, and loaded the results into HBase tables.
- Created Spark Streaming applications to process Kafka data in near real time, applying both stateless and stateful transformations (see the sketch after this list).
- Worked on EC2 and S3 to process and store small data sets, and worked with AWS EMR clusters.
- Experienced in writing Python scripts to build ETL pipelines and Directed Acyclic Graph (DAG) workflows with Airflow and Apache NiFi.
- Developed and deployed AWS Lambda functions to build a serverless data pipeline whose output is registered in the Glue Data Catalog and queried from Athena for ETL migration services.
- Configured data loads from S3 to Redshift using AWS Data Pipeline.
- Created S3 bucket policies using IAM role-based policies and loaded data using AWS Glue and PySpark.
- Configured Amazon S3 Glacier for backups on AWS.
- Created monitors, alarms, and notifications for EC2 hosts using CloudWatch and SNS (Simple Notification Service); see the boto3 sketch at the end of this section.
- Designed and developed ETL processes in AWS Glue to migrate campaign data from internal sources such as SFTP servers into Amazon Redshift.
- Wrote several MapReduce jobs using PySpark with the NumPy and Pandas libraries, and used Jenkins for continuous integration.
- Worked on cloud deployments using Maven, Docker, and Jenkins.
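A minimal Spark Structured Streaming sketch of the near-real-time Kafka processing described above. It assumes the spark-sql-kafka package is available on the cluster; broker addresses, topic names, fields, and checkpoint paths are placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka_near_real_time").getOrCreate()

# Subscribe to a Kafka topic (placeholder broker/topic)
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("subscribe", "transactions")
         .option("startingOffsets", "latest")
         .load()
)

# Stateless transformation: Kafka values arrive as binary, so cast and parse
parsed = (
    events.selectExpr("CAST(value AS STRING) AS json_str")
          .select(
              F.get_json_object("json_str", "$.account_id").alias("account_id"),
              F.get_json_object("json_str", "$.amount").cast("double").alias("amount"),
          )
)

# Stateful transformation: running totals per account
totals = parsed.groupBy("account_id").agg(F.sum("amount").alias("total_amount"))

# Write to the console for illustration; a real job would target S3, HBase, etc.
query = (
    totals.writeStream
          .outputMode("complete")
          .format("console")
          .option("checkpointLocation", "s3://example-bucket/checkpoints/txn/")
          .start()
)
query.awaitTermination()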
Environment: Spark, Spark Streaming, Spark SQL, AWS EMR, S3, Lambda, EC2, Redshift, RDS, Glue, Python, PySpark, Airflow, NiFi, MongoDB, DynamoDB, CloudWatch, HDFS, MapReduce, Hive, YARN, Pig, Apache Kafka, Sqoop, MySQL, Git, Oozie, Jenkins.
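A minimal boto3 sketch of the CloudWatch alarm and SNS notification setup referenced in this section; topic names, instance IDs, and the email endpoint are hypothetical.

import boto3

cloudwatch = boto3.client("cloudwatch")
sns = boto3.client("sns")

# Create an SNS topic and subscribe an operations mailbox (placeholder address)
topic_arn = sns.create_topic(Name="ec2-alerts")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="ops@example.com")

# Alarm on sustained high CPU for a single EC2 host (placeholder instance id)
cloudwatch.put_metric_alarm(
    AlarmName="ec2-high-cpu-i-0123456789abcdef0",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[topic_arn],
)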
Confidential
Data Engineer
Responsibilities:
- Designed and developed an ETL pipeline in the Azure cloud that takes customer data from an API, processes it, and stores it in an Azure SQL Database.
- Used Apache Spark to process large amounts of data for faster processing and improved output.
- Worked with Microsoft Azure cloud platform services, namely HDInsight, Data Lake, Databricks, Stream Analytics, Blob Storage, Data Factory, Synapse, Data Storage Explorer, and NoSQL databases.
- Created linked services to connect external resources to Azure Data Factory (ADF).
- Created numerous pipelines in Azure Data Factory to gather data from various source systems using activities such as Move & Transform and Copy.
- Ingested data into Azure Blob Storage and processed it using Databricks. Involved in writing Spark Scala scripts and UDFs to perform transformations on large datasets.
- Performed ETL operations in Azure Databricks by connecting to several relational database source systems using JDBC connectors.
- Developed Python scripts to perform file validations in Databricks and automated the process using ADF (a sketch follows this list).
- Created Databricks job workflows that extract data from SQL Server and upload the files to SFTP using PySpark and Python (see the sketch at the end of this section).
- Scheduled Spark and Scala jobs in Hadoop Cluster using Oozie workflow and generated detailed design documentation for source-to-target transformations.
- Orchestrated all Data pipelines using Azure Data Factory and built a custom alerts platform for monitoring.
- Worked on data processing such as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
- Worked with SQL Server databases to create schemas for the project's physical data model, SSIS Packages, and external tables for extracting data from Azure Data Lake.
- Developed Azure PowerShell scripts to copy and move data from a local file system to HDFS-compatible Blob Storage.
- Configured Input & Output bindings of Azure Function with Azure Cosmos DB collection to read and write data from the container whenever the function executes.
- Experience with Azure Data Lake Storage (ADLS) and Data Lake Analytics, and good knowledge of how to integrate them with other Azure services.
- Involved in developing data ingestion pipelines on an Azure HDInsight Spark cluster using Azure Data Factory and Spark SQL. Also worked with Cosmos DB (SQL API and Mongo API).
- Created an automated procedure using Selenium that ingests data from a web service on a regular basis and loads it into an Azure SQL database in the Azure cloud.
- Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming for streaming analytics in Databricks.
- Extracted data from Azure Data Lake into an HDInsight cluster and executed Spark transformations and actions.
- Worked on NoSQL databases like HBase and Cassandra and developed real-time read/write access to very large datasets via HBase.
- Designed and developed a new solution to process near-real-time (NRT) data using Azure Stream Analytics, Azure Event Hubs, and Service Bus queues.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and Azure Data Lake Analytics.
- Implemented and enhanced the automated release process through which software flows from developers to customers using Azure DevOps (ADO). Defined the next-generation CI/CD process and supported the test automation framework in the Microsoft cloud as part of the build engineering team.
- Worked with Agile methodology, including user stories, burndown charts, sprints, and the continuous integration process in an Agile project.
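A minimal sketch of the kind of Databricks file-validation step described above, run from an ADF-triggered notebook or job; the paths, expected columns, and failure behavior are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file_validation").getOrCreate()

# Placeholder mount points and schema expectations
INPUT_PATH = "/mnt/landing/orders/"
OUTPUT_PATH = "/mnt/validated/orders/"
EXPECTED_COLUMNS = {"customer_id", "order_id", "order_date", "amount"}

df = spark.read.option("header", "true").csv(INPUT_PATH)

errors = []
missing = EXPECTED_COLUMNS - set(df.columns)
if missing:
    errors.append(f"missing columns: {sorted(missing)}")
if df.count() == 0:
    errors.append("no data rows found")

if errors:
    # Raising makes the ADF activity that runs this notebook/job fail,
    # so the pipeline can alert and stop downstream loads.
    raise ValueError("; ".join(errors))

# Only validated files move forward
df.write.mode("overwrite").parquet(OUTPUT_PATH)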
Environment: Azure HDInsight, Databricks (ADBX), Data Lake (ADLS), Azure SQL DB, MySQL, MongoDB, Cassandra, Flume, Teradata, Azure DevOps, Git, Data Factory, Data Storage Explorer, Blob Storage, Scala, Spark v2.0.2, Hadoop 2.x (HDFS), Airflow, HBase, Agile methodologies.
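A hedged sketch of the SQL Server extract and SFTP upload workflow referenced in this section. It assumes the Microsoft JDBC driver and the paramiko library are available on the cluster; hosts, credentials, tables, and paths are placeholders.

import paramiko
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqlserver_to_sftp").getOrCreate()

# Extract from SQL Server over JDBC (placeholder connection details)
orders = (
    spark.read.format("jdbc")
         .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=sales")
         .option("dbtable", "dbo.orders")
         .option("user", "<user>")
         .option("password", "<password>")
         .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
         .load()
)

# For a modest extract, collect to a single CSV on the DBFS FUSE mount
local_path = "/dbfs/tmp/orders.csv"
orders.toPandas().to_csv(local_path, index=False)

# Upload the file to the SFTP server (placeholder host/credentials/remote path)
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="svc_user", password="<password>")
sftp = paramiko.SFTPClient.from_transport(transport)
sftp.put(local_path, "/inbound/orders.csv")
sftp.close()
transport.close()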
Confidential
Data Engineer
Responsibilities:
- Worked on Sqoop and the Hortonworks Connector for Teradata to export and import data between Teradata and Hive/HDFS.
- Worked on building an ETL data pipeline on Hadoop/Teradata with Hadoop/Pig/Hive/UDFs.
- Created Spark code in Scala and Spark SQL for faster processing and testing.
- Worked with AWS cloud services such as EC2, S3, EMR, and DynamoDB on big data.
- Managed data imported from various sources, performed transformations using Hive, Pig, and MapReduce, and loaded the data into HDFS.
- Imported data from MS SQL Server, MySQL, and Teradata into HDFS using Sqoop.
- Transferred on-premises database schemas to the Confidential Redshift data warehouse.
- Migrated HiveQL queries to Spark SQL to improve performance.
- Wrote HiveQL queries to integrate multiple tables and create views to generate a result set.
- Collected log data from Web Servers and merged it into HDFS using Flume.
- Visualized different input data using SQL and Zeppelin, and built rich dashboards.
- Worked on Spark SQL to load JSON data and build schema RDDs, which were then loaded into Hive tables to handle structured data (see the sketch after this list).
- Participated in the design, development, and deployment of NoSQL databases such as MongoDB.
- Experience in defining and deploying monitoring, metrics, and logging systems on AWS.
- Worked on Sqoop to import and export data between HDFS and RDBMSs.
- Expertise with Hive SQL, Presto SQL, and Spark SQL for ETL workloads, as well as using the appropriate technology for the project.
- Used Bitbucket and Git for code versioning and CI/CD, with good knowledge of the Git lifecycle.
- Conducted data blending and data preparation for Tableau using SQL.
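A minimal Spark SQL sketch of loading JSON data and persisting the shaped result as a Hive table, as described above; paths, table names, and columns are illustrative.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("json_to_hive")
    .enableHiveSupport()
    .getOrCreate()
)

# Spark infers the schema directly from the JSON records (placeholder path)
events = spark.read.json("hdfs:///data/raw/events/")

# Register a temporary view and shape the result with Spark SQL
events.createOrReplaceTempView("events_stg")
result = spark.sql("""
    SELECT user_id, event_type, COUNT(*) AS event_count
    FROM events_stg
    GROUP BY user_id, event_type
""")

# Persist as a managed Hive table for downstream HiveQL access
result.write.mode("overwrite").saveAsTable("analytics.event_counts")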
Environment: Hadoop, MapReduce, YARN, Hive, Pig, Flume, Sqoop, AWS, Tableau, Core Java, Spark, Scala, MongoDB, Hortonworks, Elasticsearch 5.x, Eclipse.
Confidential
Hadoop Developer
Responsibilities:
- Created an enterprise big data warehouse and imported data from many sources into BDA.
- Migrated traditional applications into Informatica IDQ using the Big Data cluster and its ecosystem.
- Transformed and analyzed data based on ETL mappings using Spark and Hive.
- Performed ETL and prepared data lakes for diverse domains using DataStage, Informatica BDM, and Exadata.
- Extracted data from Teradata/Exadata to HDFS using Sqoop for the settlement and billing domains.
- Supported the quality of IT solutions for business users by performing functional and regression testing.
- Created data frames for Hive tables and developed Spark scripts.
- Imported data into harmonized Hive tables, applying Spark transformations to the source tables (a sketch follows this list).
- Developed performance tuning in Spark for various source system domains and incorporated it into the harmonized layer.
- Created automated scripts with Oozie and deployed them in production.
- Performed black-box testing for the web-based application, an interface to the mainframe.
- Created and Executed Automation Test Scripts for Functional and Regression Testing.
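A hedged PySpark sketch of the harmonized-table load pattern described above; database, table, and column names are illustrative.

from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("harmonize_billing")
    .enableHiveSupport()
    .getOrCreate()
)

# Read the raw Hive source table (placeholder database/table)
src = spark.table("raw.billing_events")

# Standardize types, derive a partition-friendly date, and de-duplicate
harmonized = (
    src.withColumn("event_date", F.to_date("event_ts"))
       .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
       .dropDuplicates(["billing_id"])
)

# A full overwrite keeps the sketch simple; partition-level overwrites are a
# common refinement for incremental loads.
harmonized.write.mode("overwrite").saveAsTable("harmonized.billing_events")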
Environment: Hive, Spark, Python, Spark SQL, HDFS, SAP BO, Sqoop, Cloudera, PySpark, Oozie
Confidential
SQL Developer
Responsibilities:
- Analyzed, planned, and built databases using ER diagrams, normalization, and relational database concepts.
- Participated in the system's design, development, and testing.
- Developed SQL Server stored procedures and SQL queries, and fine-tuned them using indexes and execution plans.
- Created views and developed user-defined functions.
- Created triggers to ensure referential integrity.
- Implemented exception handling.
- Worked on client requests and built complex SQL queries to produce Crystal Reports.
- Designed and automated regular jobs.
- Tuned and optimized SQL queries using the execution plan and profiler.
- Built the controller component using Servlets and action classes.
- Created business elements (model components) using Enterprise Java Beans (EJB).
- Involved in planning, assessing, and documenting the development effort, including deadlines, risks, test needs, and performance targets, so that the team was able to establish a timetable and resource requirements.
- System requirements were analysed, and a system design document was created.
- Created a dynamic user interface using HTML and JavaScript utilizing JSP and Servlet technologies.
- Transmitted and received messages using JMS components.
- Created and executed test plans using TestDirector and Quality Center.
- Mapped requirements to test cases in Quality Center.
- Supported system and user acceptance tests.
- Rebuilt indexes and tables as part of the performance optimization process.
- Participated in the database backup and recovery process.
- Created documentation using MS Word.
Environment: MS SQL Server, SSRS, SSIS, SSAS, DB2, HTML, XML, JSP, Servlet, JavaScript, EJB, JMS, MS Excel, MS Word.