Sr. Azure Data Engineer Resume
Albany, New York
SUMMARY
- 7+ years of professional software development experience, with 5+ years of expertise in Big Data, the Hadoop ecosystem, Cloud Engineering, and Data Warehousing.
- Experience in loading data to Azure Data Lake, Azure SQL Data Warehouse, and Azure SQL Database, and building data pipelines using Azure Databricks and Azure Data Factory.
- Good experience with Azure services like HDInsight, Active Directory, Storage Explorer, Stream Analytics.
- Extensive experience in Azure Cloud Services (PaaS & IaaS), Storage, Data Factory, Data Lake (ADLA & ADLS), Active Directory, Synapse, Logic Apps, Azure Monitoring, Key Vault, and Azure SQL.
- Strong experience working with Microsoft Azure for building, managing, and deploying applications, including creating Azure virtual machines.
- Good knowledge of identifying and troubleshooting connectivity and other issues for applications hosted on the Azure platform.
- Expertise in the Amazon Web Services (AWS) Cloud Platform, including services like VPC, DynamoDB, Route 53, Elastic Container Service (ECS), Security Groups, CloudWatch, EC2, S3, Kinesis, Redshift, IAM, CloudFormation, ELB, CloudFront, and Elastic Beanstalk.
- Implemented Terraform deployments to manage the AWS infrastructure and managed servers using configuration tools like Chef and Ansible. Also created Terraform scripts for EC2 instances, Elastic Load balancers and S3 buckets.
- Experience in using Python, including Boto3, to supplement automation provided by Ansible and Terraform for tasks such as encrypting EBS volumes and scheduling Lambda functions for routine AWS tasks.
- Implemented monitoring and established best practices around using Elasticsearch and used AWS Lambda to run code without managing servers.
- Expertise in developing applications using the Big Data ecosystem - Hadoop (HDFS, YARN, MapReduce), Pig, Sqoop, Zookeeper, Spark, Flume, Hive.
- Hands-on experience in developing pipelines using Hive (HQL) to retrieve data from the Hadoop cluster and Oracle database, and using ETL for transforming data.
- Good knowledge of designing and developing Hive databases, organizing data using techniques like bucketing and partitioning.
- Implemented Hadoop jobs on an EMR cluster, performing several Spark, Hive, and MapReduce jobs to process data for building recommendation engines and behavioral insights.
- Experience in using Cloudera Manager and Cloudera Director to install separate Hadoop Clusters for Development and Testing.
- Converted Hive/MySQL/HBase queries into Spark RDD’s using Spark transformations and Scala.
- Developed Spark applications to extract, transform and aggregate data from different formats to draw insights into customer patterns using PySpark and Spark-SQL.
- Strong working experience with SQL and NoSQL databases (DynamoDB, HBase), tuning, data modeling, backup, disaster recovery, and creating data pipelines.
- Good experience in scripting languages such as Python (PySpark), Scala, and Spark-SQL for development and aggregation across various file formats like JSON, CSV, Parquet, XML, ORC, and Avro (a brief sketch follows this summary).
- Performed data wrangling to clean, transform, and reshape data utilizing the pandas library.
- Analyzed data using SQL, Scala, Python, Apache Spark and presented analytical reports to management and technical teams.
- Highly involved in all phases of SDLC process using Waterfall and Agile Scrum methodologies.
- Coordinated with cross-functional teams to execute short- and long-term product delivery strategies, with a successful track record of implementing best business practices.
- Good communication and strong interpersonal and organizational skills, with the ability to manage multiple projects. Always willing to learn and adopt new technologies.
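A minimal PySpark sketch of the multi-format aggregation described above; the paths, column names, and grouping key are hypothetical placeholders rather than details from any particular engagement.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multi-format-aggregation").getOrCreate()

def conform(df):
    """Project each source onto a common schema before unioning (columns are illustrative)."""
    return df.select(
        F.col("customer_id").cast("string"),
        F.col("amount").cast("double"),
    )

# Read the same logical dataset from JSON, CSV, and Parquet sources.
orders_json = conform(spark.read.json("s3a://example-bucket/raw/orders/json/"))
orders_csv = conform(spark.read.option("header", True).csv("s3a://example-bucket/raw/orders/csv/"))
orders_parquet = conform(spark.read.parquet("s3a://example-bucket/raw/orders/parquet/"))

orders = orders_json.unionByName(orders_csv).unionByName(orders_parquet)

# Aggregate per customer and persist the result as Parquet.
summary = (orders.groupBy("customer_id")
           .agg(F.sum("amount").alias("total_amount"),
                F.count("*").alias("order_count")))
summary.write.mode("overwrite").parquet("s3a://example-bucket/curated/order_summary/")
```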
TECHNICAL SKILLS
Azure services: Azure Data Factory, Databricks, Azure Active Directory, Blob Storage, Data Lake, SQL Database, SQL Data Warehouse.
AWS services: EC2, S3, EMR, Glue, RDS, Elasticsearch, SQS, EBS, Lambda, Athena, Kinesis, ECS, DynamoDB, QuickSight, Redshift.
Big Data Ecosystem: HDFS, Hive, Yarn, Spark, MapReduce, HBase, Zookeeper, Airflow, StreamSets, Oozie, Sqoop, Flume, Pig.
Databases: MySQL, Oracle, Teradata, MS SQL, DynamoDB.
ETL/BI Tools: Snowflake, Informatica, Tableau, Power BI
Hadoop Distribution: Hortonworks and Cloudera.
Scripting Languages: Python, PySpark, Scala/Spark, SQL, PowerShell scripting.
Operating systems: Linux (Ubuntu, CentOS, RedHat), Windows (XP/7/8/10)
Version Control: Git, SVN, Bitbucket
Development Methodologies: Agile, Waterfall
IDE and Build Tools: Jupyter Notebook, Terraform, Anaconda, Jira
PROFESSIONAL EXPERIENCE
Sr. Azure Data Engineer
Confidential, Albany, New York
Responsibilities:
- Design, build, test, and maintain end-to-end data pipelines, data integration, ETL processes, and data management delivery within Azure Cloud using Azure Data Factory, Azure Data Lake Storage and Azure Databricks.
- Ingesting/migrating data, applying transformation logic, and running continuous data quality checks on the data ingested from various sources.
- Determining the lifecycle from analysis to production, with a focus on data validation, defining logic and performing transformations according to the business requirements, and creating end to end ETL data pipelines.
- Manipulated semi-structured and unstructured data in Azure Databricks across Bronze-Silver-Gold zones using PySpark as the programming language (a simplified sketch follows this section).
- Using Azure Data Factory (V2), created ingestion pipelines from different sources into Azure Data Lake Storage.
- Created and maintained Azure resources using a combination of Windows PowerShell and Azure Resource Manager (ARM) templates for unit testing during the DevOps process.
- Moving data from Azure Blob storage to Azure Data Lake Storage using Azure Data Factory pipelines.
- Configuring and developing Azure Databricks notebooks using PySpark and Spark SQL for data transformation, aggregations, and extractions from multiple file formats for analyzing the data.
- Involved in application design and data architecture using Cloud and Big Data solutions on Azure.
- Leading the effort to migrate a legacy system to a Microsoft Azure cloud-based solution and re-designing the legacy application solutions with minimal changes to run on the cloud platform.
- Maintain code repositories for Databricks notebooks, Data Factory pipelines in GitHub.
- Created and developed stored procedures, joins, and triggers to handle complex business rules within the Azure environment.
- Wrote complex SQL statements using CTEs, correlated subqueries, and joins.
- Part of the Agile Team and worked on bi-weekly sprints, daily standups, sprint planning, stakeholders’ demo, production rollout, and signoff.
- Keeping track of the process/progress using the JIRA ticketing system and reviewing tickets before moving them on the Kanban board for sprint achievements.
- Used Agile methodology in developing the application, which included iterative application development, customer reporting, backlog grooming and weekly Sprints.
Environment: Azure Data Factory, Azure Data Lake Storage (ADLS), Azure Databricks, Power BI, PySpark, Python, Spark SQL, PowerShell scripting, ETL, Git, Kanban, Jira.
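A simplified sketch of the Bronze-Silver-Gold layering referenced above, assuming JSON landing data and Parquet outputs on ADLS Gen2; the storage account, container names, columns, and cleansing rules are illustrative assumptions (many Databricks projects would use Delta Lake rather than plain Parquet).

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically in a Databricks notebook

# ADLS paths are placeholders; real storage account and container names differ.
bronze_path = "abfss://bronze@examplestorage.dfs.core.windows.net/events/"
silver_path = "abfss://silver@examplestorage.dfs.core.windows.net/events/"
gold_path = "abfss://gold@examplestorage.dfs.core.windows.net/daily_event_counts/"

# Bronze: raw, semi-structured JSON landed by the ingestion pipeline.
bronze_df = spark.read.json(bronze_path)

# Silver: cleaned and conformed - typed columns, deduplicated, bad rows dropped.
silver_df = (bronze_df
             .withColumn("event_ts", F.to_timestamp("event_ts"))
             .dropna(subset=["event_id", "event_ts"])
             .dropDuplicates(["event_id"]))
silver_df.write.mode("overwrite").parquet(silver_path)

# Gold: business-level aggregate consumed by reporting tools such as Power BI.
gold_df = (silver_df
           .groupBy(F.to_date("event_ts").alias("event_date"), "event_type")
           .count())
gold_df.write.mode("overwrite").parquet(gold_path)
```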
AWS Data Engineer
Confidential, Dallas, Texas
Responsibilities:
- Participated in the code migration of a quality monitoring tool from Amazon EC2 to AWS Lambda and built logical datasets to administer quality monitoring on Snowflake warehouses.
- Implemented and set up AWS Shield, AWS Config, Amazon Macie, and Amazon Inspector for security and protection of sensitive data.
- Automated cloud infrastructure using Terraform, along with application configuration and deployment.
- Creating and managing access to AWS services for IAM user accounts and for role-based users.
- Using Tableau, designed dashboards to show operational metrics.
- Hands-on experience integrating AWS services: EC2, S3, network protocols, Transit VPC, VPC Peering, VPC Endpoints, and AWS PrivateLink.
- Developed multiple POCs using Spark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
- Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as the storage mechanism.
- Evaluated Snowflake design considerations for any change in the application.
- Worked on the Spark Databricks cluster for estimating the cluster's size, monitoring, and troubleshooting on the AWS cloud.
- Implemented installation and configuration of a multi-node cluster on the cloud using Amazon Web Services.
- Experienced with Spark Streaming and AWS Kinesis for real-time data processing.
- Configured S3, AWS Glue, and EC2 services using Python Boto3 (see the sketch after this section).
- Involved in PL/SQL query optimization to reduce the overall run time of stored procedures.
- Worked on Kibana dashboards based on Logstash data and integrated several source and target systems into Elasticsearch for near real-time log analysis of end-to-end transaction monitoring.
- Integrated Apache Airflow with AWS to monitor multi-stage machine learning processes with Amazon SageMaker jobs.
- Worked on S3 buckets in AWS extensively and moved data from HDFS to AWS Simple Storage Service (S3).
- Actively participated with client stakeholders to gather business requirements and document them for the project plan.
Environment: AWS, PySpark, Spark Streaming, EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Apache Hive, MapReduce, Snowflake, Pig, Python, NumPy, Pandas, SSRS, Tableau.
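A hedged Boto3 sketch of the kind of S3, Glue, and EC2 configuration mentioned above; the bucket name, crawler name, and tag filter are invented for illustration, and a production version would add region settings and error handling.

```python
import boto3

# All resource names and IDs below are placeholders.
s3 = boto3.client("s3")
glue = boto3.client("glue")
ec2 = boto3.client("ec2")

# S3: create a bucket and enable default server-side encryption.
s3.create_bucket(Bucket="example-data-lake-bucket")
s3.put_bucket_encryption(
    Bucket="example-data-lake-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Glue: start a crawler that catalogs the raw zone.
glue.start_crawler(Name="example-raw-zone-crawler")

# EC2: stop idle instances tagged for the dev environment.
response = ec2.describe_instances(
    Filters=[{"Name": "tag:Environment", "Values": ["dev"]}]
)
instance_ids = [
    i["InstanceId"]
    for r in response["Reservations"]
    for i in r["Instances"]
]
if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
```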
Big Data Engineer/Hadoop Developer
Confidential
Responsibilities:
- Extracted real-time data feeds using Kafka, processed them with Spark Streaming as Resilient Distributed Datasets (RDDs) and DataFrames, and saved the results in Parquet format to HDFS and NoSQL databases.
- Solved performance issues in Hive and Pig scripts by understanding how joins, groups, and aggregations translate to MapReduce jobs.
- Extracted batch and real-time data from Oracle, DB2, Teradata, Netezza, and SQL Server to Hadoop (HDFS) using Sqoop, Teradata TPT, Apache Storm, and Apache Kafka.
- Configured Zookeeper to manage Kafka cluster nodes, coordinate the brokers/cluster topology.
- Installed, configured, and maintained the Hadoop cluster for application development, along with Hadoop ecosystem components like Hive, Pig, HBase, Zookeeper, and Sqoop.
- Handled Hadoop cluster installations in a Windows environment.
- Created tables in Snowflake DB, loading and analyzing data using Scala-Spark scripts.
- Developed ETL pipelines in and out of the data warehouse using a combination of Snowflake's SnowSQL and Python.
- Worked on configuring and managing disaster recovery and backup on Cassandra Data.
- Continuously tuned Hive queries and UDFs for faster execution by employing partitioning and bucketing.
- Configured Zookeeper to coordinate the servers in clusters and maintain data consistency, and participated in migrating jobs to a newer version of Zookeeper.
- Implemented partitioning, dynamic partitions, and bucketing in Hive (illustrated in the sketch after this section).
- Used Flume to collect, aggregate, and store the weblog data from different sources like web servers, mobile and network devices, and pushed to HDFS.
- Supported in setting up the QA environment and updating configurations for implementing scripts with Pig, Hive, and Sqoop.
- Developed data pipeline using Sqoop to ingest cargo data and customer histories into HDFS for analysis.
- Experience in importing and exporting data using Sqoop between HDFS and relational database systems (RDBMS) such as Teradata.
- Participated in the requirement gathering and analysis phase of project in documenting and conducting meetings/workshops with various business users.
Environment: Hive 2.3, Pig 0.17, Python, HDFS, Sqoop 1.4, Hadoop 3.0, Azure, AWS, NoSQL, Oozie, Power BI, Agile, Zookeeper.
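A short PySpark sketch illustrating the Hive partitioning and bucketing strategy referenced above, using a Hive-enabled SparkSession; the database, table, columns, and bucket count are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# Hive-backed session; all table, column, and path names below are illustrative.
spark = (SparkSession.builder
         .appName("hive-partitioning-bucketing")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

events = spark.read.parquet("hdfs:///data/raw/events/")

# Partition on a low-cardinality column and bucket on a high-cardinality join key.
(events.write
 .mode("overwrite")
 .partitionBy("event_date")
 .bucketBy(32, "customer_id")
 .sortBy("customer_id")
 .saveAsTable("analytics.events_bucketed"))

# Filtering on the partition column prunes directories; joining on the bucketed
# key lets Spark/Hive avoid a full shuffle.
spark.sql("""
    SELECT event_date, COUNT(*) AS event_count
    FROM analytics.events_bucketed
    WHERE event_date = '2020-01-01'
    GROUP BY event_date
""").show()
```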
Spark Developer/Scala Developer
Confidential
Responsibilities:
- Extracted data from the CSV files for different years and created a parallelized RDD in a local Spark session.
- Converted the raw data from the extracted CSV files into readable, processable content in the RDDs.
- Used PySpark DataFrames to convert the distributed collection of data into named columns, following the Databricks Lakehouse architecture (see the sketch after this section).
- Stored data in an AWS S3 data lake and further processed it using PySpark.
- Validating data sets by implementing Spark components.
- Analyzed data stored in S3 buckets using SQL and PySpark.
- Performed data operations like Text Analytics and Data Processing, using the in-memory computing capabilities of Spark using Scala.
- Used Spark-SQL to read, process the parquet data, and create the tables using the Scala API.
- Monitored Spark Application to capture the logs generated by Spark jobs.
- Used Python and SQL on many datasets to obtain metrics and perform ETL operations.
- Played an important role in migrating jobs from Spark 0.9 to 1.4 to 1.6.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Consumed the data from Kafka using Apache Spark.
- Designed solutions for SQL scripts to be implemented using Scala.
- Used Scala collection framework and Scala functional programming concepts to store and process complex consumer information to develop business logic.
- Developed various MapReduce applications to perform ETL workloads on terabytes of data.
- Designed and implemented Apache Spark Application (Cloudera).
- Experienced in Scala functional programming using Closures, Currying, and monads.
- Developed analytical components using Spark, Scala, and Spark Streaming.
- Collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Developed Spark scripts using Scala shell commands as per the requirement.
Environment: Hadoop, Spark, Scala, SQL, Cloudera Manager, Teradata, Hive.
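A small PySpark sketch of the CSV-to-RDD-to-DataFrame flow described above; the file path, column layout, and parsing rules are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-rdd").getOrCreate()
sc = spark.sparkContext

# Path and column layout are placeholders; the real feeds spanned several years of CSVs.
raw_rdd = sc.textFile("file:///data/trips/trips_*.csv")
header = raw_rdd.first()

def parse(line):
    """Turn a raw CSV line into a typed tuple, skipping malformed rows."""
    fields = line.split(",")
    try:
        return (fields[0], int(fields[1]), float(fields[2]))
    except (IndexError, ValueError):
        return None

parsed_rdd = (raw_rdd
              .filter(lambda line: line != header)
              .map(parse)
              .filter(lambda row: row is not None))

# Convert the RDD into a DataFrame with named columns for Spark SQL processing.
df = parsed_rdd.toDF(["trip_id", "passenger_count", "distance_miles"])
df.createOrReplaceTempView("trips")
spark.sql("SELECT AVG(distance_miles) AS avg_distance FROM trips").show()
```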
Data Analyst
Confidential
Responsibilities:
- Developing new data flows from existing ones and implementing data flows as per requirements.
- Collecting, interpreting, and analyzing the data, identifying patterns and trends in data sets.
- Working with multiple data sources - flat files, on-premises, and cloud.
- Hands-on experience extracting insights from large data sets and building reporting models and dashboards.
- Worked on various Python libraries for developing and testing code for data transformations (see the sketch after this section).
- Defining efficient migration strategies used to transfer data from one database to another.
- Analyzed data to predict trends in the customer base and the consumer population performing statistical analysis of data.
- Built and maintained data analysis for various projects using SQL scripts, indexes, and complex queries.
- Executed SQL queries using Query Analyzer that helps in generating reports for marketing and sales.
- Gathering requirements, creating use cases, and creating product backlogs.
- Experience using tools such as Excel and Tableau for data visualization.
- Prepared Tableau reports for the business according to the requirements.
- Using open, high, low, close data for predicting future prices using MATLAB.
- Used R programming for data exploration and analysis.
- Implemented Data Analytics Project Plans and prepared testing strategies for projects among several groups.
- Performing analysis based on results derived from SQL, MS Office, Excel, and Visual Basic Scripts.
Environment: Python, SQL, Tableau, Excel, MS Office.
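A brief pandas sketch of the clean, transform, and reshape work described above; the input file, columns, and pivot layout are invented placeholders.

```python
import pandas as pd

# File and column names are illustrative placeholders.
sales = pd.read_csv("monthly_sales.csv", parse_dates=["order_date"])

# Clean: drop exact duplicates and rows missing the key fields.
sales = sales.drop_duplicates().dropna(subset=["customer_id", "amount"])

# Transform: derive a reporting month and normalize region labels.
sales["month"] = sales["order_date"].dt.to_period("M").astype(str)
sales["region"] = sales["region"].str.strip().str.title()

# Reshape: pivot into a month-by-region revenue matrix for reporting.
report = sales.pivot_table(index="month", columns="region",
                           values="amount", aggfunc="sum", fill_value=0)
report.to_csv("monthly_sales_by_region.csv")
```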
