Sr. Azure Data Engineer Resume
SUMMARY
- Around 9 years of IT experience spanning data analytics engineering and programmer analyst roles. Experienced with cloud platforms including Amazon Web Services, Azure, and Databricks (on both Azure and AWS). Proficient with complex workflow orchestration tools, namely Oozie, Airflow, Data Pipelines, Azure Data Factory, CloudFormation, and Terraform.
- Implemented data warehouse solutions consisting of ETL workflows and on-premises-to-cloud migrations, with strong expertise in building and deploying batch and streaming data pipelines in cloud environments.
- Worked on Airflow 1.8 (Python 2) and Airflow 1.9 (Python 3) for orchestration; familiar with building custom Airflow operators and orchestrating workflows with cross-cloud dependencies (see the operator sketch at the end of this summary).
- Leveraged Spark as an ETL tool for building data pipelines on various cloud platforms, including AWS EMR, Azure HDInsight, and MapR (CLDB) architectures.
- Career interests and future aspirations include, but are not limited to, ML, AI, RPA, and automation.
- Enthusiast of Spark for ETL, Databricks, cloud adoption, and data engineering in the open-source community.
- Proven expertise in deploying major software solutions for high-end clients to meet business requirements such as big data processing, ingestion, analytics, and on-premises-to-cloud migration.
- Proficient with Azure Data Lake Storage (ADLS), Databricks and IPython notebook formats, Databricks Delta Lake, and Amazon Web Services (AWS).
- Orchestration experience using Azure Data Factory and Airflow 1.8/1.10 on multiple cloud platforms, including leveraging built-in and custom Airflow operators.
- Developed and deployed various AWS Lambda functions using built-in AWS Lambda libraries, and also deployed Lambda functions in Scala with custom libraries.
- Expert understanding of AWS DNS services through Route 53, including Simple, Weighted, Latency, Failover, and Geolocation routing policies.
- Architect and implement ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.
- Develop Power BI reports & effective dashboards after gathering and translating end-user requirements.
- Recreating existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data warehouse environment.
- Propose architectures considering cost/spend in Azure and develop recommendations to right-size data infrastructure. Strong experience with data center migration, Azure data services, and virtualization.
- Experience in troubleshooting and resolving architecture problems including database and storage, network, security and applications.
- Experience managing Big Data platforms deployed in Azure Cloud.
- Implemented Copy activities and custom Azure Data Factory pipeline activities for on-cloud ETL processing.
- Experience in Monitoring and Tuning SQL Server Performance.
- Experience in configuration of the report server and report manager for job scheduling and granting permissions to different levels of users in SQL Server Reporting Services (SSRS).
- Expert in creating, debugging, configuring, and deploying ETL packages designed in MS SQL Server Integration Services (SSIS). Configured the Azure SQL firewall as a security mechanism.
- Comfortable wearing multiple hats: Azure architecture/systems engineering, network operations, and data engineering.
- Design & implement migration strategies for traditional systems on Azure (Lift and shift/Azure Migrate, other third-party tools).
- Collaborate with application architects on migrating Infrastructure as a Service (IaaS) applications to Platform as a Service (PaaS). Deploy Azure Resource Manager (ARM) JSON templates from PowerShell.
- Experience in performance tuning and optimization (PTO) and Microsoft Hyper-V virtual infrastructure. Fluent programming experience with Scala, Java, Python, SQL, T-SQL, and R.
- Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
- Adept at configuring and installing Hadoop/Spark Ecosystem Components.
- Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and transforming complex data using in-memory computing capabilities written in Scala.
- Worked with Spark to improve the efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, DataFrames, Pair RDDs, and Spark on YARN.
- Experience integrating various data sources such as Oracle SE2, SQL Server, flat files, and unstructured files into a data warehouse.
- Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
- Experience in Extraction, Transformation and Loading (ETL) data from various sources into Data Warehouses, as well as data processing like collecting, aggregating and moving data from various sources using Apache Flume, Kafka, PowerBI and Microsoft SSIS.
- Hands-on experience with Hadoop architecture and various components such as the Hadoop Distributed File System (HDFS), JobTracker, TaskTracker, NameNode, DataNode, and Hadoop MapReduce programming.
- Comprehensive experience in developing simple to complex MapReduce and streaming jobs using Scala and Java for data cleansing, filtering, and aggregation. Also possess detailed knowledge of the MapReduce framework.
- Used IDEs like Eclipse, IntelliJ IDE, PyCharm IDE, Notepad++, and Visual Studio for development.
- Seasoned practical experience with Machine Learning algorithms and Predictive Modeling, such as Linear Regression, Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means clustering.
- Ample knowledge of data architecture including data ingestion pipeline design, Hadoop/Spark architecture, data modeling, data mining, machine learning and advanced data processing.
- Experience working with NoSQL databases like Cassandra and HBase and developed real-time read/write access to very large datasets via HBase.
- Developed Spark Applications that can handle data from various RDBMS (MySQL, Oracle Database) and Streaming sources.
- Proficient SQL experience in querying, data extraction/transformations and developing queries for a wide range of applications.
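As a quick illustration of the custom Airflow operator work mentioned above, below is a minimal sketch in the Airflow 1.10 style that moves a file from Azure Blob Storage to S3. The operator name, connection IDs, container, bucket, and paths are illustrative assumptions, not taken from any specific project.

```python
# Hypothetical cross-cloud Airflow 1.10 operator: copy a blob from Azure Blob Storage to S3.
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class CopyBlobToS3Operator(BaseOperator):
    """Illustrative operator; connection IDs and object names are placeholders."""

    @apply_defaults
    def __init__(self, wasb_conn_id, container, blob_name,
                 s3_conn_id, bucket, key, *args, **kwargs):
        super(CopyBlobToS3Operator, self).__init__(*args, **kwargs)
        self.wasb_conn_id = wasb_conn_id
        self.container = container
        self.blob_name = blob_name
        self.s3_conn_id = s3_conn_id
        self.bucket = bucket
        self.key = key

    def execute(self, context):
        # Hooks are resolved at run time from Airflow connections.
        from airflow.contrib.hooks.wasb_hook import WasbHook
        from airflow.hooks.S3_hook import S3Hook

        local_path = "/tmp/{}".format(self.blob_name)
        wasb = WasbHook(wasb_conn_id=self.wasb_conn_id)
        wasb.get_file(local_path, self.container, self.blob_name)

        s3 = S3Hook(aws_conn_id=self.s3_conn_id)
        s3.load_file(local_path, key=self.key, bucket_name=self.bucket, replace=True)
```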
TECHNICAL SKILLS
Hadoop/Spark Ecosystem: Hadoop, MapReduce, Pig, Hive/Impala, YARN, Kafka, Flume, Oozie, Zookeeper, Spark, Airflow, MongoDB, Cassandra, HBase, and Storm.
Hadoop Distribution: Cloudera and Hortonworks
Programming Languages: Scala, Hibernate, JDBC, JSON, HTML, CSS, SQL, R, Shell Scripting
Script Languages: JavaScript, jQuery, Python.
Databases: Oracle, SQL Server, MySQL, Cassandra, Teradata, PostgreSQL, MS Access, Snowflake, NoSQL, HBase, MongoDB
Cloud Platforms: AWS, Azure, GCP
Distributed Messaging System: Apache Kafka
Data Visualization Tools: Tableau, Power BI, SAS, Excel, ETL
Batch Processing: Hive, MapReduce, Pig, Spark
Operating System: Linux (Ubuntu, Red Hat), Microsoft Windows
Reporting Tools/ETL Tools: Informatica PowerCenter, Tableau, Pentaho, SSIS, SSRS, Power BI
PROFESSIONAL EXPERIENCE
Confidential
Sr. Azure Data Engineer
Responsibilities:
- Involved in gathering requirements, design, implementation, deployment, testing and maintaining of the applications to meet the organization's needs using SCRUM methodology.
- Participated in scrum meetings and coordinated with Business Analysts to understand business needs and implement them in a functional design.
- Used Azure Data Factory extensively to ingest data from disparate source systems.
- Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems.
- Automated jobs using the different trigger types in ADF (Event, Schedule, and Tumbling Window).
- Used Cosmos DB for storing catalog data and for event sourcing in order processing pipelines.
- Designed and developed user defined functions, stored procedures, triggers for Cosmos DB
- Analyzed the data flow from different sources to target to provide the corresponding design Architecture in Azure environment.
- Take initiative and ownership to provide business solutions on time.
- Created High level technical design documents and Application design documents as per the requirements and delivered clear, well-communicated and complete design documents.
- Created DA specs and Mapping Data flow and provided the details to the developer along with HLDs.
- Created Build definition and Release definition for Continuous Integration and Continuous Deployment.
- Created Application Interface Document for the downstream to create a new interface to transfer and receive the files through Azure Data Share.
- Creating pipelines, data flows and complex data transformations and manipulations using ADF and PySpark with Databricks
- Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming for streaming analytics in Databricks. Created and provisioned the different Databricks clusters needed for batch and continuous streaming data processing, and installed the required libraries for the clusters.
- Integrated Azure Active Directory authentication to every Cosmos DB request sent and demoed feature to Stakeholders
- Improved performance by optimizing computing time to process the streaming data, and saved the company cost by optimizing cluster run time. Performed ongoing monitoring, automation, and refinement of data engineering solutions, preparing complex SQL views and stored procedures in Azure SQL DW and Hyperscale.
- Designed and developed a new solution to process near-real-time (NRT) data using Azure Stream Analytics, Azure Event Hub, and Service Bus queues.
- Created Linked service to land the data from SFTP location to Azure Data Lake.
- Created numerous pipelines in Azure using Azure Data Factory v2 to get data from disparate source systems, using Azure activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.
- Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
- Extensively used SQL Server Import and Export Data tool.
- Created database users, logins and permissions to setup.
- Working with complex SQL, Stored Procedures, Triggers, and packages in large databases from various servers.
- Also worked on a POC to check compatibility issues before migrating a few services to GCP.
- Developed and deployed the outcome using Spark and Scala code on a Hadoop cluster running on GCP.
- Used Google Cloud Functions with Python to load data into BigQuery on arrival of CSV files in a GCS bucket (see the sketch at the end of this section).
- Wrote an application to download a SQL Dump from their equipment maintenance site and then load it in a GCS bucket.
- Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
- Experience in GCP Dataproc, GCS, Cloud Functions, and BigQuery.
- Experience in moving data between GCP and Azure using Azure Data Factory.
- Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Experienced with GCP features including Google Compute Engine, Google Cloud Storage, VPC, Cloud Load Balancing, and IAM.
- Implemented Google Cloud IAM roles on Organization, Project, and resource level
- Implemented and monitored GCP Cloud monitoring and Logging (Stackdriver).
- Implemented Apache Airflow for authoring, scheduling, and monitoring data pipelines.
- Helped team members resolve technical issues; handled troubleshooting, project risk and issue identification and management, resource issues, monthly one-on-ones, and weekly meetings.
- Redesigned the views in Snowflake to increase performance.
- Unit tested the data between Redshift and Snowflake.
- Developed a data warehouse model in Snowflake for over 100 datasets using WhereScape. Created reports in Looker based on Snowflake connections.
- Involved in the complete project life cycle starting from design discussion to production deployment.
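A minimal sketch of the GCS-to-BigQuery load path referenced above, assuming a first-generation, event-triggered Cloud Function; the target table (analytics.equipment_events), header-row setting, and schema autodetection are illustrative assumptions rather than the actual production configuration.

```python
# GCS-triggered Cloud Function: load newly arrived CSV files into BigQuery.
from google.cloud import bigquery


def load_csv_to_bq(event, context):
    """Background function fired on object finalize in the source GCS bucket."""
    bucket = event["bucket"]
    name = event["name"]
    if not name.lower().endswith(".csv"):
        return  # ignore non-CSV objects

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,       # assume a header row
        autodetect=True,           # let BigQuery infer the schema
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    uri = "gs://{}/{}".format(bucket, name)
    load_job = client.load_table_from_uri(
        uri, "analytics.equipment_events", job_config=job_config
    )
    load_job.result()  # block so failures surface in the function logs
```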
Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, SQL Server, Teradata Utilities, Windows Remote Desktop, UNIX Shell Scripting, Azure PowerShell, Databricks, Python, Erwin Data Modelling Tool, Azure Cosmos DB, Azure Stream Analytics, Azure Event Hub, Azure Machine Learning.
Confidential
Sr. Data Engineer-AWS
Responsibilities:
- Followed Agile Software Development Methodology to build the application iteratively and incrementally. Participated in scrum related activities and daily scrum meetings.
- Involved in gathering requirements, design, implementation, deployment, testing and maintaining of the applications to meet the organization's needs using SCRUM methodology.
- Participated in scrum meetings and coordinated with Business Analysts to understand business needs and implement them in a functional design.
- Worked on server infrastructure development on AWS Cloud, with extensive usage of Virtual Private Cloud (VPC), CloudFormation, Lambda, CloudFront, CloudWatch, IAM, EBS, Security Groups, Auto Scaling, DynamoDB, Route 53, and CloudTrail.
- Designed and built a multi-terabyte, full end-to-end data warehouse infrastructure from the ground up on Confidential Redshift, handling millions of records every day at large scale.
- Supported an AWS Cloud environment with 2000+ AWS instances; configured Elastic IPs and elastic storage, deployed across multiple Availability Zones for high availability.
- Set up log analysis by shipping AWS logs to Elasticsearch and Kibana, and managed searches, dashboards, custom mappings, and data automation.
- Wrote python scripts to process semi-structured data in formats like JSON.
- Used ETL component Sqoop to extract the data from MySQL and load data into HDFS.
- Good hands-on experience with the Kafka Python API, developing producers and consumers that write Avro schemas.
- Managed Hadoop clusters using Cloudera. Extracted, Transformed, and Loaded (ETL) of data from multiple sources like Flat files, XML files, and Databases.
- Used Cloud Watch for monitoring the server's (AWS EC2 Instances) CPU utilization and system memory.
- Involved in the development of the UI using JSP, HTML5, CSS3, JavaScript, jQuery, AngularJS.
- Worked on JavaScript framework (Backbone.JS) to augment browser-based applications with MVC capability.
- Managed the artifacts generated by Maven and Gradle in the Nexus repository and converted pom.xml builds to Gradle build files.
- Designed infrastructure for AWS applications and workflows using Terraform, and implemented continuous delivery of AWS infrastructure with Terraform.
- Developed Python scripts to back up EBS volumes using AWS Lambda and CloudWatch (see the sketch at the end of this section).
- Developed and deployed stacks using AWS CloudFormation Templates (CFT) and Terraform.
- Used Jenkins pipelines to drive all microservices builds out to the Docker registry and then deployed them to Kubernetes.
- Managed Docker orchestration and containerization using Kubernetes, handling the deployment, scaling, and management of Docker containers and Pods.
- Automated builds using Maven and scheduled automated nightly builds using Jenkins.
- Extensively worked on Hudson and Jenkins for continuous integration and for end-to-end automation of all builds and deployments.
- Resolved update, merge, and password authentication issues in Bamboo and JIRA. Developed and maintained Python/Shell/PowerShell scripts for build and release tasks and task automation.
- Designed and implemented large scale business critical systems using Object oriented Design and Programming concepts using Python and Django.
- Experienced in working with asynchronous frameworks like NodeJS and Twisted, and in designing automation frameworks using Python and shell scripting.
- Used Ansible playbooks to set up and configure the continuous delivery pipeline and Tomcat servers, and deployed microservices, including provisioning AWS environments with Ansible playbooks. Automated various infrastructure activities such as continuous deployment, application server setup, and stack monitoring using Ansible playbooks, and integrated Ansible with Jenkins.
- Prepared projects, dashboards, reports and questions for all JIRA related services.
- Built a POC to explore AWS Glue capabilities for data cataloging and data integration.
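A minimal sketch of the EBS-backup Lambda mentioned above, assuming a scheduled CloudWatch Events trigger and a hypothetical Backup=true volume tag; snapshot retention and cleanup are omitted, and the tag filter is an illustrative assumption.

```python
# Scheduled Lambda: snapshot every EBS volume tagged Backup=true.
import datetime
import boto3

ec2 = boto3.client("ec2")


def lambda_handler(event, context):
    """Invoked on a CloudWatch Events schedule."""
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "tag:Backup", "Values": ["true"]}]
    )["Volumes"]

    for vol in volumes:
        desc = "lambda-backup-{}-{}".format(
            vol["VolumeId"], datetime.date.today().isoformat()
        )
        snap = ec2.create_snapshot(VolumeId=vol["VolumeId"], Description=desc)
        # Tag the snapshot so retention jobs can find it later.
        ec2.create_tags(
            Resources=[snap["SnapshotId"]],
            Tags=[{"Key": "CreatedBy", "Value": "lambda-ebs-backup"}],
        )

    return {"volumes_backed_up": len(volumes)}
```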
Environment: AWS (EC2, S3, EBS, ELB, RDS, SNS, SQS, VPC, Redshift, CloudFormation, CloudWatch, ELK Stack), Jenkins, Ansible, Python, Shell Scripting, PowerShell, Git, Microservices, Jira, JBoss, Bamboo, Kubernetes, Docker, WebLogic, Maven, WebSphere, Unix/Linux, Nagios, Splunk, AWS Glue.
Confidential
Hadoop Engineer
Responsibilities:
- Involved in importing the real time data to Hadoop using Kafka and implemented the Oozie job for daily data.
- Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
- Imported data from different sources like HDFS and HBase into Spark RDDs.
- Developed Spark scripts using Python shell commands as per the requirements.
- Issued SQL queries via Impala to process the data stored in HDFS and HBase.
- Used the Spark-Cassandra Connector to load data to and from Cassandra.
- Used a RESTful web services API to connect with MapR tables; the database connection was developed through this API.
- Involved in developing Hive DDLs to create, alter, and drop Hive tables, and worked with Storm and Kafka.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Experience in data migration from RDBMS to Cassandra. Created data-models for customer data using the Cassandra Query Language.
- Responsible for building scalable distributed data solutions using a Hadoop cluster environment with the Hortonworks distribution.
- Involved in developing Spark scripts for data analysis in both Python and Scala. Designed and developed various modules of the application with J2EE design architecture.
- Implemented modules using Core Java APIs, Java collection and integrating the modules.
- Experienced in transferring data from different data sources into HDFS systems using Kafka producers, consumers and Kafka brokers
- Installed Kibana using salt scripts and built custom dashboards that visualize aspects of important data stored in Elasticsearch.
- Used File System Check (FSCK) to check the health of files in HDFS and used Sqoop to import data from SQL Server to Cassandra.
- Streamed transactional data to Cassandra using Spark Streaming and Kafka (see the sketch at the end of this section).
- Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
- Created ConfigMap and DaemonSet files to install Filebeat on Kubernetes pods to send log files to Logstash or Elasticsearch, in order to monitor the different types of logs in Kibana.
- Created a database in InfluxDB, worked on the interface created for Kafka, and checked the measurements on the databases.
- Installed Kafka Manager to track consumer lag and monitor Kafka metrics; also used it for adding topics, partitions, etc. Successfully generated consumer group lag reports from Kafka using its API.
- Ran log aggregation, website activity tracking, and commit logs for distributed systems using Apache Kafka.
- Involved in creating Hive tables, and loading and analyzing data using hive queries.
- Developed multiple MapReduce jobs in Java for data cleaning and preprocessing. Loaded data from different sources (databases and files) into Hive using the Talend tool.
- Used Oozie and Zookeeper operational services for coordinating cluster and Scheduling workflows.
- Implemented Flume, Spark, and Spark Streaming framework for real time data processing.
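A hedged sketch of the Kafka-to-Cassandra streaming path referenced above, written with Structured Streaming and foreachBatch; the broker, topic, keyspace, table, and schema are placeholders, and it assumes the Spark Kafka integration and spark-cassandra-connector packages are on the cluster.

```python
# Stream transactional records from Kafka and append each micro-batch to Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = (SparkSession.builder
         .appName("txn-stream")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

schema = (StructType()
          .add("txn_id", StringType())
          .add("amount", DoubleType())
          .add("event_time", TimestampType()))

txns = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "transactions")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("t"))
        .select("t.*"))


def write_to_cassandra(batch_df, batch_id):
    # Each micro-batch is appended to the Cassandra table via the connector.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="payments", table="transactions")
     .mode("append")
     .save())


(txns.writeStream
     .foreachBatch(write_to_cassandra)
     .option("checkpointLocation", "/tmp/checkpoints/txns")
     .start()
     .awaitTermination())
```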
Environment: Hadoop, Python, HDFS, Hive, Scala, MapReduce, Agile, Cassandra, Kafka, Storm, AWS, YARN, Spark, ETL, Teradata, NoSQL, Oozie, Java, Talend, Linux, Kibana, HBase
Confidential
Spark/Big Data Engineer
Responsibilities:
- Designed a data workflow model to create a data lake in the Hadoop ecosystem so that reporting tools like Tableau can plugin to generate the necessary reports.
- Created Source to Target Mappings (STM) for the required tables by understanding the business requirements for the reports.
- Worked on Snowflake environment to remove redundancy and load real-time data from various data sources into HDFS using Kafka
- Developed PySpark and SparkSQL code to process the data in Apache Spark on Amazon EMR, performing the necessary transformations based on the STMs developed (see the sketch at the end of this section).
- Hive tables were created on HDFS to store the data processed by Apache Spark on the Cloudera Hadoop Cluster in Parquet format.
- Written multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file-formats including XML, JSON, CSV, and other compressed file formats.
- Loading log data directly into HDFS using Flume.
- Leveraged AWS S3 as a storage layer for HDFS.
- Encoded and decoded JSON objects using PySpark to create and modify the data frames in Apache Spark
- Used Bitbucket as the code repository and frequently used Git commands (clone, push, pull, etc.) against the Git repository.
- The Hadoop Resource Manager was used to monitor the jobs run on the Hadoop cluster.
- Used Confluence to store the design documents and the STMs
- Met with business and engineering teams on a regular basis to keep requirements in sync and deliver on them.
- Used Jira as an agile tool to keep track of the stories that were worked on using the Agile methodology
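A minimal PySpark sketch of the EMR transformation flow described above; the S3 path, column mappings, and Hive table name are illustrative placeholders standing in for the actual STM-driven logic.

```python
# Read raw JSON from S3, apply source-to-target mappings, and write a Parquet-backed Hive table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("stm-transform")
         .enableHiveSupport()
         .getOrCreate())

# Source: semi-structured JSON landed in S3 (placeholder bucket).
orders = spark.read.json("s3://example-raw-bucket/orders/")

# Apply the STM: rename, cast, derive, and filter (illustrative columns).
curated = (orders
           .withColumnRenamed("ord_id", "order_id")
           .withColumn("order_ts", F.to_timestamp("order_date"))
           .withColumn("net_amount", F.col("gross_amount") - F.col("discount"))
           .filter(F.col("status").isin("SHIPPED", "DELIVERED")))

# Target: partitioned Parquet table that reporting tools such as Tableau can query.
(curated.write
 .mode("overwrite")
 .format("parquet")
 .partitionBy("status")
 .saveAsTable("analytics.orders_curated"))
```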
Environment: Spark, Hive, Pig, Flume, IntelliJ IDE, AWS CLI, AWS EMR, AWS S3, REST API, shell scripting, Git, PySpark, SparkSQL
Confidential
Hadoop Developer
Responsibilities:
- Responsible for building scalable distributed data solutions using Hadoop
- Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster
- Setup and benchmarked Hadoop/HBase clusters for internal use
- Developed simple to complex MapReduce jobs in Java, as well as jobs implemented using Hive and Pig (see the streaming sketch at the end of this section).
- Optimized Map/Reduce Jobs to use HDFS efficiently by using various compression mechanisms
- Handled importing of data from various data sources, performed transformations using Hive, MapReduce, loaded data into HDFS, and Extracted the data from MySQL into HDFS using Sqoop
- Used UDFs to implement business logic in Hadoop.
- Used Impala to read, write and query the Hadoop data in HBase.
- Developed programs in Spark to use in the application for faster data processing than standard MapReduce programs.
- Implemented business logic by writing UDFs in Java and used various UDFs from Piggybank and other sources.
- Continuous monitoring and managing the Hadoop cluster using Cloudera Manager.
- Worked with application teams to install an operating system, Hadoop updates, patches, version upgrades as required
- Installed Oozie workflow engine to run multiple Hive and Pig jobs.
- Experience with Storm for real-time processing of data.
- Used Solr to navigate through data sets in the HDFS storage.
- Loading log data directly into HDFS using Flume.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Written multiple MapReduce programs in Java for data extraction, transformation, and aggregation from multiple file-formats including XML, JSON, CSV, and other compressed file formats.
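The cleansing and aggregation jobs above were written in Java MapReduce; for consistency with the other sketches, the Hadoop Streaming equivalent below is in Python and illustrates the same cleanse-and-aggregate pattern. Field names, the delimiter, and paths are illustrative assumptions.

```python
# Hypothetical invocation:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input /data/raw/clicks -output /data/clean/clicks_by_page

# mapper.py -- drop malformed rows, emit page<TAB>1
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    if len(fields) != 3:          # skip malformed records
        continue
    user_id, page, ts = fields
    if not page:                  # skip rows missing the key field
        continue
    print("{}\t1".format(page))
```

```python
# reducer.py -- sum counts per page (input arrives grouped and sorted by key)
import sys

current_page, count = None, 0
for line in sys.stdin:
    page, value = line.rstrip("\n").split("\t")
    if page == current_page:
        count += int(value)
    else:
        if current_page is not None:
            print("{}\t{}".format(current_page, count))
        current_page, count = page, int(value)

if current_page is not None:
    print("{}\t{}".format(current_page, count))
```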
Environment: Hadoop, MapReduce, HDFS, Hive, Spark, Pig, Java (JDK 1.6), SQL, Cloudera Manager, Sqoop, Storm, Solr, Mahout, Flume, Oozie, Eclipse
