Azure Data Engineer Resume

Chandler, AZ

SUMMARY

  • 8+ years of professional experience in the IT industry, including design, development, analysis, and big data work in Spark, Hadoop, Pig, and HDFS environments; well-versed in a variety of programming languages, big data technologies, and data governance tools.
  • Working knowledge of migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Storage using Azure Data Factory.
  • Worked on various projects involving data modeling, system/data analysis, design, and development for both OLTP and data warehouse applications.
  • Experience with AWS services such as S3, Athena, Redshift Spectrum, Redshift, EMR, Glue, Data Pipeline, Step Functions, CloudWatch, SNS, and CloudFormation.
  • Experience with the Snowflake cloud data warehouse and AWS S3 buckets for integrating data from multiple source systems, including loading nested JSON-formatted data into Snowflake tables.
  • Extensive experience developing Spark applications using Spark SQL in Databricks to extract, transform, and aggregate data from multiple formats, uncovering insights into customer usage patterns (a PySpark sketch of this pattern follows this list).
  • Improved the efficiency of existing algorithms by working with Spark using SparkContext, Spark SQL, Spark MLlib, DataFrames, and pair RDDs.
  • Experience integrating DBT with cloud-based data warehouses such as Snowflake, Redshift, or BigQuery.
  • Experience working with data modeling tools like Erwin, Power Designer, and ER/Studio.
  • In-depth knowledge of and experience with tools in the Hadoop ecosystem, including Pig, Hive, HDFS, MapReduce, Sqoop, Spark, Kafka, YARN, Oozie, and ZooKeeper, as well as Hadoop architecture and components.
  • Extensive experience with Amazon Web Services (AWS) EC2, S3, and Elastic MapReduce (EMR), as well as Snowflake, Redshift, and Identity and Access Management (IAM).
  • Implemented authentication standards and security features for Snowflake and worked on Snowflake access controls and architecture design.
  • Automated OpenStack and AWS deployments using CloudFormation, Ansible, Chef, and Terraform.
  • Developed diverse and complex MapReduce and Streaming jobs using Scala and Java for data cleansing, filtering, and aggregation, and possess a comprehensive understanding of MapReduce.
  • Experience using Fivetran SQL to build and optimize data models within a data warehouse, including creating tables, indexes, and views.
  • Expert in importing and exporting data between HDFS and Relational Systems such as MySQL and Teradata using Sqoop.
  • Designed, developed, and reviewed ETL programs primarily using Matillion and Python.
  • Extensive experience with T-SQL in constructing Triggers, Tables, implementing stored Procedures, Functions, Views, User Profiles, Data Dictionaries, and Data Integrity.
  • Experience building ETL pipelines to load data into Data Vault models from various source systems.
  • Expertise with the Big-data database HBase and NoSQL databases MongoDB and Cassandra.
  • Worked with different delivery methodologies, including Agile, Waterfall, and Scrum.
  • Good Knowledge in Identifying and Resolving Snowflake DB/data issues.
  • Experienced with the cloud: Hadoop-on-Azure, AWS/EMR, Cloudera Manager.
  • Working knowledge of ETL (extraction, transformation, and loading) of data from various sources into data warehouses, as well as data processing (collecting, aggregating, and moving data from various sources) using Apache Flume, Kafka, Power BI, and Microsoft SSIS.
  • Working knowledge of Agile methodology and experience using Jira for Sprints and issue tracking.
  • Good experience with MapReduce programs, Pig scripts, and Hive commands, along with a solid understanding of and exposure to Python programming.
  • Experience with ETL tools such as Informatica, Talend, or DataStage, for extracting data from various source systems and loading it into the EDW.
  • In-depth understanding of Hadoop architecture and its components, including the Hadoop Distributed File System (HDFS), Job Tracker, Task Tracker, NameNode, DataNode, and MapReduce programming.
  • Strong understanding of the Apache Spark Streaming API and expertise in writing Spark Streaming applications on big data distributions in active cluster environments.
  • Working experience developing with the Spark RDD and DataFrame APIs for distributed data processing.
  • Experience building highly reliable, scalable big data solutions on the Cloudera, Hortonworks, and AWS Hadoop distributions.
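
A minimal PySpark sketch of the extract-transform-aggregate pattern referenced above (Spark SQL in Databricks over multiple source formats). The paths, view names, and columns are hypothetical placeholders, not details from any specific engagement.

    # Minimal sketch: read two source formats, join them, and aggregate usage by customer with Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

    # Extract: a JSON event feed and a Parquet customer dimension (placeholder paths)
    events = spark.read.json("dbfs:/raw/usage_events/")
    customers = spark.read.parquet("dbfs:/curated/customers/")

    # Transform and aggregate with Spark SQL
    events.createOrReplaceTempView("events")
    customers.createOrReplaceTempView("customers")
    usage_by_customer = spark.sql("""
        SELECT c.customer_id,
               c.segment,
               COUNT(*)            AS event_count,
               SUM(e.duration_sec) AS total_duration_sec
        FROM events e
        JOIN customers c ON e.customer_id = c.customer_id
        GROUP BY c.customer_id, c.segment
    """)

    # Load: persist the aggregate for downstream analysis (placeholder path)
    usage_by_customer.write.mode("overwrite").parquet("dbfs:/analytics/usage_by_customer/")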

TECHNICAL SKILLS

Big Data Technologies: HDFS, MapReduce, Spark, Spark SQL, HBase, Kafka, YARN, Hive, Pig, Sqoop.

Hadoop Distribution: Apache Hadoop, Cloudera, Hortonworks

Databases: HBase, Cassandra, Oracle 9i/10g, PostgreSQL, Teradata, SQL Server, DB2, MySQL 4.x/5.x

Data Services: Hive, Pig, Impala, Sqoop, Flume, Kafka.

Cloud Technologies: AWS (EC2, S3, EMR, Glue, Redshift), Azure, Azure Databricks, Snowflake.

Programming Languages: C, Scala, Python, R, SQL, PL/SQL, Pig Latin, HiveQL, Unix, Shell Scripting.

Operating Systems: UNIX, Linux, Windows, and z/OS

PROFESSIONAL EXPERIENCE

Confidential, CHANDLER, AZ

AZURE DATA ENGINEER

RESPONSIBILITIES:

  • Experienced in building data warehouses on the Azure platform using Azure Databricks and Azure Data Factory.
  • Extracted, transformed, and loaded data from source systems into Azure data storage services using a combination of Azure Data Factory and Spark SQL.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL Database, Blob Storage, and Azure SQL Data Warehouse, as well as write-back tools, while preserving backward compatibility.
  • Analyzed Azure Data Factory and Azure Databricks to build a new ETL process in Azure.
  • Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
  • Ingested data into one or more Azure services (Azure Data Lake Storage, Azure SQL Database, Azure SQL Data Warehouse) and processed the data in Azure Databricks.
  • Developed and implemented Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats, uncovering insights into customer usage patterns.
  • Experience in Database Design and development with Business Intelligence using SQL Server 2014/2016, Integration Services (SSIS), DTS Packages, SQL Server Analysis Services (SSAS), DAX, OLAP Cubes, Star Schema, and Snowflake Schema.
  • Heavily involved in testing Snowflake to understand the best possible way to use the cloud resources.
  • Consulted on Snowflake data platform solution architecture, design, development, and deployment, focused on bringing a data-driven culture across the enterprise.
  • Involved in identifying and resolving Snowflake DB/data issues; created pipeline jobs and scheduled triggers using Azure Data Factory.
  • Utilized U-SQL for data analytics/data ingestion of raw data in Azure and Blob storage.
  • Configured and deployed Azure Automation scripts for applications utilizing the Azure stack, including Compute, Blobs, Azure Data Lake, Azure Data Factory, Azure SQL, Cloud Services, ARM templates, and related utilities, with a focus on automation.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Deep experience with Snowflake multi-cluster warehouses and a strong understanding of Snowflake cloud technology.
  • Worked with Azure Stream Analytics and Azure Databricks for real-time analytics, visualized the results with Power BI, and prepared data for interactive Power BI dashboards and reporting.
  • Experience with DBT's testing framework and writing automated tests for data transformations.
  • Worked closely with a team of Data analysts and made sure to cater to their data requirements.
  • Interacted with data residing in HDFS using PySpark to process the data.
  • Worked with stakeholders on Tableau dashboards and built Alteryx workflows to prepare data for the data analysts to use in their Tableau dashboards.
  • Experience with BI tools such as Tableau or Power BI, for creating reports and visualizations based on data in the EDW.
  • Used HiveContext, which provides a superset of the functionality of SQLContext, and preferred writing queries with the HiveQL parser to read data from Hive tables (fact, syndicate).
  • Used Pig for transformations, event joins, filtering bot traffic, and some pre-aggregations before storing the data in an Azure database.
  • Experience optimizing DBT pipelines for performance, including tuning SQL queries and optimizing data pipeline orchestration.
  • Developed Spark scripts by writing custom RDDs in Scala for data transformations and performing actions on RDDs.
  • Installed and configured Hadoop, HDFS, Pig, Hive, Kafka, and MapReduce.
  • Created Hive fact tables on top of raw data from different retailers, partitioned by time dimension key, retailer name, and data supplier name, which were further processed and pulled by the analytics service engine.
  • Used advanced T-SQL features to design and tune queries that interface with the database and other applications in the most efficient manner, and created stored procedures for business logic in T-SQL.
  • Successfully loaded files into Hive and HDFS from Oracle and SQL Server using Sqoop.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the output in Parquet format on HDFS (a streaming sketch follows this list).
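
A minimal PySpark sketch of the Kafka-to-Parquet streaming pattern described above, written with Structured Streaming rather than the original RDD/DataFrame conversion; the broker address, topic name, and HDFS paths are hypothetical placeholders.

    # Minimal sketch: consume a Kafka feed with Spark Structured Streaming and persist it to HDFS as Parquet.
    # Requires the spark-sql-kafka connector on the classpath.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
              .option("subscribe", "usage-events")                # placeholder topic
              .load()
              .select(col("key").cast("string"),
                      col("value").cast("string"),
                      col("timestamp")))

    # Append the stream to Parquet files, with checkpointing for reliable file output
    query = (events.writeStream
             .format("parquet")
             .option("path", "hdfs:///data/usage_events/parquet")             # placeholder output path
             .option("checkpointLocation", "hdfs:///checkpoints/usage_events")
             .outputMode("append")
             .start())

    query.awaitTermination()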

Environment: Azure Data Factory, Databricks, Azure Data Lake, Snowflake, Python, Spark, Hive, HDFS, Kafka, Data Warehousing, Data Preparation, ETL, Agile, MS-SQL, PySpark, Eclipse, Tableau, Alteryx.

Confidential, ATLANTA

AWS DATA ENGINEER

RESPONSIBILITIES:

  • Extensively worked with AWS cloud Platform (EC2, S3, EMR, Redshift, Lambda, and Glue).
  • Experienced in implementing the Data warehouse on AWS Redshift.
  • Involved in designing and deploying multi-tier applications using AWS services (EC2, S3, RDS, DynamoDB, SNS, SQS, IAM), focusing on high availability, fault tolerance, and auto-scaling with AWS CloudFormation.
  • Used EC2 instances to execute ETL jobs and publish data to S3 buckets for external vendors.
  • Worked on AWS Lambda and implemented Step Functions workflows around Lambda.
  • Evaluated Fivetran and Matillion for streaming and batch data ingestion into Snowflake.
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
  • Evaluated Snowflake design considerations for any change in the application.
  • Defined virtual warehouse sizing in Snowflake for different types of workloads.
  • Built the Logical and Physical data model for Snowflake as per the changes required.
  • Deep Experience using Fivetran SQL to create stored procedures, functions, and triggers to automate common data management tasks.
  • Wrote templates for AWS infrastructure as code using Terraform to build staging and production environments, and defined Terraform modules such as Compute, Network, Operations, and Users for reuse across environments.
  • Created, modified, and executed DDL on AWS Redshift and Snowflake tables to load data.
  • Worked with Spark for improving performance and optimization of the existing algorithms in Hadoop.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model which gets the data from Kafka in real-time and persists it to Cassandra.
  • Expertise in Creating, Debugging, Scheduling, and Monitoring jobs using Airflow for ETL batch processing to load into Snowflake for analytical processes.
  • Extensive experience with Terraform to map complex dependencies, identify network issues, and apply key Terraform features such as infrastructure as code, execution plans, resource graphs, and change automation.
  • Worked on Microservices for Continuous Delivery environment using Docker and Jenkins.
  • Developed Kafka consumers in Python, using the consumer API to read data from Kafka topics (see the consumer sketch after this list).
  • Wrote standard complex T-SQL Queries to perform data validation and graph validation to make sure test results matched back to expected results based on business requirements.
  • Developed a pre-processing job using Spark DataFrames to flatten JSON documents into flat files.
  • Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
  • Experienced in writing live Real-time processing and core jobs using Spark Streaming with Kafka as a Data pipeline system.
  • Created and managed a Docker deployment pipeline for custom application images in the cloud using Jenkins.
  • Migrated an existing on-premises application to AWS and used AWS services such as EC2 and S3 for data set processing and storage.
  • Used JSON schemas to define table and column mappings from S3 data to Redshift.
  • Created PySpark DataFrames to bring data from DB2 to S3 and worked on maintaining the Hadoop cluster on AWS EMR.
  • Loaded data into S3 buckets using AWS Glue and PySpark; involved in filtering data stored in S3 buckets using Elasticsearch and loading data into Hive external tables.
  • Worked with BI tools such as Tableau to create weekly, monthly, and daily dashboards and reports using Tableau Desktop and publish them to the HDFS cluster.
  • Automated the ETL jobs using Airflow as an Orchestrator and Created numerous ODI interfaces to load data into Snowflake DB.
  • Used Amazon Web Services (AWS) S3 to store large amounts of data in identical/similar repositories.
  • Designed and implemented distributed data processing pipelines using Apache Spark, Hive, Python, Airflow, and other tools and languages in Hadoop Ecosystem.
  • Involved in migrating Spark Jobs from Qubole to Databricks.
  • Developed solutions for importing/exporting data from Teradata and Oracle to HDFS and S3, and from S3 to Snowflake.
  • Resolved Spark and YARN resource management issues, including shuffle issues, out-of-memory errors, heap space errors, and schema compatibility problems.
  • Imported and exported data using Sqoop between HDFS and the relational databases Oracle and Netezza.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark Data Frames, Spark RDD.
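
A minimal Python sketch of the Kafka consumer pattern referenced above, using the kafka-python client; the topic name, broker address, and consumer group are hypothetical placeholders, and downstream handling is only indicated by a comment.

    # Minimal sketch: consume JSON messages from a Kafka topic with kafka-python.
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "usage-events",                        # placeholder topic
        bootstrap_servers=["broker:9092"],     # placeholder broker
        group_id="etl-consumers",              # placeholder consumer group
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        record = message.value
        # Downstream handling (validation, enrichment, load) would go here.
        print(message.topic, message.partition, message.offset, record)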

Environment: Hadoop, Sqoop, MapReduce, SQL, Teradata, AWS, EMR, Redshift, Snowflake, Hive, Pig, Kafka, Glue, HBase, Apache Airflow.

Confidential, DENVER, COLORADO

SR. DATA ENGINEER

RESPONSIBILITIES:

  • Prepared the ETL design document, covering the database structure, change data capture, error handling, and restart and refresh strategies.
  • Worked with different data feeds such as JSON, CSV, XML, and DAT, and implemented the data lake concept.
  • Developed Informatica design mappings using various transformations.
  • Used AWS Lambda to perform data validation, filtering, sorting, and other transformations for every data change in a DynamoDB table and to load the transformed data into another data store (see the Lambda sketch after this list).
  • Developed the PySpark code for AWS Glue jobs and for EMR.
  • Using Spark, performed various transformations and actions; the final result data was saved back to HDFS and from there to the target database, Snowflake.
  • Wrote and executed several complex SQL queries in AWS Glue for ETL operations on Spark DataFrames using Spark SQL.
  • Defined AWS Lambda functions for making changes to Amazon S3 buckets and updating the Amazon DynamoDB table.
  • Good exposure to the IRI end-to-end analytics service engine and the new big data platform (Hadoop loader framework, big data Spark framework, etc.).
  • Used Docker and Kubernetes to manage microservices for continuous integration and continuous delivery.
  • Extensive experience in developing Stored Procedures, Functions, Views and Triggers, and Complex SQL queries using SQL Server, TSQL, and Oracle PL/SQL.
  • Used a Kafka producer to ingest raw data into Kafka topics and ran the Spark Streaming app to process clickstream events.
  • Responsible for data gathering from multiple sources such as Teradata and Oracle, and created Hive tables to store the processed results in tabular format.
  • Architected and delivered a hands-on production implementation of a big data MapR Hadoop solution for digital media marketing using telecom data, shipment data, point-of-sale (POS) data, and exposure and advertising data related to consumer product goods.
  • Created DataFrames from different data sources such as existing RDDs, structured data files, JSON datasets, Hive tables, and external databases.
  • Programmed ETL functions between Oracle and Amazon Redshift.
  • Loaded terabytes of different level raw data into Spark RDD for data Computation to generate the Output response.
  • Extensively worked on tuning Informatica mappings/sessions/workflows for better performance.
  • Used Python to automate Hive jobs and read configuration files.
  • Configured the Hive metastore with MySQL, which stores the metadata for Hive tables.
  • Designed and developed ETL Mappings to extract data from flat files to load the data into the target database.
  • Utilized Kubernetes and Docker as the runtime environment for the CI/CD system to build, test, and deploy.
  • Used Spark for fast data processing, working with both the Spark shell and a Spark standalone cluster.
  • Used Hive to analyze the partitioned data and compute various metrics for reporting.
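
A minimal Python sketch of the Lambda-on-DynamoDB pattern referenced above: a handler that filters and lightly transforms DynamoDB stream records and writes them to another table. The table and attribute names are hypothetical placeholders.

    # Minimal sketch: AWS Lambda handler that validates, filters, and transforms
    # DynamoDB stream records, then loads the results into another table.
    import boto3

    dynamodb = boto3.resource("dynamodb")
    target_table = dynamodb.Table("transformed_orders")  # placeholder target table

    def handler(event, context):
        processed = 0
        for record in event.get("Records", []):
            # Only react to inserts and updates coming off the stream
            if record["eventName"] not in ("INSERT", "MODIFY"):
                continue
            new_image = record["dynamodb"].get("NewImage", {})
            # Validation/filter: skip rows without an order id (placeholder attribute)
            order_id = new_image.get("order_id", {}).get("S")
            if not order_id:
                continue
            # Transformation: normalize the status field before loading
            status = new_image.get("status", {}).get("S", "UNKNOWN").upper()
            target_table.put_item(Item={"order_id": order_id, "status": status})
            processed += 1
        return {"processed": processed}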

Environment: Map Reduce, HDFS, Hive, Python, Scala, Kafka, Spark, Spark Sql, Oracle, Informatica 9.6, SQL, Sqoop, Zookeeper, AWS EMR, AWS S3, Data Pipeline, Jenkins, GIT, JIRA, Unix/Linux, Agile.

Confidential

DATA ANALYST/ENGINEER

RESPONSIBILITIES:

  • Responsible for building scalable distributed data solutions using Hadoop Cluster environment with Hortonworks distribution.
  • Converted raw data to serialized storage formats such as Avro and Parquet to reduce data processing time and increase the efficiency of data transfer over the network (see the sketch after this list).
  • Worked on building end to end data pipelines on Hadoop Data Platforms.
  • Designed, developed, and tested Extract Transform Load (ETL) applications with different types of sources.
  • Worked on full life cycle development (SDLC), involved in all stages of development.
  • Developed database triggers, packages, functions, and stored procedures using PL/SQL and maintained the scripts for various data feeds.
  • Created different types of reports in SSRS such as parameterized and cascaded parameterized reports.
  • Migrated data from Heterogeneous Data Sources and legacy systems (DB2, Access, Excel) to centralized SQL Server databases using SQL Server Integration Services (SSIS) to overcome transformation constraints.
  • Created files and tuned SQL queries in Hive using Hue, and implemented MapReduce jobs in Hive by querying the available data.
  • Modified the existing data model for the data hub and info center and generated DDL commands for deployment using Erwin.
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Experience with PySpark, using Spark libraries through Python scripting for data analysis.
  • Prepared and implemented successfully automated UNIX scripts to execute the end-to-end history load process.
  • Involved in converting HiveQL into Spark transformations using Spark RDD and Scala programming.
  • Created user-defined functions (UDFs) and user-defined aggregate (UDA) functions in Pig and Hive.
  • Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
  • Implemented custom Kafka encoders for a custom input format to load data into Kafka partitions.
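
A minimal PySpark sketch of the raw-to-Parquet conversion mentioned above; the HDFS paths and partition column are hypothetical placeholders, and schema inference is used only to keep the example short.

    # Minimal sketch: convert a raw CSV feed to Parquet so downstream jobs read a compact columnar format.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("raw-to-parquet").getOrCreate()

    # Read the raw delimited feed (placeholder path; schema inference for brevity)
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("hdfs:///landing/retail/raw_feed/"))

    # Write Parquet, partitioned by a date column assumed to exist in the feed
    (raw.write
     .mode("overwrite")
     .partitionBy("load_date")
     .parquet("hdfs:///warehouse/retail/feed_parquet/"))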

Environment: Python, HDFS, MapReduce, Flume, Kafka, Zookeeper, Pig, Hive, HQL, HBase, Spark, ETL, SSRS, SSIS, Web Services, Linux RedHat, Unix.
