
Sr. Azure Data Engineer Resume


Atlanta, GA

PROFESSIONAL SUMMARY:

  • 9+ years of IT experience in analysis, design, and development, including 5+ years in Big Data technologies such as Spark, MapReduce, Hive, YARN, and HDFS, with programming languages including Java and Python.
  • 4+ years of experience in Data warehouse / ETL Developer role.
  • Strong experience using the Spark RDD API, Spark DataFrame/Dataset API, advanced SQL, Spark SQL, ANSI SQL, SQL database tuning, and Spark ML frameworks for building end-to-end data pipelines (a minimal PySpark sketch follows this list).
  • Completed POCs on newly adopted technologies such as Apache Airflow, Dremio, Snowflake, and GitLab.
  • Strong experience building data pipelines and performing large-scale data transformations.
  • In-Depth knowledge in working with Distributed Computing Systems and parallel processing techniques to efficiently deal with Big Data.
  • Firm understanding of Hadoop architecture and its components, including HDFS, YARN, MapReduce, Hive, Pig, HBase, Kafka, and Oozie.
  • Strong experience building Spark applications using PySpark and Python.
  • Good experience troubleshooting and fine-tuning long-running Spark applications.
  • Extensive hands-on experience tuning Spark jobs.
  • Experienced in working with structured data using HiveQL and optimizing Hive queries.
  • Good experience working with real-time streaming pipelines using Kafka and Spark Streaming.
  • Strong experience working with Hive for performing various data analyses.
  • Detailed exposure to various Hive concepts such as partitioning, bucketing, join optimizations, SerDes, built-in UDFs, and custom UDFs.
  • Good experience automating end-to-end data pipelines using the Oozie workflow orchestrator.
  • Good experience working with Cloudera, Hortonworks, and AWS big data services.
  • Strong experience using and integrating various AWS cloud services like S3, EMR, Glue Metastore, Athena, and Redshift into the data pipelines.
  • Strong experience leading multiple Azure Big Data and data transformation implementations in various domains.
  • Worked on Docker-based containers for running Airflow.
  • Expertise in configuring and installing SQL Server on OLTP and OLAP systems, from high-end to low-end environments.
  • Strong experience in performance tuning & index maintenance.
  • Detailed exposure to Azure tools such as Azure Data Lake, Azure Databricks, Azure Data Factory, HDInsight, Azure SQL Server, and Azure DevOps.
  • Experience in analyzing, designing, and developing ETL Strategies and processes, writing ETL specifications.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
  • Developed and migrated on-premises databases to Azure Data Lake Store using Azure Data Factory.
  • Proficient knowledge and hands-on experience writing shell scripts in Linux.
  • Experienced in requirement analysis, application development, application migration and maintenance using Software Development Lifecycle (SDLC) and Python/Java technologies.
  • Excellent technical and analytical skills with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.
  • Adequate knowledge and working experience in Agile and Waterfall Methodologies.
  • Defined user stories and drove the Agile board in JIRA during project execution; participated in sprint demos and retrospectives.
  • Good interpersonal and communication skills, strong problem-solving skills, explores and adopts new technologies with ease, and a good team member.
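
As an illustration of the Spark DataFrame and Spark SQL work summarized above, here is a minimal PySpark sketch; the table, column, and output path names are hypothetical and are not taken from any specific engagement below.

    # Minimal PySpark sketch: equivalent DataFrame-API and Spark SQL aggregations.
    # Assumes a Hive metastore is configured and a hypothetical table sales_db.orders exists.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("summary-sketch")
        .enableHiveSupport()
        .getOrCreate()
    )

    orders = spark.table("sales_db.orders")  # hypothetical Hive table

    # DataFrame API: filter and aggregate.
    daily = (
        orders
        .filter(F.col("order_date") >= "2023-01-01")
        .groupBy("order_date")
        .agg(F.sum("amount").alias("total_amount"))
    )

    # The same logic expressed in Spark SQL against a temporary view.
    orders.createOrReplaceTempView("orders_v")
    daily_sql = spark.sql("""
        SELECT order_date, SUM(amount) AS total_amount
        FROM orders_v
        WHERE order_date >= '2023-01-01'
        GROUP BY order_date
    """)

    # Persist the result as date-partitioned Parquet (hypothetical path).
    daily.write.mode("overwrite").partitionBy("order_date").parquet("/data/curated/daily_orders")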

TECHNICAL SKILLS:

Hadoop Ecosystem: HDFS, SQL, YARN, Pig Latin, MapReduce, Hive, Sqoop, Spark, Zookeeper, Oozie, Kafka, Storm, Flume

Programming Languages: Python, PySpark, Java, Shell Scripting

Big Data Platforms: Hortonworks, Cloudera

Cloud Platform: Azure (ADF, Azure Analytics, HDInsight, ADL, Synapse)

Operating Systems: Linux, Windows, UNIX

Databases: MySQL, HBase, MongoDB, Snowflake, Dremio

Development Methods: Agile/Scrum, Waterfall

IDEs: PyCharm, IntelliJ, Ambari

Data Visualization: Tableau, BO Reports, Dremio

PROFESSIONAL EXPERIENCE:

Confidential, Atlanta, GA

Sr. Azure Data Engineer

Responsibilities:

  • Contributed to the development of PySpark DataFrames in Azure Databricks to read data from Data Lake or Blob Storage and transform it using the Spark SQL context (a hedged sketch of this pattern follows this list).
  • Created, developed, and deployed high-performance ETL pipelines with PySpark and Azure Data Factory.
  • Developed ETL pipelines in and out of the data warehouse using a combination of Python, Dremio, and Snowflake; used SnowSQL to write SQL queries against Snowflake.
  • Responsible for estimating the cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
  • Worked on Azure copy operations to load data from an on-premises SQL Server to an Azure SQL Data Warehouse.
  • Worked on redesigning the existing architecture and implementing it on Azure SQL.
  • Experience with Azure SQL database configuration and tuning automation, vulnerability assessment, auditing, and threat detection.
  • Integrated data storage solutions with Spark, especially Azure Data Lake Storage, Blob Storage, and Snowflake.
  • Involved in migrating objects from Teradata to Snowflake and created Snowpipe for continuous data loading.
  • Improved the performance of Hive and Spark tasks.
  • Knowledge of Kimball data modeling and dimensional modeling techniques.
  • Worked on a cloud vendor evaluation to identify the best cloud vendor based on a set of strict success criteria.
  • Used Hive queries to analyze huge data sets of structured, unstructured, and semi-structured data.
  • Created Hive scripts from Teradata SQL scripts for data processing on Hadoop.
  • Developed Hive tables to hold processed findings, as well as Hive scripts to convert and aggregate heterogeneous data.
  • Created and utilized complex data types for storing and retrieving data in Hive using HQL.
  • Used structured data in Hive to enhance performance using advanced techniques such as bucketing, partitioning, and optimizing self-joins.
  • Created a series of technology demos utilizing the Confidential Edison Arduino shield, Azure Event Hub, and Stream Analytics, to show the possibilities of Azure Stream Analytics.
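
A hedged sketch of the Data Lake/Blob read-and-transform pattern referenced in the first bullet above; the storage account, container, column names, and authentication setup are assumptions for illustration only.

    # Read raw CSV files from ADLS Gen2 (or Blob) into a DataFrame, clean them with
    # Spark SQL, and write curated Parquet. Assumes cluster-level authentication
    # (e.g., a service principal) is already configured; names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("adls-read-sketch").getOrCreate()

    raw = (
        spark.read
        .option("header", "true")
        .csv("abfss://raw@examplestorage.dfs.core.windows.net/landing/customers/")
    )

    raw.createOrReplaceTempView("customers_raw")
    cleaned = spark.sql("""
        SELECT customer_id,
               UPPER(TRIM(customer_name)) AS customer_name,
               CAST(signup_date AS DATE)  AS signup_date
        FROM customers_raw
        WHERE customer_id IS NOT NULL
    """)

    # Persist the curated output for downstream consumers.
    (cleaned.write
            .mode("overwrite")
            .parquet("abfss://curated@examplestorage.dfs.core.windows.net/customers/"))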

Environment: Azure Data Factory (V2), Azure Databricks, Python, SSIS, Azure SQL, Azure Data Lake, Azure Blob Storage, Spark 2.0, Hive, Dremio Platform, ANSI SQL, etc.

Confidential, Palo Alto, CA

Azure Data Engineer

Responsibilities:

  • Managed the Spark Databricks clusters through proper troubleshooting, estimation, and monitoring.
  • Performed data aggregation and validation on Azure HDInsight using Spark scripts written in Python.
  • Performed monitoring and management of the Hadoop cluster by using Azure HDInsight.
  • Involved in extraction, transformation and loading of data directly from different source systems (flat files/Excel/Oracle/SQL) using SAS/SQL, SAS/macros.
  • Generated PL/SQL scripts for data manipulation, validation, and materialized views for remote instances.
  • Created partitioned tables in Hive, designed a data warehouse using Hive external tables, and created Hive queries for analysis.
  • Created and modified several database objects such as Tables, Views, Indexes, Constraints, Stored procedures, Packages, Functions and Triggers using SQL and PL/SQL.
  • Created large datasets by combining individual datasets using various inner and outer joins in SAS/SQL and dataset sorting and merging techniques using SAS/Base.
  • Extensively worked on Shell scripts for running SAS programs in batch mode on UNIX.
  • Wrote Python scripts to parse XML documents and load the data into the database (a minimal sketch follows this list).
  • Used Hive, Impala and Sqoop utilities and Oozie workflows for data extraction and data loading.
  • Created HBase tables to store data in various formats coming from different sources.
  • Responsible for importing log files from various sources into HDFS using Flume.
  • Responsible for translating business and data requirements into logical data models in support of enterprise data models, ODS, OLAP, OLTP, and operational data structures.
  • Created SSIS packages to migrate data from heterogeneous sources such as MS Excel, flat files, and CSV files.
  • Provided thought leadership for the architecture and design of Big Data analytics solutions for customers; actively drove Proof of Concept (POC) and Proof of Technology (POT) evaluations and implemented Big Data solutions.
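
A minimal sketch of the XML-parsing-and-load task mentioned above; the XML layout, table schema, and use of SQLite are illustrative assumptions, not the actual source system or target database.

    # Parse a simple XML file of <order> elements and bulk-insert the rows into a
    # SQLite table. Element and column names are hypothetical.
    import sqlite3
    import xml.etree.ElementTree as ET

    def load_orders(xml_path: str, db_path: str) -> int:
        tree = ET.parse(xml_path)
        rows = []
        for order in tree.getroot().findall("order"):  # hypothetical element name
            rows.append((
                order.get("id"),
                order.findtext("customer"),
                float(order.findtext("amount", default="0")),
            ))

        with sqlite3.connect(db_path) as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS orders (id TEXT, customer TEXT, amount REAL)"
            )
            conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
        return len(rows)

    if __name__ == "__main__":
        print(load_orders("orders.xml", "orders.db"))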

Environment: ADF, Databricks, ADL, Spark, Hive, HBase, Sqoop, Flume, Blob Storage, Cosmos DB, MapReduce, HDFS, Cloudera, SQL, Apache Kafka, Azure, Python, Power BI, Unix, SQL Server.

Confidential, Detroit, MI

Big Data Developer

Responsibilities:

  • Involved in requirement gathering and business analysis, and translated business requirements into technical designs in Hadoop and Big Data. Involved in a Sqoop implementation that helped load data from various RDBMS sources to Hadoop systems and vice versa.
  • Developed Python scripts to extract the data from the web server output files to load into HDFS.
  • Wrote a Python script to automate launching the EMR cluster and configuring the Hadoop applications.
  • Extensively worked with Avro and Parquet files and converted data between the two formats; parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark (see the sketch after this list).
  • Involved in analyzing system failures, identifying root causes, and recommending courses of action; documented the system processes and procedures for future reference.
  • Involved in Configuring Hadoop cluster and load balancing across the nodes.
  • Involved in Hadoop installation, commissioning, decommissioning, balancing, troubleshooting, monitoring, and debugging configuration of multiple nodes using the Hortonworks platform.
  • Involved in working with Spark on top of Yarn/MRv2 for interactive and Batch Analysis.
  • Involved in managing and monitoring Hadoop cluster using Cloudera Manager.
  • Used Python and Shell scripting to build pipelines.
  • Developed data pipeline using Sqoop, HQL, Spark and Kafka to ingest Enterprise message delivery data into HDFS.
  • Developed workflows in Oozie and Airflow to automate the tasks of loading data into HDFS and pre-processing it with Pig and Hive.
  • Assisted in creating and maintaining technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
  • Integrated Hadoop into traditional ETL, accelerating the extraction, transformation, and loading of massive semi-structured and unstructured data. Loaded unstructured data into the Hadoop Distributed File System (HDFS).
  • Created Hive tables with dynamic and static partitioning, including buckets, for efficiency; also created external tables in Hive for staging purposes.
  • Loaded Hive tables with data and wrote Hive queries that run on MapReduce; created a customized BI tool for manager teams that performs query analytics using HiveQL.
  • Aggregated RDDs based on the business requirements, converted the RDDs into DataFrames, saved them as temporary Hive tables for intermediate processing, and stored the results in HBase/Cassandra and RDBMSs.
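
A hedged sketch of the semi-structured JSON-to-Parquet conversion noted above; the HDFS paths, field names, and nested structure are assumptions for illustration.

    # Read line-delimited JSON from HDFS, flatten a nested field, and write
    # date-partitioned Parquet for efficient downstream queries.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("json-to-parquet-sketch").getOrCreate()

    events = spark.read.json("hdfs:///data/raw/events/")  # assumed HDFS path

    flat = (
        events
        .withColumn("event_ts", F.to_timestamp("event_time"))  # assumed field
        .withColumn("device_id", F.col("device.id"))            # assumed nested struct
        .drop("device")
    )

    (flat.withColumn("event_date", F.to_date("event_ts"))
         .write.mode("overwrite")
         .partitionBy("event_date")
         .parquet("hdfs:///data/curated/events_parquet/"))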

Environment: Hadoop 3.0, Hive 2.1, J2EE, JDBC, Pig 0.16, HBase 1.1, Sqoop, NoSQL, Impala, Java, Spring, MVC, XML, Spark 1.9, PL/SQL, HDFS, JSON, Hibernate, Bootstrap, jQuery.

Confidential

Data Engineer

Responsibilities:

  • Anchored artifacts for multiple milestones (application design, code development, testing, and deployment) in the software lifecycle.
  • Developed an Apache Storm program to consume alarms in real time from Kafka, enrich them, and pass them to the EEIM application.
  • Created a rules engine in Apache Storm to categorize alarms into Detection, Interrogation, and Association types before processing.
  • Responsible for developing the EEIM application as an Apache Maven project and committing the code to Git.
  • Analyzed the alarms and enhanced the EEIM application using Apache Storm to predict the root cause of an alarm and the exact device where the network failure happened.
  • Accumulated the EEIM alarm data in the NoSQL database MongoDB and retrieved it from MongoDB when necessary.
  • Built Fiber to the Neighborhood/Node (FTTN) and Fiber to the Premises (FTTP) topologies using Apache Spark and Apache Hive.
  • Processed system logs using Logstash, stored them in Elasticsearch, and created dashboards using Kibana.
  • Regularly tuned the performance of Hive queries to improve data processing and retrieval.
  • Provided technical support for debugging, code fixes, platform issues, missing data points, unreliable data source connections, and big data transit issues.
  • Developed Java and Python applications to call external REST APIs to retrieve weather, traffic, and geocode information (a hedged sketch follows this list).
  • Working experience on the Azure Databricks cloud, organizing data into notebooks and making it easy to visualize data using dashboards.
  • Managed the Spark Databricks clusters through proper troubleshooting, estimation, and monitoring.
  • Performed data aggregation and validation on Azure HDInsight using Spark scripts written in Python.
  • Performed monitoring and management of the Hadoop cluster by using Azure HDInsight.
  • Worked with Jira and Bitbucket, source control systems such as Git and SVN, and development tools such as Jenkins and Artifactory.
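
A hedged sketch of calling an external REST API from Python, as mentioned above; the endpoint URL, parameters, and response fields are placeholders rather than the actual weather/traffic/geocode services used, and the third-party requests library is assumed to be available.

    # Call a (hypothetical) weather REST endpoint and return the parsed JSON body.
    import requests

    def fetch_weather(city: str, api_key: str, timeout: float = 10.0) -> dict:
        response = requests.get(
            "https://api.example.com/v1/weather",  # placeholder URL
            params={"city": city, "apikey": api_key},
            timeout=timeout,
        )
        response.raise_for_status()  # surface HTTP errors instead of silently continuing
        return response.json()

    if __name__ == "__main__":
        data = fetch_weather("Detroit", api_key="demo-key")
        print(data.get("temperature"), data.get("conditions"))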

Environment: PySpark, MapReduce, HDFS, Sqoop, Flume, Kafka, Hive, Pig, HBase, SQL, Shell Scripting, Eclipse, SQL Developer, Git, SVN, JIRA, Unix.

Confidential

Data Warehouse Developer

