Azure Data Engineer Resume
Chicago, IL
SUMMARY
- Around 8 years of professional experience across the full Software Development Life Cycle (SDLC) and Agile methodology, covering analysis, design, development, testing, implementation, and maintenance in Azure, Azure Databricks (Spark), Hadoop, data warehousing, Linux, and Scala/Python
- Experience with Big Data infrastructure, including Hadoop frameworks and the Spark ecosystem: HDFS, MapReduce, Hive, Storm, Kafka, YARN, HBase, Oozie, ZooKeeper, Flume, and Sqoop
- Strong experience with Hadoop distributions such as Cloudera and Hortonworks
- Experience developing Spark jobs in Scala for faster real-time analytics, using Spark SQL for querying
- Hands-on experience with Azure cloud services such as Blob Storage
- Migrated data from Oracle and MS SQL Server into HDFS using Sqoop and imported flat files in various formats into HDFS
- Experience in importing and exporting data into HDFS and Hive using Sqoop
- Strong knowledge of core Spark components, including RDDs, the DataFrame and Dataset APIs, streaming, in-memory processing, DAG scheduling, data partitioning, and tuning
- Performed various optimizations such as using the distributed cache for small datasets and partitioning and bucketing in Hive
- Expertise in developing Spark applications using PySpark and the Spark Streaming API in Python, deploying to YARN in both client and cluster modes; used the Spark DataFrame API to ingest data from HDFS
- Involved in converting HBase/Hive/SQL queries into Spark transformations using RDDs in Python and Scala
- Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards
- Experience migrating data to and from RDBMSs and from unstructured sources into HDFS using Sqoop
- Excellent programming skills at a high level of abstraction using Scala and Python
- Experience with job workflow scheduling and monitoring tools such as Oozie, and good knowledge of ZooKeeper
- Profound understanding of partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance (see the PySpark sketch following this summary)
- Experience writing real-time query processing using Cloudera Impala
- Strong working experience in planning and carrying out Teradata system extraction using Informatica, load processes, data warehousing, large-scale database management, and re-engineering
- Highly experienced in creating complex Informatica mappings and workflows using the major transformations
- In-depth understanding of Apache Spark job execution components such as the DAG, lineage graph, DAGScheduler, TaskScheduler, and stages; worked on NoSQL databases including HBase and MongoDB
- Experienced in performing CRUD operations using the HBase Java client API and the Solr API
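Illustrative example: the bullets above mention the Spark DataFrame API, HDFS ingestion, and Hive partitioning/bucketing. Below is a minimal PySpark sketch of that pattern; the HDFS path, database, table, and column names are hypothetical placeholders rather than details from any specific project.

```python
# Minimal PySpark sketch: ingest from HDFS and persist as a partitioned, bucketed Hive table.
# All paths, databases, tables, and column names here are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hdfs-ingest-sketch")
    .enableHiveSupport()          # needed so saveAsTable registers the table in the Hive metastore
    .getOrCreate()
)

# Read raw CSV files from HDFS into a DataFrame (schema inferred for brevity).
raw_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("hdfs:///data/raw/sales/")
)

# Write as a Hive table partitioned by region and bucketed by customer_id,
# mirroring the partitioning/bucketing optimizations described above.
(
    raw_df.write
    .mode("overwrite")
    .partitionBy("region")
    .bucketBy(8, "customer_id")
    .sortBy("customer_id")
    .format("parquet")
    .saveAsTable("sales_db.sales_bucketed")
)
```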
PROFESSIONAL EXPERIENCE
Azure Data Engineer
Confidential, Chicago, IL
Responsibilities:
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, as well as write data back to those sources
- Responsible for estimating cluster size and for monitoring and troubleshooting the Hadoop cluster
- Used Zeppelin, Jupyter notebooks, and spark-shell to develop, test, and analyze Spark jobs before scheduling customized Spark jobs
- Undertook data analysis and collaborated with the downstream analytics team to shape the data according to their requirements
- Experienced in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and tuning memory
- Wrote UDFs in Scala and stored procedures to meet specific business requirements (a PySpark analogue is sketched after this list)
- Replaced existing MapReduce programs and Hive queries with Spark applications written in Scala
- Deployed and tested developed code through CI/CD using Visual Studio Team Services (VSTS)
- Conducted code reviews for team members to ensure proper test coverage and consistent coding standards
- Responsible for documenting the process and cleaning up unwanted data
- Responsible for ingesting data from Blob Storage into Kusto and maintaining the PPE and PROD pipelines
- Expertise in creating HDInsight clusters and storage accounts as an end-to-end environment for running jobs
- Developed JSON scripts for deploying Azure Data Factory (ADF) pipelines that process data using the Cosmos activity
- Hands-on experience developing PowerShell scripts for automation
- Created build and release definitions for multiple projects (modules) in the production environment using Visual Studio Team Services (VSTS)
- Experience using the ScalaTest FunSuite framework for developing unit test cases and integration tests
- Hands-on experience with Spark SQL queries and DataFrames: importing data from data sources, performing transformations and read/write operations, and saving results to an output directory in HDFS
- Involved in running Cosmos scripts in Visual Studio 2017/2015 to check diagnostics
- Worked in an Agile development environment with two-week sprint cycles, dividing and organizing tasks
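Illustrative example: the UDF bullet above refers to Scala UDFs and stored procedures that are not reproduced here; the sketch below shows the equivalent pattern in PySpark, assuming a simple masking rule. The function, column, and sample values are hypothetical.

```python
# Minimal PySpark sketch of a business-rule UDF, analogous to the Scala UDFs described above.
# The masking rule and the column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

def mask_email(email):
    """Keep the domain, mask the local part (e.g. 'john@x.com' -> '***@x.com')."""
    if email is None or "@" not in email:
        return None
    return "***@" + email.split("@", 1)[1]

# Register for DataFrame use and for Spark SQL queries.
mask_email_udf = udf(mask_email, StringType())
spark.udf.register("mask_email", mask_email, StringType())

df = spark.createDataFrame([("john@example.com",), (None,)], ["email"])
df.select(mask_email_udf(col("email")).alias("masked_email")).show()
```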
Environment: Azure Cloud Services, Databricks, Blob Storage, ADF, Azure SQL Server, HDFS, Pig, Hive, Spark, Kafka, IntelliJ, Cosmos, Sbt, Zeppelin, YARN, Scala, SQL, Git
Big Data Engineer
Confidential
Responsibilities:
- Designed and developed ETL integration patterns using Python on Spark.
- Optimized PySpark jobs to run on secured clusters for faster data processing.
- Developed Spark scripts using Python and Scala shell commands as per requirements.
- Used Python for SQL/CRUD operations in the database and for file extraction, transformation, and generation.
- Developed Spark applications in Python (PySpark) in a distributed environment to load a large number of CSV files with differing schemas into Hive ORC tables (see the sketch at the end of this section).
- Designed and developed Apache NiFi jobs to move files from transactional systems into the data lake raw zone.
- Analyzed the user requirements and implemented the use cases using Apache NiFi.
- Designed and developed an enterprise-scale cloud alerting mechanism using Azure Databricks, the Spark data processing framework (Python/Scala) with the Spark UI, and Azure Data Factory. Built data pipelines to transform, aggregate, and process data using Azure Databricks, Azure ADLS, Blob Storage, Azure Delta, and Airflow.
- Migrated some of the existing pipelines to Azure Databricks using PySpark notebooks for the analytics team.
- Experience with a POC that involved PySpark scripting in Azure Databricks.
- Exposure to Databricks for generating PySpark scripts to automate reports.
- Worked on reading and writing multiple data formats such as JSON, ORC, and Parquet on HDFS using PySpark.
- Excellent working knowledge of data modeling tools such as Erwin, PowerDesigner, and ER/Studio.
- Experience working with the Snowflake data warehouse.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Designed a custom Spark REPL application to handle similar datasets.
- Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation.
- Performed Hive test queries on local sample files and HDFS files.
- Developed Hive queries to analyze data and generate results.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Worked on analyzing the Hadoop cluster and various Big Data analytics tools, including Pig, Hive, HBase, Spark, and Sqoop.
- Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization, and user report generation.
- Used Scala to write code for all Spark use cases.
- Analyzed user request patterns and implemented various performance optimization measures including implementing partitions and buckets in HiveQL.
- Assigned names to columns using case classes in Scala.
- Worked on migrating MapReduce programs into Spark transformations using Spark and Scala; the initial version was done in Python (PySpark).
- Involved in converting HQL scripts into Spark transformations using Spark RDDs with Python and Scala.
- Developed multiple Spark SQL jobs for data cleaning.
- Created Hive tables and worked on them using Hive QL.
- Assisted in loading large sets of data (structured, semi-structured, and unstructured) into HDFS.
- Developed Spark SQL to load tables into HDFS to run select queries on top.
- Developed analytical component using Scala, Spark, and Spark Stream.
- Used visualization tools such as Power View for Excel and Tableau to visualize data and generate reports.
- Worked on the NoSQL databases HBase and MongoDB.
- Used Sqoop to import and export data between Oracle/PostgreSQL and HDFS for analysis.
- Migrated Existing MapReduce programs to Spark Models using Python.
- Migrated data from the data lake (Hive) into an S3 bucket.
- Used the Spark DataFrame API on the Cloudera platform to perform analytics on Hive data.
- Designed batch processing jobs using Apache Spark that ran roughly ten times faster than the equivalent MapReduce jobs.
- Used Kafka for real-time data ingestion (see the streaming sketch after this list).
- Created separate topics in Kafka for reading the data.
- Read data from the different Kafka topics.
- Loaded data into the data warehouse used for generating reports.
- Written Hive queries for data analysis to meet the business requirements.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Created many Spark UDFs and Hive UDAFs for functions not available out of the box in Hive and Spark SQL.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Implemented various performance optimization techniques such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Good knowledge of Spark platform parameters such as memory, cores, and executors.
- Used ZooKeeper in the cluster to provide concurrent access to Hive tables with shared and exclusive locking.
- Experience in moving data between GCP and Azure using Azure Data Factory.
- Experience building Power BI reports on Azure Analysis Services for better performance.
- Experience with Amazon Elastic MapReduce and CDC (Change Data Capture).
- Used the Cloud SDK in the GCP Cloud Shell to configure services such as Dataproc, Cloud Storage, and BigQuery.
- Experience designing, building, and operating virtualized solutions on private, hybrid, and public cloud technologies.
- Reviewed the HDFS usage and system design for future scalability and fault-tolerance. Installed and configured Hadoop HDFS, MapReduce, Pig, Hive, Sqoop.
- Wrote Pig Scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
- Processed HDFS data and created external tables using Hive, in order to analyze visitors per day, page views and most purchased products.
- Exported analyzed data from HDFS using Sqoop for report generation.
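Illustrative example: the Kafka bullets above describe ingesting from multiple topics; below is a minimal Structured Streaming sketch of that pattern in PySpark (the original work may have used the older DStream-based Spark Streaming API). The broker address, topic names, and HDFS paths are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
# Minimal PySpark sketch of consuming Kafka topics with Structured Streaming.
# Broker address, topic names, and checkpoint/output paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

# Subscribe to multiple topics, as in the Kafka bullets above.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "orders,clicks")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast to strings before downstream processing.
parsed = events.select(
    col("topic"),
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("value"),
)

# Write micro-batches to HDFS as Parquet; Structured Streaming handles the batching.
query = (
    parsed.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streaming/events/")
    .option("checkpointLocation", "hdfs:///checkpoints/events/")
    .start()
)
query.awaitTermination()
```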
Environment: MapReduce, Hive, Pig, Sqoop, Oracle 11g/10g, MapR, Informatica, MicroStrategy, Cloudera Manager, Oozie, ZooKeeper, Hadoop, Java, Linux, Maven, Apache NiFi, MySQL, Spark, Databricks.
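Illustrative example: one bullet above describes loading CSV files with differing schemas into Hive ORC tables; the sketch below shows one way that pattern can look in PySpark. The feed directories and table name are hypothetical, and unionByName(allowMissingColumns=True) assumes Spark 3.1 or later.

```python
# Minimal PySpark sketch of loading CSV feeds with differing schemas into a Hive ORC table.
# Directory layout, table name, and schema-alignment strategy are assumptions for illustration.
from functools import reduce
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv-to-orc-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

# Each feed directory may carry a slightly different column set.
feed_paths = [
    "hdfs:///landing/feed_a/",
    "hdfs:///landing/feed_b/",
]

frames = [
    spark.read.option("header", "true").option("inferSchema", "true").csv(path)
    for path in feed_paths
]

# Align differing schemas by column name, filling missing columns with nulls (Spark 3.1+).
combined = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    frames,
)

# Persist into a Hive table stored as ORC.
combined.write.mode("append").format("orc").saveAsTable("raw_db.transactions")
```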