
Azure Data Engineer Resume


Charlotte, NC

SUMMARY

  • 8 years of IT experience in analysis, design, and development with Big Data technologies such as Spark, MapReduce, Hive, YARN, and HDFS, using programming languages including Java, Scala, and Python; 4 years as a Data Warehouse Developer and 5 years as a Data Engineer.
  • In-depth knowledge of working with distributed computing systems and parallel processing techniques to deal efficiently with Big Data.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SES, SQS, Lambda, EMR, and other services of the AWS family.
  • Experience working with Cloudera (CDH 4 & 5), Hortonworks, and AWS services such as Amazon EMR, Lambda, and Kinesis Data Streams to fully leverage and implement new Hadoop features.
  • Worked on AWS SaaS, PaaS, hybrid cloud, and Google Cloud platforms.
  • Experience with Python and SQL on the AWS cloud platform, with a solid understanding of data warehouses such as Snowflake and the Databricks platform.
  • Firm understanding of Hadoop architecture and its components, including HDFS, YARN, MapReduce, Hive, Pig, HBase, Kafka, and Oozie.
  • Strong experience building Spark applications using Python as the programming language.
  • Good experience troubleshooting and fine-tuning long-running Spark applications.
  • Strong experience using the Spark RDD API, Spark DataFrame/Dataset API, Spark SQL, and Spark ML frameworks for building end-to-end data pipelines (a brief illustrative sketch follows this list).
  • Good experience working with real-time streaming pipelines using Kafka and Spark Streaming.
  • Strong experience working with Hive for performing various data analyses.
  • Good experience automating end-to-end data pipelines using the Oozie workflow orchestrator.
  • Strong experience leading multiple Azure Big Data and data transformation implementations in the healthcare domain.
  • Worked on Docker-based containers for running Airflow.
  • Expertise in configuring and installing PostgreSQL and Postgres Plus Advanced Server for OLTP and OLAP systems, from high-end to low-end environments.
  • Detailed exposure to Azure tools such as Azure Data Lake, Azure Databricks, Azure Data Factory, HDInsight, Azure SQL Server, and Azure DevOps.
  • Knowledge of setting up and maintaining Postgres master-slave clusters using streaming replication.
  • Exposure to Cloudera installation on Azure cloud instances.
  • Experience in implementing OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse, Azure Data Factory, ADLS, Databricks, SQL DW.
  • Experience in analyzing, designing, and developing ETL Strategies and processes, writing ETL specifications.
  • Excellent understanding of NoSQL databases such as HBase, Cassandra, and MongoDB.
  • Proficient knowledge of and hands-on experience in writing shell scripts in Linux.
  • Experienced in requirement analysis, application development, application migration and maintenance using Software Development Lifecycle (SDLC) and Python technologies.
  • Excellent technical and analytical skills with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.
  • Adequate knowledge and working experience in Agile and Waterfall Methodologies.
  • Defined user stories and drove the agile board in JIRA during project execution; participated in sprint demos and retrospectives.
  • Completed POCs on newly adopted technologies such as Apache Airflow, Snowflake, and GitLab.
  • Good interpersonal and communication skills, strong problem-solving skills, ease in exploring and adopting new technologies, and a strong team player.
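
As a concrete illustration of the Spark DataFrame API and Spark SQL pipeline experience noted above, the following is a minimal PySpark sketch; the file paths, column names, and output location are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("end-to-end-pipeline").getOrCreate()

    # Ingest raw CSV data (hypothetical path and schema)
    raw = (spark.read
           .option("header", "true")
           .option("inferSchema", "true")
           .csv("/data/raw/transactions.csv"))

    # DataFrame API: basic cleansing
    clean = raw.dropDuplicates().filter(F.col("amount").isNotNull())

    # Spark SQL: expose the DataFrame as a view and aggregate
    clean.createOrReplaceTempView("transactions")
    summary = spark.sql("""
        SELECT customer_id, SUM(amount) AS total_amount
        FROM transactions
        GROUP BY customer_id
    """)

    # Persist the results as Parquet for downstream consumers
    summary.write.mode("overwrite").parquet("/data/curated/transaction_summary/")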

PROFESSIONAL EXPERIENCE

Azure Data Engineer

Confidential, Charlotte, NC

Responsibilities:

  • Followed the Agile methodology for data warehouse development using Kanbanize.
  • Developed data pipeline using Spark, Hive and HBase to ingest customer behavioral data and financial histories into Hadoop cluster for analysis.
  • Worked on the Azure Databricks cloud, organizing data into notebooks and making it easy to visualize data using dashboards.
  • Performed ETL on data from different source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks (a brief illustrative sketch follows this list).
  • Managed Spark clusters on Azure Databricks through proper troubleshooting, estimation, and monitoring of the clusters.
  • Implemented data ingestion from various source systems using Sqoop and PySpark.
  • Gained hands-on experience tuning the performance of Spark and Hive jobs.
  • Performed data aggregation and validation on Azure HDInsight using Spark scripts written in Python.
  • Performed monitoring and management of the Hadoop cluster by using Azure HDInsight.
  • Involved in extraction, transformation and loading of data directly from different source systems (flat files/Excel/Oracle/SQL) using SAS/SQL, SAS/macros.
  • Generated PL/SQL scripts for data manipulation, validation, and materialized views for remote instances.
  • Created partitioned tables in Hive, designed a data warehouse using Hive external tables, and wrote Hive queries for analysis.
  • Created and modified several database objects such as Tables, Views, Indexes, Constraints, Stored Procedures, Packages, Functions, and Triggers using SQL and PL/SQL.
  • Created large datasets by combining individual datasets using various inner and outer joins in SAS/SQL and dataset sorting and merging techniques using SAS/Base.
  • Extensively worked on Shell scripts for running SAS programs in batch mode on UNIX.
  • Wrote Python scripts to parse XML documents and load the data into a database.
  • Used Python to extract weekly information from XML files.
  • Integrated NiFi with Snowflake to optimize client session handling.
  • Used Hive, Impala and Sqoop utilities and Oozie workflows for data extraction and data loading.
  • Performed File system management and monitoring on Hadoop log files.
  • Used Spark API over Hadoop YARN to perform analytics on data in Hive.
  • Created stored procedures to import data into the Elasticsearch engine.
  • Used Spark SQL to process huge amounts of structured data to aid better analysis for our business teams.
  • Implemented optimized joins across different data sets to get top claims by state using MapReduce.
  • Created HBase tables to store data in various formats coming from different sources.
  • Responsible for importing log files from various sources into HDFS using Flume.
  • Worked on SAS Visual Analytics & SAS Web Report Studio for data presentation and reporting.
  • Extensively used SAS/Macros to parameterize the reports so that the user could choose the summary and sub-setting variables to be used from the web application.
  • Responsible for translating business and data requirements into logical data models in support of enterprise data models, ODS, OLAP, OLTP, and operational data structures.
  • Created SSIS packages to migrate data from heterogeneous sources such as MS Excel, flat files, and CSV files.
  • Provided thought leadership for the architecture and design of Big Data analytics solutions for customers, and actively drove Proof of Concept (POC) and Proof of Technology (POT) evaluations to implement Big Data solutions.
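
A minimal sketch of the Databricks ingestion pattern referenced in the ETL bullet above, reading raw data from Azure Data Lake Storage and writing a date-partitioned table; the storage account, container, paths, and table name are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("adls-ingestion").getOrCreate()

    # Read raw customer events landed in Azure Data Lake Storage Gen2 (hypothetical account/container)
    raw = spark.read.json("abfss://raw@mydatalake.dfs.core.windows.net/customer_behavior/")

    # Light transformation before loading the curated zone
    curated = (raw
               .dropDuplicates(["customer_id", "event_ts"])
               .withColumn("ingest_date", F.current_date()))

    # Write a partitioned table so downstream Hive/Spark SQL queries can prune by date
    (curated.write
            .mode("append")
            .partitionBy("ingest_date")
            .saveAsTable("analytics.customer_behavior"))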

Environment: Azure Data Factory (V2), Azure Databricks, Pyspark, Snowflake, Azure SQL, Azure Data Lake, Azure Blob Storage, Azure ML

Data Engineer

Confidential, Weehawken, NJ

Responsibilities:

  • Responsible for maintaining quality reference data in source by performing operations such as cleaning, transformation and ensuring Integrity in a relational environment by working closely with the stakeholders & solution architect.
  • Designed and developed Security Framework to provide fine grained access to objects in AWS S3 using AWS Lambda, Dynamo DB.
  • Set up and worked on Kerberos authentication principals to establish secure network communication on cluster and testing of HDFS, Hive, Pig and Map Reduce to access cluster for new users.
  • Performed end-to-end architecture and implementation assessments of various AWS services such as Amazon EMR, Redshift, and S3.
  • Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
  • Used the Spark SQL Scala and Python interfaces, which automatically convert RDDs of case classes to schema RDDs.
  • Imported data from different sources such as HDFS and HBase into Spark RDDs and performed computations using PySpark to generate the output response.
  • Created Lambda functions with Boto3 to deregister unused AMIs in all application regions to reduce the cost of EC2 resources (a brief illustrative sketch follows this list).
  • Importing & exporting database using SQL Server Integrations Services (SSIS) and Data Transformation Services (DTS Packages).
  • Coded Teradata BTEQ scripts to load, transform data, fix defects like SCD 2 date chaining, cleaning up duplicates.
  • Developed reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects.
  • Conducted Data blending, Data preparation using Alteryx and SQL for Tableau consumption and publishing data sources to Tableau server.
  • Developed Kibana dashboards based on Logstash data and integrated different source and target systems into Elasticsearch for near-real-time log analysis and monitoring of end-to-end transactions.
  • Integrated Apache Airflow with AWS to monitor multi-stage ML workflows with tasks running on Amazon SageMaker.
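
A minimal sketch of the AMI-cleanup Lambda described above, using Boto3 to deregister account-owned AMIs that no existing instance references; the region list is a placeholder and pagination is omitted for brevity.

    import boto3

    REGIONS = ["us-east-1", "us-west-2"]  # hypothetical application regions

    def lambda_handler(event, context):
        for region in REGIONS:
            ec2 = boto3.client("ec2", region_name=region)

            # AMIs owned by this account
            images = ec2.describe_images(Owners=["self"])["Images"]

            # Image IDs still referenced by existing instances
            in_use = {
                inst["ImageId"]
                for res in ec2.describe_instances()["Reservations"]
                for inst in res["Instances"]
            }

            for image in images:
                if image["ImageId"] not in in_use:
                    ec2.deregister_image(ImageId=image["ImageId"])

        return {"status": "done"}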

Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, Dynamo DB, Amazon Sage Maker, Apache Spark, HBase, Apache Kafka, HIVE, SQOOP, Map Reduce, Snowflake, Apache Pig, Python, SSRS, Tableau

Data Engineer

Confidential, St.Louis, MO

Responsibilities:

  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Created the automated build and deployment process for the application, along with the application setup, for a better user experience, leading up to building a continuous integration system.
  • Developed PySpark scripts to encrypt raw data by applying hashing algorithms to client-specified columns (a brief illustrative sketch follows this list).
  • Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Developed data pipelines using Spark, Hive, and Python to ingest customer data.
  • Worked on migrating Map Reduce programs into Spark transformations using Python.
  • Worked on Spark data sources (Hive, JSON files), Spark DataFrames, Spark SQL, and Spark Streaming using Python.
  • Developed Spark scripts by writing custom RDD transformations in Python and performing actions on RDDs.
  • Implemented Kafka and Spark Structured Streaming for real-time data ingestion.
  • Analyzed large amounts of data sets to determine optimal way to aggregate and report on it using Map Reduce programs.
  • Involved in designing and optimizing Spark SQL queries and DataFrames, importing data from data sources, performing transformations, and storing the results in an output directory in AWS S3.
  • Configured Spark streaming to receive real time data from the Kafka and store the stream data to HDFS.
  • Created spark jobs to apply data cleansing/data validation rules on new source files in inbound bucket and reject records to reject-data S3 bucket.
  • Involved in converting Hive/SQL queries into transformations using Python and performed complex joins on tables in hive with various optimization techniques.
  • Created Hive tables as per requirements, internal or external tables defined with appropriate static and dynamic partitions, intended for efficiency.
  • Worked extensively with Hive DDLs and Hive Query Language (HQL).
  • Involved in creating Hive tables, loading with data and writing hive queries which will run internally in Map Reduce way.
  • Designed and implemented data loading and aggregation frameworks and jobs that will be able to handle hundreds of GBs of json files, using Spark, Airflow and Snowflake.
  • Worked on Snowflake environment to remove redundancy and load real time data from various data sources into HDFS using Spark.
  • Used Kubernetes to orchestrate the deployment, scaling and management of Docker Containers.
  • Developed Kafka producers and consumers, HBase clients, and Spark jobs using Python, along with components on HDFS and Hive.
  • Created Kafka producer API to send live-stream data into various Kafka topics and developed Spark- Streaming applications to consume the data from Kafka topics and to insert the processed streams to HBase.
  • Extracted, transformed and loaded data from various heterogeneous data sources and destinations using AWS.
  • Hands on experience on architecting the ETL transformation layers and writing spark jobs to do the processing.
  • Prepared ETL design document which consists of the database structure, change data capture, Error handling, restart and refresh strategies.
  • Worked in Production Environment which involves building CI/CD pipeline using Jenkins with various stages starting from code checkout from GitHub to Deploying code in specific environment.
  • Developed AWS CloudFormation templates, set up Auto Scaling for EC2 instances, and was involved in the automated provisioning of the AWS cloud environment using Jenkins.
  • Created automated pipelines in AWS Code Pipeline to deploy Docker containers in AWS ECS using S3.
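
The column-hashing work mentioned above can be sketched as follows; the input path and the client-specified columns are hypothetical, and SHA-256 stands in for whichever hashing algorithm the client actually required.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("hash-sensitive-columns").getOrCreate()

    # Hypothetical raw input and client-specified sensitive columns
    raw = spark.read.parquet("s3a://inbound-bucket/raw/customers/")
    sensitive_columns = ["ssn", "email", "phone"]

    # Replace each sensitive column with its SHA-256 digest
    hashed = raw
    for col_name in sensitive_columns:
        hashed = hashed.withColumn(col_name, F.sha2(F.col(col_name).cast("string"), 256))

    hashed.write.mode("overwrite").parquet("s3a://curated-bucket/hashed/customers/")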

Environment: Spark, Hive, HBase, Sqoop, Flume, ADF, Blob, cosmos DB, Map Reduce, HDFS, Cloudera, SQL, Apache Kafka, AWS, S3, Kubernetes, Python, Unix.

Hadoop Developer

Confidential

Responsibilities:

  • Installed and configured Hive, HDFS, and NiFi; implemented an HDP Hadoop cluster. Assisted with performance tuning and monitoring.
  • Involved in loading and transforming large sets of structured data from router location to EDW using a NIFI data pipeline flow.
  • Developed PySpark code and Spark-SQL for faster testing and processing of data.
  • Worked on data serialization formats for converting complex objects into sequences of bytes using Parquet, ORC, Avro, JSON, and CSV formats (a brief illustrative sketch follows this list).
  • Created Hive tables to load large data sets of structured data coming from WADL after transformation of raw data.
  • Created reports for the BI team using SQOOP to export data into HDFS and Hive.
  • Developed custom NIFI processors for parsing the data from XML to JSON format and filter broken files.
  • Created Hive queries to spot trends by comparing fresh data with EDW reference tables and historical metrics.
  • Used PySpark to convert pandas DataFrames to Spark DataFrames.
  • Used the KafkaUtils module in PySpark to create an input stream that pulls messages directly from the Kafka broker.
  • Worked on partitioning Hive tables and running scripts parallel to reduce run time of the scripts.
  • Extensively worked on creating an End-to-End data pipeline orchestration using NIFI.
  • Implemented business logic by writing UDFs in Spark Scala and configuring CRON Jobs.
  • Provided design recommendations and resolved technical problems.
  • Assisted with data capacity planning and node forecasting.
  • Involved in performance tuning and troubleshooting Hadoop cluster.
  • Developed HCatalog Streaming code to stream JSON data into Hive (EDW) continuously.
  • Administered Hive and Kafka, installing updates, patches, and upgrades.
  • Supported code/design analysis, strategy development and project planning.
  • Managed and reviewed Hadoop log files.
  • Evaluated suitability of Hadoop and its ecosystem to project and implemented various proof of concept applications to eventually adopt them to benefit from the Hadoop initiative.
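
As a rough illustration of the serialization-format work above, the sketch below reads CSV and writes the same data out as Parquet, ORC, JSON, and Avro; the paths are hypothetical, and the Avro write assumes the external spark-avro package is on the classpath.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("format-conversion").getOrCreate()

    # Hypothetical structured input
    df = spark.read.option("header", "true").csv("/data/raw/router_events.csv")

    # Columnar formats suited to analytical workloads
    df.write.mode("overwrite").parquet("/data/out/parquet/")
    df.write.mode("overwrite").orc("/data/out/orc/")

    # Row-oriented text format
    df.write.mode("overwrite").json("/data/out/json/")

    # Avro requires the external spark-avro package (e.g. --packages org.apache.spark:spark-avro_2.12:<version>)
    df.write.mode("overwrite").format("avro").save("/data/out/avro/")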

Environment: Spark, Scala, Hive, Maven, Microsoft Azure services like HDInsight, BLOB, ADLS, Azure Data Bricks, Azure Data Lake, Micro services, GitHub, Splunk, PySpark.

Data Warehouse Developer

Confidential

Responsibilities:

  • Assisted in gathering business requirements from end users.
  • Coordinated tasks with onsite and offshore team members in India.
  • Developed Shared Containers for reusability in all the jobs for several projects.
  • Worked with Metadata Definitions, Import and Export of Data stage jobs using Data Stage Manager.
  • Involved in creating different project parameters using Administrator.
  • Used Data Stage Director for running, monitoring and scheduling the Jobs.
  • Used audit sequence jobs to run each and every Data Stage job, generating an audit key to be used for audit purposes.
  • Familiar with import/export of Data Stage Components (Jobs, DS Routines, DS Transforms, Table Definitions etc.) between Data Stage Projects, use of Dataset Management (DSM) utility and multiple job compile utility with use of Data Stage Manager.
  • Used shared containers to simplify design and ease maintenance.
  • Experienced in fine-tuning, troubleshooting, bug fixing, defect analysis, and testing of Data Stage jobs.
  • Used Unix shell scripts to invoke Data Stage jobs from the command line (a brief illustrative sketch follows this list).
  • Involved in Unit testing, Functional testing and Integration testing. Designed the Target Schema definition and Extraction, Transformation and Loading (ETL) using Data stage.
  • Developed SQL queries to perform DDL, DML, and DCL operations.
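
A rough sketch of how such a command-line invocation might be wrapped, shown here in Python rather than shell for consistency with the other sketches; the project and job names are placeholders, and the exact dsjob options depend on the DataStage version.

    import subprocess
    import sys

    def run_datastage_job(project, job):
        """Invoke a DataStage job via the dsjob CLI and fail loudly on a non-zero exit."""
        # -run starts the job; -jobstatus waits for completion and reports the job status
        cmd = ["dsjob", "-run", "-jobstatus", project, job]
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(result.stdout)
        if result.returncode != 0:
            print(result.stderr, file=sys.stderr)
            raise RuntimeError(f"DataStage job {project}/{job} failed with code {result.returncode}")

    if __name__ == "__main__":
        run_datastage_job("SALES_DW", "LoadCustomerDim")  # hypothetical project/job names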

Environment: IBM WebSphere Data Stage EE/7.0/6.0 (Manager, Designer, Director, Administrator), Ascential Profile Stage 6.0, Ascential Quality Stage 6.0, Erwin, TOAD, Autosys, Oracle 9i, PL/SQL, SQL, Unix Shell Scripts, Sun Solaris, Windows 2000.
