Azure Data Platform Engineer Resume

Framingham, MA

SUMMARY

  • Senior Hadoop developer with 7+ years of professional IT experience, including 4+ years of Big Data consulting experience with Hadoop ecosystem components covering ingestion, data modeling, querying, processing, storage, analysis, data integration, and implementation of enterprise-level Big Data systems.
  • Extensive professional experience in the full Software Development Life Cycle (SDLC), Agile methodology and maintenance across Azure, Azure Databricks (Spark), Hadoop, data warehousing, Scala, and Python.
  • Expertise in Hadoop (HDFS, MapReduce, YARN, Hive, Pig, HBase, Zookeeper, Oozie, & Sqoop), Spark (Spark Core, Spark SQL, Spark Streaming) and AWS services (Redshift, EMR, EC2, S3, CloudWatch, Lambda, Step Functions, Glue, & Athena).
  • A skilled developer with strong problem solving, debugging and analytical capabilities, who actively engages in understanding customer requirements.
  • Ability to work independently and collaboratively and to communicate effectively with non-technical coworkers.
  • Experience in working on various Hadoop data access components like MAPREDUCE, PIG, HIVE, HBASE, SPARK and KAFKA.
  • Experience handling Hive queries using Spark SQL integrated with the Spark environment.
  • Good knowledge of Hadoop data management components like HDFS and YARN.
  • Hands on experience in using various Hadoop workflow components like SQOOP, FLUME and KAFKA.
  • Worked on Hadoop data operation components like ZOOKEEPER and OOZIE.
  • Working knowledge of AWS technologies like S3 and EMR for storage, big data processing and analysis.
  • Good understanding of Hadoop security components like RANGER and KNOX.
  • Good experience working with Hadoop distributions such as HORTONWORKS and CLOUDERA.
  • Excellent programming skills at higher level of abstraction using SCALA and JAVA.
  • Experience in Java programming with skills in analysis, design, testing and deploying with various technologies like J2EE, JavaScript, JSP, JDBC, HTML, XML and JUNIT.
  • Good knowledge of Apache Spark components, including Spark Core, Spark SQL, Spark Streaming and Spark MLlib.
  • Experience in performing transformations and actions on Spark RDDs using Spark Core.
  • Experience in using broadcast variables, accumulator variables and RDD caching in Spark (see the sketch after this list).
  • Experience in troubleshooting cluster jobs using the Spark UI.
  • Experience working with Cloudera Distribution Hadoop (CDH) and Hortonworks Data Platform (HDP).
  • Expert in Hadoop and Big data ecosystem including Hive, HDFS, Spark, Kafka, MapReduce, Sqoop, Oozie and Zookeeper
  • Good knowledge of Hadoop cluster architecture and monitoring the cluster.
  • Hands-on experience in distributed systems technologies, infrastructure administration and monitoring configuration.
  • Expertise in data transformation and analysis using Spark and Hive.
  • Knowledge of writing Hive queries to generate reports using Hive Query Language.
  • Hands-on experience with Spark SQL for complex data transformations using the Scala programming language.
  • Developed Spark code using Python/Scala and Spark-SQL for faster testing and processing of data
  • Good knowledge of Hadoop architecture and various components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node and MapReduce concepts.
  • Extensive experience in data ingestion technologies like Flume, Kafka, Sqoop and NiFi
  • Utilized Flume, Kafka and NiFi to ingest real-time and near-real-time streaming data into HDFS from different data sources.
  • Good at analyzing data using HiveQL and custom MapReduce programs in Java.
  • Good Knowledge in working with AWS (Amazon Web Services) cloud platform
  • Good knowledge of Unix shell commands.
  • Experience in analyzing log files for Hadoop and ecosystem services, finding root causes, and setting up and managing the batch scheduler on Oozie.
  • Thorough knowledge of release management, CI/CD processes using Jenkins and configuration management using Visual Studio Online.
  • Experience in extracting data from RDBMS into HDFS using Sqoop ingestion and collecting logs from the log collector into HDFS using Flume.
  • Used Project Management services like JIRA for handling service requests and tracking issues.
  • Good experience with Software methodologies like Agile and Waterfall.
  • Experienced working with Zookeeper to provide coordination services to the cluster.
  • Skilled in Tableau 9 for data visualization, Reporting and Analysis
  • Extensively involved throughout the Software Development Life Cycle (SDLC), from initial planning through implementation of projects, using Agile and Waterfall methodologies.
  • Good team player with ability to solve problems, organize and prioritize multiple tasks.
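For illustration, a minimal PySpark sketch of the broadcast variable, accumulator and RDD caching usage referenced above (the lookup table, record format and values are hypothetical, not taken from any project described here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-accumulator-demo").getOrCreate()
sc = spark.sparkContext

# Broadcast a small lookup table so every executor gets one read-only copy
country_names = sc.broadcast({"US": "United States", "IN": "India"})

# Accumulator to count records that fail the lookup across all partitions
bad_records = sc.accumulator(0)

def enrich(record):
    code, amount = record
    name = country_names.value.get(code)
    if name is None:
        bad_records.add(1)
    return (name, amount)

rdd = sc.parallelize([("US", 10), ("IN", 20), ("XX", 5)])

# cache() keeps the enriched RDD in memory so both actions below reuse it
enriched = rdd.map(enrich).filter(lambda r: r[0] is not None).cache()

print(enriched.collect())                  # first action materializes and caches the RDD
print(enriched.count())                    # second action is served from the cache
print("records failing lookup:", bad_records.value)
```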

TECHNICAL SKILLS

Data Access Tools: HDFS, YARN, Hive, Pig, HBase, Solr, Impala, Spark Core, Spark SQL, Spark Streaming

Data Management: HDFS, YARN

Data Workflow: Sqoop, Flume, Kafka

Data Operation: Zookeeper, Oozie

Big Data Distributions: Hortonworks, Cloudera

Cloud Technologies: AWS (Amazon Web Services) EC2, S3, IAM, CloudWatch, DynamoDB, SNS, SQS, EMR, Kinesis

Programming & Languages: Java, Scala, Pig Latin, HQL, SQL, Shell Scripting, HTML, CSS, JavaScript

SDLC: Agile/SCRUM, Waterfall

PROFESSIONAL EXPERIENCE

Confidential, Framingham, MA

Azure Data Platform Engineer

Responsibilities:

  • Worked on the Talend ETL tool, using features like context variables and database components such as tOracleInput, tOracleOutput, tFileCompare, tFileCopy and tOracleClose.
  • Extracted data from the legacy system and loaded/integrated it into another database through the ETL process.
  • Transferred data from different data sources into HDFS using Kafka producers, consumers and Kafka brokers, and used Zookeeper as the coordinator between the different brokers in Kafka.
  • Performed data migrations from on-prem to Azure Data Factory and Azure Data Lake.
  • Used Kafka and Spark Streaming for data ingestion and cluster handling in real time processing.
  • Developed flow XML files using Apache NIFI, a workflow automation tool to ingest data into HDFS.
  • Involved in designing a Snowflake schema for the data warehouse and ODS architecture using data modeling tools such as Erwin Data Modeler.
  • Involved in maintaining and updating the metadata repository with details on the nature and use of applications and data transformations to facilitate impact analysis.
  • Developed integration checks around the PySpark framework for processing of large datasets.
  • Worked on migration of Pyspark framework into AWS Glue for enhanced processing.
  • Implemented OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
  • Experience in working with Cloudera (CDH4 & CDH5), Hortonworks, Amazon EMR and Azure HDInsight on multi-node clusters.
  • Created pipelines in ADF using Linked Services, Datasets and Pipelines to extract, transform and load data from different sources such as Azure SQL, Blob Storage and Azure SQL Data Warehouse, and to write data back to the source systems.
  • Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks (a minimal sketch of this pattern appears after this list).
  • Overwrote the Hive data with HBase data daily to keep the data fresh, and used Sqoop to load data from DB2 into the HBase environment.
  • Hadoop metadata management by extracting and maintaining metadata from Hive tables with Hive QL.
  • Worked with importing metadata into Hive & migrated existing tables, applications to work on Hive and Spark.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Worked with Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark-SQL, DataFrames, RDDs and Spark on YARN.
  • Designed and developed automation test scripts using Python, analyzed the SQL scripts and designed the solution for implementation using PySpark.
  • Developed a data flow to pull data from a REST API using Apache NiFi with context configuration enabled, and developed entire Spark applications in Python (PySpark) on a distributed environment. Implemented a microservices architecture using the Spring Boot framework.
  • Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Spark and Sqoop jobs.
  • Implemented workflows using Apache Oozie framework to automate tasks.
  • Imported and exported data into HDFS and Hive/Impala tables from Relational Database Systems using Sqoop.
  • Developed a data pipeline using Flume, Sqoop and Pig to extract data from weblogs and store it in HDFS.
  • Collected and aggregated large amounts of log data using Flume and tagging data in HDFS for further analysis.
  • Developed Java MapReduce programs for the analysis of sample log files stored in the cluster.
  • Worked on developing Pig scripts for change data capture and delta record processing between newly arrived data and existing data in HDFS.
  • Developed a CI/CD system with Jenkins on a Kubernetes container environment, utilizing Kubernetes and Docker for the CI/CD system to build, test and deploy.
  • Implemented cluster services using Docker and Kubernetes to manage local deployments in Kubernetes by building a self-hosted Kubernetes cluster using Terraform and Ansible, deploying application containers.
  • Implemented authentication and authorization using Spring Security.
  • Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and Azure cloud.
  • Configured Azure Container Registry for building and publishing Docker container images and deployed them into Azure Kubernetes Service (AKS).
  • Working knowledge of Natural Language Processing (NLP) and Natural Language Generation (NLG) using Python. Worked collaboratively with customers and team members supporting large business initiatives.
  • Knowledge of Information Extraction, NLP algorithms coupled with Deep Learning.
  • Developed a Spark job which indexes data into ElasticSearch from external Hive tables which are in HDFS.
  • Worked on migrating the application into REST-based microservices to provide all CRUD capabilities using Spring Boot, and wrote microservices to export/import data and schedule tasks using Spring Boot and Hibernate.
  • Utilized machine learning algorithms: Linear regression, Naive Bayes, Random Forests, KNN for data analysis
  • Built and deployed Java application into multiple Unix based environments and produced both unit and Functional test results with release notes.
  • Performed sentiment analysis in Python by implementing NLP techniques: web scraping, text vectorization, data wrangling, Bag of Words and TF-IDF scoring to compute the sentiment score and analyze the reviews.
  • Performed data wrangling, data imputation and EDA using pandas, NumPy, scikit-learn and Matplotlib in Python.
  • Worked on Microsoft Azure toolsets including Azure Data Factory pipelines, Azure Databricks and Azure Data Lake Storage.
  • Extensively used Agile methodology as the organization standard to implement the data models.
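For illustration, a minimal PySpark sketch of the ADLS-to-Databricks ingestion pattern referenced above (the storage account, container, columns and table names are hypothetical placeholders, and storage authentication is assumed to be configured on the cluster):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adls-to-databricks").getOrCreate()

# Read raw CSV landed in Azure Data Lake Storage Gen2 (abfss path is a placeholder)
raw = (spark.read
       .option("header", "true")
       .csv("abfss://raw@examplestorageacct.dfs.core.windows.net/sales/"))

# Basic cleansing and derived columns with Spark SQL functions
cleaned = (raw
           .dropDuplicates(["order_id"])
           .withColumn("order_ts", F.to_timestamp("order_ts"))
           .withColumn("order_date", F.to_date("order_ts")))

# Persist as a partitioned table for downstream Spark SQL / reporting consumption
(cleaned.write
 .mode("overwrite")
 .partitionBy("order_date")
 .saveAsTable("curated.sales_orders"))
```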

Environment: Hadoop 2.x, HDFS, MapReduce, PySpark, Spark SQL, ETL, Hive, Pig, Oozie, Databricks, Java, Spring, Sqoop, Azure, Star Schema, Python, Nifi, Cassandra, Scala, Power BI, Machine Learning.

Confidential, Plano Texas

Big Data Engineer

Responsibilities:

  • Designed and developed Extract, Transform, and Load (ETL) code using Informatica mappings to load data from heterogeneous source systems (flat files, XMLs, MS Access files, Oracle) into the target Oracle system, first to Stage, then to the data warehouse, and then to Data Mart tables for reporting.
  • Created Data mappings, Tech Design, loading strategies for ETL to load newly created or existing tables.
  • Worked with Kafka to build a robust and fault-tolerant data ingestion pipeline for transporting streaming data into HDFS (see the sketch after this list), and implemented custom Kafka encoders for a custom input format to load data into Kafka partitions.
  • Created Kafka broker for structured streaming to get structured data by schema.
  • Extracted real-time guest data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, processing them and pushing them into Cassandra using the DataStax Spark-Cassandra connector.
  • Developed an Elasticsearch connector using the Kafka Connect API with Kafka as the source and Elasticsearch as the sink.
  • Worked on performance tuning of Apache NiFi workflows to optimize the data ingestion speed.
  • Integrated Map Reduce with HBase to import bulk amount of data into HBase using Map Reduce Programs.
  • Developed numerous MapReduce jobs for Data Cleansing and Analyzing Data in Impala.
  • Designed appropriate Partitioning/Bucketing schema in HIVE for efficient data access during analysis and designed a data warehouse using Hive external tables and created Hive queries for analysis.
  • Configured the Hive metastore with MySQL to store the metadata for Hive tables, and used Hive to analyze data ingested into HBase using Hive-HBase integration.
  • Worked on migration of an existing feed from Hive to Spark to reduce latency of feeds in existing HiveQL.
  • Developed Oozie Workflows for daily incremental loads to get data from Teradata and import into Hive tables.
  • Validated, manipulated and performed exploratory data analysis using Python and its data-specific libraries pandas and PySpark, interpreting and extracting meaningful insights from large data sets consisting of millions of records.
  • Retrieved data from Hadoop Cluster by developing a pipeline using Hive (HQL), SQL to retrieve data from Oracle database and used Extract, Transform, and Load (ETL) for data transformation.
  • Worked with Flume for building fault tolerant data Ingestion pipeline for transporting streaming data into HDFS.
  • Installed and configured Hive, wrote Hive UDFs, and used Piggybank, a repository of UDFs for Pig Latin.
  • Applied advanced Spark procedures such as text analytics and processing using in-memory processing.
  • Used the Spring framework for dependency injection and integrated it with Hibernate.
  • Developed multiple Spark batch jobs using Spark SQL, performed transformations using many APIs, and updated master data in the Cassandra database as per the business requirements.
  • Developed data models and data migration strategies utilizing concepts of snowflake schema.
  • Worked on data pre-processing and cleaning to perform feature engineering, and applied data imputation techniques for the missing values in the dataset using Python.
  • Implemented Data Interface to get information of customers using Rest API and Pre-Process data using MapReduce 2.0 and store into HDFS (Hortonworks).
  • Set up Docker to automate container deployment through Jenkins, and worked with Docker Hub, building Docker images and maintaining various images, primarily for middleware installations.
  • Configured applications that run as multi-container Docker applications by utilizing the Docker Compose tool, which uses a file configured in YAML format, and used Kubernetes to manage containerized applications using its nodes, ConfigMaps, selectors and Services, deploying application containers as Pods.
  • Used Jenkins pipelines to drive all microservices builds out to the Docker registry and then deployed them to Kubernetes.
  • Used Spark-SQL to Load Parquet data and created Datasets defined by Case classes and handled structured data using Spark SQL which were finally stored into Hive tables for downstream consumption.
  • Experience in developing PySpark applications using Spark SQL in Databricks for data extraction, transformation and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Developed and deployed a Spark application using PySpark to compute a popularity score for all the content using an algorithm, and loaded the data into Elasticsearch for the app content management team to consume.
  • Converted the data science model developed by data scientists from Python to PySpark and operationalized it using Python and shell scripts to automate the process of running the model on new data as required and saving the results to final Phoenix tables.
  • Used a microservices architecture with Spring Boot based services interacting through REST.
  • Extensively involved in creating and designing programs involving procedures, triggers and sequences to access Oracle, and used a microservices architecture with Spring Boot services interacting through REST, leveraging AWS to build, test and deploy the microservices.
  • Used Tableau to convey the results, using dashboards to communicate with team members and with other data science, marketing and engineering teams.
  • Generated data cubes using Hive, Pig and Java MapReduce on a provisioned Hadoop cluster in AWS.
  • Expertise in Performance Tuning Tableau Dashboards and Reports built on huge sources.
  • Used AWS EMR to process big data across Hadoop clusters of virtual servers on Amazon Simple Storage Service (S3).
  • Expertise in AWS data migration between different database platforms like Local SQL Server to Amazon RDS, EMR HIVE and experience in managing and reviewing Hadoop log files in AWS S3.
  • Built and supported several AWS multi-server environments using Amazon EC2, EMR, EBS and Redshift, and deployed the Big Data Hadoop application on the AWS cloud.
  • Provided support on AWS Cloud infrastructure automation with multiple tools including Gradle, Chef, Nexus, Docker and monitoring tools such as Splunk and CloudWatch.
  • Designed and developed ETL processes in AWS Glue to migrate campaign data from external sources such as S3 (ORC/Parquet/text files) into AWS Redshift.
  • Worked extensively with importing metadata into Hive using Python and migrated existing tables and applications to work on AWS cloud (S3).
  • Data extraction, aggregation and consolidation of Adobe data within AWS Glue using PySpark.
  • Implemented Serverless architecture using AWS Lambda with Amazon S3 and Amazon Dynamo DB.
  • Scheduled clusters with CloudWatch and created Lambda functions to generate operational alerts for various workflows.
  • Worked on AWS EC2, IAM, S3, LAMBDA, EBS, Elastic Load balancer (ELB), auto scaling group services.
  • Involved in Agile methodologies, daily Scrum meetings and sprint planning, with strong experience in the SDLC.
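For illustration, a minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS ingestion pattern referenced above (broker address, topic, schema and paths are hypothetical placeholders; the pipeline described above also used DStreams and the DataStax Spark-Cassandra connector, which are not shown here, and the spark-sql-kafka connector is assumed to be on the classpath):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Hypothetical schema for the incoming guest events
guest_schema = StructType([
    StructField("guest_id", StringType()),
    StructField("event", StringType()),
    StructField("amount", DoubleType()),
])

# Read the raw Kafka stream and parse the JSON payload against the schema
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "guest-events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), guest_schema).alias("e"))
          .select("e.*"))

# Land the parsed stream on HDFS as Parquet for downstream batch processing
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/guest_events")
         .option("checkpointLocation", "hdfs:///checkpoints/guest_events")
         .outputMode("append")
         .start())

query.awaitTermination()
```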

Environment: Hadoop 2.x, HDFS, MapReduce, Apache Spark, Spark SQL, Spark Streaming, Scala, Java, Spring, Pig, Hive, Oozie, Sqoop, Kafka, Flume, Nifi, Zookeeper, Informatica, Databricks, MongoDB, AWS, Python, Linux, Snowflake, Tableau.

Confidential, Chicago, IL

Spark/Hadoop Developer

Responsibilities:

  • Responsible for collecting, cleaning and storing data for analysis using Kafka, Sqoop, Spark and HDFS.
  • Used Kafka and Spark framework for real time and batch data processing
  • Ingested large amount of data from different data sources into HDFS using Kafka
  • Implemented Spark using Scala and performed cleansing of data by applying Transformations and Actions
  • Used case classes in Scala to convert RDDs into DataFrames in Spark.
  • Processed and analyzed data stored in HBase and HDFS.
  • Developed Spark jobs using Scala on top of Yarn for interactive and Batch Analysis.
  • Developed UNIX shell scripts to load large number of files into HDFS from Linux File System.
  • Experience in querying data using Spark SQL for faster processing of the data sets.
  • Offloaded data from EDW into Hadoop Cluster using Sqoop.
  • Developed Sqoop scripts for importing and exporting data into HDFS and Hive
  • Created Hive internal and external tables with partitioning and bucketing for further analysis in Hive (see the sketch after this list).
  • Used Oozie workflow to automate and schedule jobs
  • Used Zookeeper for maintaining and monitoring clusters
  • Exported the data into an RDBMS using Sqoop for the BI team to perform visualization and generate reports.
  • Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
  • Used JIRA for project tracking and participated in daily scrum meetings
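For illustration, a minimal PySpark sketch of the Hive partitioning and bucketing pattern referenced above (database, table and column names are hypothetical; Hive support is assumed to be enabled on the cluster):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-bucket")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS staging")
spark.sql("CREATE DATABASE IF NOT EXISTS curated")

# External table over raw Parquet files in HDFS, partitioned by load date
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS staging.web_events (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///data/raw/web_events'
""")

# Managed table bucketed by user_id to speed up joins and sampling
spark.sql("""
    CREATE TABLE IF NOT EXISTS curated.web_events_bucketed (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (user_id) INTO 32 BUCKETS
    STORED AS ORC
""")
```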

Confidential

Python Developer

Responsibilities:

  • Developed and designed Python based API (RESTful Web Service) to interact with company’s website.
  • Wrote Python code and actively participated in the procedures to automate processes.
  • Built and tested functionality within a production pipeline.
  • Implemented Python code to fix bugs and provide upgrades to existing functionality.
  • Provided fault isolation and root cause analysis for technical problems.
  • Highly efficient in handling multi-tasking issues in a fast-paced environment.
  • Created Business Logic using Python to create Planning and Tracking functions.
  • Used the pandas data analysis library for statistical analysis and NumPy for numerical analysis.
  • Developed multi-threaded standalone app in Python, PHP, C++ to view Circuit parameters and performance.
  • Developed Business Logic using Python on Django Web Framework.
  • Designed and managed API system deployment using a fast HTTP server and Amazon AWS architecture.
  • Developed tools using Python, shell scripting and XML to automate some of the menial tasks.
  • Developed internal auxiliary web apps using the Python Flask framework with Angular.js and the Twitter Bootstrap CSS/HTML framework (see the sketch after this list).
  • Used Django configuration to manage URLs and application parameters.
  • Created PyUnit test scripts and used for unit testing.
  • Developed Merge jobs in Python to extract and load data into MySQL database.
  • Developed user interfaces using HTML5 and JavaScript.
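For illustration, a minimal Flask sketch of the kind of internal REST endpoint referenced above (route names and the in-memory store are hypothetical placeholders, not the company's actual API):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)
_tasks = {}  # in-memory stand-in for a real database

@app.route("/api/tasks", methods=["POST"])
def create_task():
    # Create a task record from the JSON body and return it with a 201 status
    payload = request.get_json(force=True)
    task_id = len(_tasks) + 1
    _tasks[task_id] = {"id": task_id, "name": payload.get("name")}
    return jsonify(_tasks[task_id]), 201

@app.route("/api/tasks/<int:task_id>", methods=["GET"])
def get_task(task_id):
    # Look up a single task, returning 404 if it does not exist
    task = _tasks.get(task_id)
    if task is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(task)

if __name__ == "__main__":
    app.run(debug=True)
```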

Environment: Python, Django, Python SDK, AWS, Flash, PHP, Numpy, Pandas, PyQuery, DOM, Bootstrap, XML, HTML5, JavaScript, Angular.js, JSON, Rest, Apache Web Server, Git Hub, MySQL, LINUX.
