
Azure Data Engineer Resume


Tampa, FL

SUMMARY

  • Around 8 years of total IT experience, including over 5 years of Big Data/Hadoop experience, with a background in the development and design of Java-based enterprise applications.
  • Extensive working experience with Hadoop ecosystem components such as HDFS, MapReduce, Hive, Sqoop, Flume, Spark, Kafka, Oozie, and Zookeeper.
  • Implemented performance tuning techniques for Spark-SQL queries.
  • Strong knowledge of HDFS architecture and the MapReduce (MRv1) and YARN (MRv2) frameworks.
  • Strong hands-on experience publishing messages to various Kafka topics using Apache NiFi and consuming them into HBase using Spark and Python.
  • Experience developing Spark applications with Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
  • Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
  • Experience with MS SQL Server Integration Services (SSIS), T-SQL, stored procedures, and triggers.
  • Design and develop Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats (a minimal sketch follows this list).
  • Design and implement database solutions in Azure SQL Data Warehouse and Azure SQL Database.
  • Experience with Azure Data Factory (ADF) and Integration Runtime (IR) for file-system and relational data ingestion.
  • Created Spark jobs that process the source files and performed various transformations on the source data using the Spark DataFrame and Spark SQL APIs.
  • Developed Sqoop scripts to migrate data from Teradata and Oracle to the big data environment.
  • Experience importing and exporting data between HDFS and relational database systems using Sqoop.
  • Hands-on experience installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera distributions (CDH3, CDH4, and YARN-based CDH 5.x).
  • Implemented a real-time data streaming pipeline using AWS Kinesis, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets.
  • Worked on large-scale data transfers across different Hadoop clusters and implemented new technology stacks on Hadoop clusters using Apache Spark.
  • Added support for AWS S3 and RDS to host static/media files and the database in the Amazon cloud.
  • Experience in project deployment using Heroku/Jenkins and web services such as Amazon Web Services (AWS) EC2, S3, Auto Scaling, CloudWatch, and SNS.
  • Performed data scrubbing and processing with Oozie for workflow automation and coordination.
  • Hands-on experience analyzing log files for Hadoop and ecosystem services and finding root causes.
  • Hands-on experience handling different file formats such as Avro, Parquet, SequenceFiles, MapFiles, CSV, XML, log files, ORC, and RCFile.
  • Experience with NoSQL databases: HBase, Cassandra, and MongoDB.
  • Experience with AIX/Linux (RHEL), Unix shell scripting, and SQL Server 2008.
  • Worked with Elasticsearch for data search and Logstash for data collection.
  • Strong knowledge of Hadoop cluster installation, capacity planning, performance tuning, benchmarking, disaster recovery planning, and application deployment in production clusters.
  • Experience developing stored procedures and triggers using SQL and PL/SQL in relational databases such as MS SQL Server 2005/2008.
  • Exposure to Scrum, Agile, and Waterfall methodologies.
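
A minimal PySpark sketch of the multi-format extraction and Spark SQL aggregation described above; the paths, view names, and columns are illustrative placeholders rather than details from any specific project:

```python
# Minimal sketch: read two file formats, register temp views, and aggregate
# usage data with Spark SQL. Paths and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-aggregation").getOrCreate()

# Extraction from multiple file formats
events = spark.read.parquet("/data/raw/events/")                         # columnar source
customers = spark.read.option("header", True).csv("/data/raw/customers.csv")

events.createOrReplaceTempView("events")
customers.createOrReplaceTempView("customers")

# Transformation and aggregation with Spark SQL
usage = spark.sql("""
    SELECT c.customer_id,
           c.segment,
           COUNT(*)            AS event_count,
           SUM(e.duration_sec) AS total_duration_sec
    FROM events e
    JOIN customers c ON e.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
""")

usage.write.mode("overwrite").parquet("/data/curated/customer_usage/")
```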

TECHNICAL SKILLS

Programming Languages: Java, Python, SQL, and C/C++

Big Data Ecosystem: Hadoop, MapReduce, Kafka, Spark, Pig, Hive, YARN, Flume, Sqoop, Oozie, Zookeeper, Talend.

Hadoop Distributions: Cloudera Enterprise, Hortonworks, EMC Pivotal.

Databases: Oracle, SQL Server, PostgreSQL.

Web Technologies: HTML, XML, jQuery, Ajax, CSS, JavaScript, JSON.

Streaming Tools: Kafka

Testing: Hadoop Testing, Hive Testing, MRUnit.

Operating Systems: Linux Red Hat/Ubuntu/CentOS, Windows 10/8.1/7/XP.

Cloud: AWS EMR, Glue, RDS, CloudWatch, S3, Redshift Cluster, Kinesis, DynamoDB.

Technologies and Tools: Servlets, JSP, Spring (Boot, MVC, Batch, Security), Web Services, Hibernate, Maven, GitHub, Bamboo.

Application Servers: Tomcat, JBoss.

IDEs: Eclipse, NetBeans, IntelliJ.

PROFESSIONAL EXPERIENCE

Confidential, Tampa, FL

Azure Data Engineer

Responsibilities:

  • Built data pipeline architecture on the Azure cloud platform using NiFi, Azure Data Lake Storage, Azure HDInsight, Airflow, and data engineering tools.
  • Designed and developed scalable and cost-effective architecture in Azure Big Data services for data life cycle of collection, ingestion, storage, processing, and visualization.
  • Design and implement database solutions in Azure SQL Data Warehouse and Azure SQL Database (a minimal load sketch follows this list).
  • Architect and implement medium- to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
  • Design and implement migration strategies for traditional systems to Azure (lift and shift, Azure Migrate, and other third-party tools).
  • Engage with business users to gather requirements, design visualizations and provide training to use self-service BI tools.
  • Pulled data into Power BI from various sources such as SQL Server, Excel, Oracle, and Azure SQL.
  • Propose architectures considering cost/spend in Azure and develop recommendations to right-size data infrastructure.
  • Develop conceptual solutions & create proof-of-concepts to demonstrate viability of solutions.
  • Technically guide projects through to completion within target timeframes.
  • Collaborate with application architects and DevOps.
  • Identify and implement best practices, tools and standards.
  • Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
  • Built complex distributed systems involving large-scale data handling, metrics collection, data pipeline construction, and analytics.
  • Designed and implemented internal process improvements: automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability, and tuning performance.
  • Implemented data quality and content validation using tools such as Spark, Scala, Hive, and NiFi.
  • Involved in creating an end-to-end data pipeline in a distributed environment using big data tools, the Spark framework, and Power BI for data visualization.
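
A minimal PySpark sketch of one such load step: reading curated data from Azure Data Lake Storage Gen2 and loading it into Azure SQL over JDBC. The storage account, container, server, table, and credential names are illustrative assumptions, not project details:

```python
# Minimal sketch: read curated data from ADLS Gen2 and load it into an
# Azure SQL table over JDBC. Names and credentials are placeholders; the
# cluster is assumed to have ADLS credentials and the SQL Server JDBC
# driver already configured.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adls-to-azure-sql").getOrCreate()

# ADLS Gen2 path (illustrative storage account and container)
source_path = "abfss://curated@examplestorage.dfs.core.windows.net/sales/"
sales = spark.read.parquet(source_path)

# Aggregate before loading into the serving database
daily_totals = (
    sales.groupBy("order_date", "region")
         .agg(F.sum("amount").alias("total_amount"),
              F.count(F.lit(1)).alias("order_count"))
)

# Generic JDBC writer against Azure SQL Database / Azure SQL Data Warehouse
jdbc_url = "jdbc:sqlserver://example-server.database.windows.net:1433;database=analytics"
(daily_totals.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.daily_sales_totals")
    .option("user", "etl_user")
    .option("password", "<secret-from-key-vault>")
    .mode("append")
    .save())
```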

Confidential, Washington.

Big Data Developer

Responsibilities:

  • Implemented a proof of concept (POC) for migrating ETL Ab Initio graphs to Spark using Scala and Python (PySpark).
  • Developed data pipelines using Sqoop, Spark, MapReduce, and Hive to ingest, transform, and analyze customer behavior data.
  • Developed a data pipeline using Kafka, Spark Streaming, and Hive to ingest data from data lakes into the Hadoop Distributed File System (HDFS).
  • Implemented Spark using Python and Spark SQL for faster processing of data and algorithms for real-time analysis in Spark.
  • Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for high data volumes.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
  • Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the transformed data back into HDFS.
  • Extracted files from an RDBMS (DB2) into the Hadoop file system (HDFS) using Sqoop to feed the workflow.
  • Implemented partitioning and bucketing for faster query processing in Hive Query Language (HQL).
  • Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames, Datasets, and user-defined functions (UDFs).
  • Designed Hive queries and Pig scripts to perform data analysis, data transfer, and table design.
  • Evaluated data between the ETL system and Hadoop to ensure data quality.
  • Responsible for creating mappings and workflows to extract and load data from relational databases, flat-file sources, and legacy systems.
  • Tested the Apache Tez and Hadoop MapReduce frameworks for building high-performance batch and interactive data processing applications.
  • Reconciled data daily between the ETL output and Hive tables using a compare tool implemented in the Spark framework with PySpark (a minimal sketch follows this list).
  • Fine-tuned Hadoop applications for high performance and throughput, and troubleshot and debugged Hadoop ecosystem runtime issues.
  • Performed data validation between ETL output and Apache Hive tables.
  • Developed Linux shell scripts for deploying and running the migrated Hadoop application on production servers.
  • Developed workflows for scheduling and orchestrating the Hadoop processes.
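
A minimal PySpark sketch of the daily reconciliation idea described above: compare the ETL-side extract against the corresponding Hive table and surface rows that differ. The paths, table names, and use of exceptAll (Spark 2.4+) are illustrative assumptions:

```python
# Minimal sketch: daily reconciliation between an ETL extract and a Hive
# table. Paths and table names are illustrative; exceptAll needs Spark 2.4+.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("etl-hive-reconciliation")
         .enableHiveSupport()
         .getOrCreate())

etl_df = spark.read.parquet("/data/etl/output/customer_txn/")   # ETL-side extract
hive_df = spark.table("warehouse.customer_txn")                 # Hive-side table

cols = sorted(etl_df.columns)

# Rows present on one side but not the other (duplicate-aware set difference)
missing_in_hive = etl_df.select(*cols).exceptAll(hive_df.select(*cols))
extra_in_hive = hive_df.select(*cols).exceptAll(etl_df.select(*cols))

print("rows missing in Hive:", missing_in_hive.count())
print("unexpected rows in Hive:", extra_in_hive.count())
```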

Confidential, Greenwood Village, CO.

Hadoop Developer

Responsibilities:

  • Designed and developed scalable and cost-effective architecture in AWS Big Data services for data life cycle of collection, ingestion, storage, processing, and visualization.
  • Involved in creating End-to-End data pipeline within distributed environment using the Big data tools, Spark framework and Tableau for data visualization.
  • Ensure that the application continues to function normally through software maintenance and testing in the production environment.
  • Leveraged Spark features such as in-memory processing, distributed cache, broadcast variables, accumulators, and map-side joins to implement data preprocessing pipelines with minimal latency (a broadcast-join sketch follows this section).
  • Implemented real-time solutions for money movement and transactional data using Kafka, Spark Streaming, and HBase.
  • The project also included a range of big data tools and programming languages such as Sqoop, Python, and Oozie.
  • Worked on scheduling Oozie workflow engine to run multiple jobs.
  • Experience creating a Python topology script to generate the CloudFormation template for creating the EMR cluster in AWS.
  • Good knowledge of AWS services such as EC2, EMR, S3, Service Catalog, and CloudWatch.
  • Experience using Spark SQL to handle structured data from Hive on the AWS EMR platform (m4.xlarge and m5.12xlarge clusters).
  • Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Experienced in handling large datasets during the ingestion process itself using partitioning, Spark in-memory capabilities, broadcasts, and effective and efficient joins and transformations.
  • Experienced in optimizing Hive queries, joins to handle different data sets.
  • Involved in creating Hive tables (Managed tables and External tables), loading and analyzing data using hive queries.
  • Actively involved in code review and bug fixing for improving the performance.
  • Good experience in handling data manipulation using python Scripts.
  • Involved in developing, building, testing, and deploying to the Hadoop cluster in distributed mode.
  • Created a Splunk dashboard to capture the logs for the end-to-end data ingestion process.
  • Wrote unit test cases for PySpark code as part of the CI/CD process.
  • Good knowledge of configuration management and CI/CD tools such as Bitbucket/GitHub and Bamboo.

Environment: Hadoop, Cloudera, Flume, HBase, HDFS, MapReduce, YARN, Hive, Pig, Sqoop, Oozie, agile methodologies, UNIX
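
A minimal PySpark sketch of the map-side (broadcast) join technique mentioned above; the S3 paths, dataset names, and join key are illustrative assumptions:

```python
# Minimal sketch: broadcast (map-side) join of a large fact table with a
# small dimension table so the large side is never shuffled. Paths, dataset
# names, and the join key are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-example").getOrCreate()

transactions = spark.read.parquet("s3://example-bucket/curated/transactions/")  # large fact table
accounts = spark.read.parquet("s3://example-bucket/curated/accounts/")          # small dimension table

# Hint Spark to broadcast the small side to every executor
enriched = transactions.join(broadcast(accounts), on="account_id", how="left")

enriched.write.mode("overwrite").parquet("s3://example-bucket/curated/enriched_transactions/")
```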

Confidential, New York.

Hadoop Developer

Responsibilities:

  • Developed a data pipeline using Sqoop, Spark, MapReduce, and Hive to ingest, transform, and analyze customer behavioral data.
  • Implemented Spark using Python and Spark SQL for faster processing of data and algorithms for real-time analysis in Spark.
  • Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for high data volumes.
  • Used the Spark-Cassandra connector to load data to and from Cassandra, and streamed data in real time using Spark with Kafka (a Structured Streaming sketch follows this section).
  • Developed Kafka producers and consumers in Java, integrated them with Apache Storm, and ingested data into HDFS and HBase by implementing the rules in Storm.
  • Developed efficient MapReduce programs in Python to perform batch processing on large unstructured datasets.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
  • Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the transformed data back into HDFS.
  • Exported the analyzed data to relational databases using Sqoop for further visualization and report generation by the BI team.
  • Analyzed the data by performing Hive queries (Hive QL) and running Pig scripts (Pig Latin) to study customer behavior.
  • Created HBase tables and column families to store the user event data and wrote automated HBase test cases for data quality checks using HBase command line tools.
  • Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
  • Scheduled and executed workflows in Oozie to run Hive and Pig jobs and created UDF's to store specialized data structures in HBase and Cassandra.
  • Developed a NiFi workflow to pick up multiple retail files from an FTP location and move them to HDFS on a daily basis.
  • Worked with developer teams on a NiFi workflow to pick up data from a REST API server, the data lake, and an SFTP server and send it to the Kafka broker.
  • Evaluated Hortonworks NiFi (HDF 2.0) and recommended a solution to ingest data from multiple data sources into HDFS and Hive using NiFi, including importing data from Linux servers with the NiFi tool.
  • Developed product profiles using Pig and commodity UDFs & developed Hive scripts in Hive QL to de-normalize and aggregate the data.
  • Optimizing existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Tuned Spark/Python code to improve the performance of machine learning algorithms for data analysis.
  • Performed data validation on the data ingested using MapReduce by building a custom model to filter all the invalid data and cleanse the data.
  • Developed interactive shell scripts for scheduling various data cleansing and data loading process.
  • Continuous monitoring and managing the Hadoop cluster using Cloudera Manager.
  • Applied Teradata and DBMS concepts for early instance creation.

Environment: Hadoop, MapReduce, YARN, Spark, Hive, Pig, Kafka, HBase, Oozie, Sqoop, Python, Bash/Shell Scripting, Flume, Cassandra, Oracle, Core Java, Storm, HDFS, Unix, Teradata, NiFi, Eclipse
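
A minimal sketch of the Kafka-to-HDFS streaming flow described above, written with Spark Structured Streaming rather than the older DStream API; broker addresses, topic, and output paths are illustrative assumptions, and the spark-sql-kafka connector is assumed to be on the classpath:

```python
# Minimal sketch: consume a Kafka topic with Spark Structured Streaming and
# land the payload on HDFS as Parquet. Brokers, topic, and paths are
# illustrative; requires the spark-sql-kafka connector package.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
          .option("subscribe", "user-events")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers the payload as binary; cast it to string for downstream parsing
parsed = events.select(col("value").cast("string").alias("json_payload"),
                       col("timestamp"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/user_events/")
         .option("checkpointLocation", "hdfs:///checkpoints/user_events/")
         .outputMode("append")
         .start())

query.awaitTermination()
```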

Confidential

SQL Developer

Responsibilities:

  • Research and recommend suitable technology stack for Hadoop migration considering current enterprise architecture.
  • Extensively used the Spark stack to develop preprocessing jobs that use the RDD, Dataset, and DataFrame APIs to transform data for upstream consumption.
  • Developed real-time data processing applications using Scala and Python, and implemented Apache Spark Streaming from various streaming sources such as Kafka, Flume, and JMS.
  • Worked on extracting and enriching HBase data across multiple tables using joins in Spark.
  • Worked on writing APIs to load the processed data to HBase tables.
  • Replaced existing MapReduce programs with Spark applications written in Scala.
  • Built on-premises data pipelines using Kafka and Spark Streaming, consuming the feed from the API streaming gateway REST service.
  • Developed Hive UDFs to handle data quality and create filtered datasets for further processing (a PySpark sketch of the same idea follows this list).
  • Experienced in writing Sqoop scripts to import data into Hive/HDFS from RDBMS.
  • Good knowledge of the Kafka Streams API for data transformation.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Used the Talend tool to create workflows for processing data from multiple source systems.
  • Created sample flows in Talend and StreamSets with custom-coded JARs and analyzed the performance of StreamSets and Kafka Streams.
  • Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Optimized HiveQL/Pig scripts by using execution engines such as Tez and Spark.
  • Developed Hive Queries to analyze the data in HDFS to identify issues and behavioral patterns.
  • Involved in writing optimized Pig Script along with developing and testing Pig Latin Scripts.
  • Deployed applications using the Jenkins framework, integrating Git version control with it.
  • Participated in production support on a regular basis to support the analytics platform.
  • Used Rally for task/bug tracking.
  • Used GIT for version control.
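
A minimal sketch of the data-quality filtering idea above, expressed as a PySpark UDF rather than a Hive UDF to stay consistent with the other Python examples; table names, columns, and validation rules are illustrative assumptions:

```python
# Minimal sketch: data-quality filtering with a PySpark UDF (analogous to a
# Hive UDF). Table names, columns, and rules are illustrative.
import re

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

spark = (SparkSession.builder
         .appName("dq-filter")
         .enableHiveSupport()
         .getOrCreate())

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

@udf(returnType=BooleanType())
def is_valid_record(email, amount):
    """Return True only when the email looks well formed and the amount is non-negative."""
    if email is None or amount is None:
        return False
    return bool(EMAIL_RE.match(email)) and amount >= 0

orders = spark.table("staging.orders")

# Keep valid rows for the curated table, quarantine the rest
clean = orders.filter(is_valid_record(col("customer_email"), col("order_amount")))
rejected = orders.filter(~is_valid_record(col("customer_email"), col("order_amount")))

clean.write.mode("overwrite").saveAsTable("curated.orders")
rejected.write.mode("overwrite").saveAsTable("quarantine.orders_rejected")
```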
