Azure Data Engineer Resume
Virginia Beach, Virginia
SUMMARY
- Around 8 years of total IT experience, including over 5 years in Big Data/Hadoop and in the development and design of Java-based enterprise applications.
- Extensive working experience with Hadoop ecosystem components such as HDFS, MapReduce, Hive, Sqoop, Flume, Spark, Kafka, Oozie and ZooKeeper.
- Implemented performance tuning techniques for Spark SQL queries.
- Strong knowledge of Hadoop HDFS architecture and the MapReduce (MRv1) and YARN (MRv2) frameworks.
- Strong hands-on experience publishing messages to Kafka topics using Apache NiFi and consuming them into HBase using Spark and Python.
- Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation and aggregation across multiple file formats to uncover insights into customer usage patterns.
- Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors and tasks.
- Experience with MS SQL Server Integration Services (SSIS), T-SQL, stored procedures and triggers.
- Designed and developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats (a brief illustrative sketch appears at the end of this summary).
- Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL Database.
- Experience with Azure Data Factory (ADF), Integration Runtime (IR), file system data ingestion and relational data ingestion.
- Created Spark jobs that process the source files and performed various transformations on the source data using the Spark DataFrame and Spark SQL APIs.
- Developed Sqoop scripts to migrate data from Teradata and Oracle to the big data environment.
- Experience importing and exporting data between HDFS and relational database systems using Sqoop.
- Hands-on experience installing, configuring, supporting and managing Hadoop clusters using Apache and Cloudera (CDH3, CDH4, and YARN-based CDH 5.x) distributions.
- Implemented a real-time data streaming pipeline using AWS Kinesis, Lambda and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets.
- Worked on large-scale data transfer across different Hadoop clusters and implemented new technology stacks on Hadoop clusters using Apache Spark.
- Added support for AWS S3 and RDS to host static/media files and the database in the Amazon cloud.
- Experience in project deployment using Heroku/Jenkins and Amazon Web Services (AWS) such as EC2, S3, Auto Scaling, CloudWatch and SNS.
- Performed data scrubbing and processing, using Oozie for workflow automation and coordination.
- Hands-on experience analyzing log files for Hadoop and ecosystem services and finding root causes.
- Hands-on experience handling different file formats such as Avro, Parquet, SequenceFiles, MapFiles, CSV, XML, log, ORC and RC.
- Experience with the NoSQL databases HBase, Cassandra and MongoDB.
- Experience with AIX, Linux (RHEL), UNIX shell scripting and SQL Server 2008.
- Worked with the data search tool Elasticsearch and the data collection tool Logstash.
- Strong knowledge of Hadoop cluster installation, capacity planning, performance tuning, benchmarking, disaster recovery planning and application deployment in production clusters.
- Experience developing stored procedures and triggers using SQL and PL/SQL in relational databases such as MS SQL Server 2005/2008.
- Exposure to Scrum, Agile and Waterfall methodologies.
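
A minimal, hypothetical PySpark/Spark SQL sketch of the multi-format extraction and usage-pattern aggregation described above; the paths, formats and column names are illustrative assumptions, not production code.

```python
# Illustrative sketch only: paths, formats and column names are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("usage-aggregation-sketch").getOrCreate()

# Extract from multiple file formats (CSV and Parquet assumed here, same schema).
usage_csv = spark.read.option("header", True).csv("/data/usage_csv/")
usage_parquet = spark.read.parquet("/data/usage_parquet/")

# Transform and aggregate to surface customer usage patterns.
usage = usage_csv.unionByName(usage_parquet)
daily = (usage
         .withColumn("event_date", F.to_date("event_ts"))
         .groupBy("customer_id", "event_date")
         .agg(F.count("*").alias("events"),
              F.sum("duration_sec").alias("total_duration_sec")))

# Spark SQL view for downstream ad hoc analysis.
daily.createOrReplaceTempView("daily_usage")
top_users = spark.sql("""
    SELECT customer_id, SUM(events) AS total_events
    FROM daily_usage
    GROUP BY customer_id
    ORDER BY total_events DESC
    LIMIT 10
""")
top_users.show()
```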
TECHNICAL SKILLS
Programming Languages: Java, Python, SQL, and C/C++
Big Data Ecosystem: Hadoop, MapReduce, Kafka, Spark, Pig, Hive, YARN, Flume, Sqoop, Oozie, Zookeeper, Talend.
Hadoop Distributions: Cloudera Enterprise, Hortonworks, EMC Pivotal.
Databases: Oracle, SQL Server, PostgreSQL.
Web Technologies: HTML, XML, jQuery, Ajax, CSS, JavaScript, JSON.
Streaming Tools: Kafka
Testing: Hadoop Testing, Hive Testing, MRUnit.
Operating Systems: Linux Red Hat/Ubuntu/CentOS, Windows 10/8.1/7/XP.
Cloud: AWS EMR, Glue, RDS, CloudWatch, S3, Redshift, Kinesis, DynamoDB.
Technologies and Tools: Servlets, JSP, Spring (Boot, MVC, Batch, Security), Web Services, Hibernate, Maven, GitHub, Bamboo.
PROFESSIONAL EXPERIENCE
Confidential, Virginia Beach, Virginia
Azure Data Engineer
Responsibilities:
- Build data pipeline architecture on the Azure cloud platform using NiFi, Azure Data Lake Storage, Azure HDInsight, Airflow and data engineering tools.
- Design and develop scalable, cost-effective architectures with Azure big data services covering the data life cycle of collection, ingestion, storage, processing and visualization.
- Design and implement database solutions in Azure SQL Data Warehouse and Azure SQL Database.
- Architect and implement medium to large-scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Design and implement migration strategies for moving traditional systems to Azure (lift and shift, Azure Migrate, and other third-party tools).
- Engage with business users to gather requirements, design visualizations and provide training to use self-service BI tools.
- Pull data into Power BI from various sources such as SQL Server, Excel, Oracle and Azure SQL.
- Propose architectures considering cost/spend in Azure and develop recommendations to right-size data infrastructure.
- Develop conceptual solutions and create proofs of concept to demonstrate the viability of solutions.
- Technically guide projects through to completion within target timeframes.
- Collaborate with application architects and DevOps.
- Identify and implement best practices, tools and standards.
- Design, set up, maintain and administer Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse and Azure Data Factory.
- Build complex distributed systems involving large-scale data handling, metrics collection, data pipeline construction and analytics.
- Design and implement internal process improvements: automating manual processes, optimizing data delivery, redesigning infrastructure for greater scalability, and performance tuning.
- Implement data quality and content validation using tools such as Spark, Scala, Hive and NiFi (see the sketch after this list).
- Create end-to-end data pipelines in a distributed environment using big data tools, the Spark framework and Power BI for data visualization.
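
Below is a minimal sketch, under assumed storage-account, container and column names, of the kind of Spark-based data-quality/content-validation step referenced above; it is illustrative only, not the production pipeline.

```python
# Illustrative sketch: the ADLS Gen2 paths and the claim_id/amount columns are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adls-data-quality-sketch").getOrCreate()

src = "abfss://raw@examplelake.dfs.core.windows.net/claims/"      # hypothetical ADLS Gen2 source
dst = "abfss://curated@examplelake.dfs.core.windows.net/claims/"  # hypothetical curated zone

claims = spark.read.parquet(src)

# Basic content validation: reject rows with missing keys or non-positive amounts.
valid = claims.filter(F.col("claim_id").isNotNull() & (F.col("amount") > 0))
rejected = claims.subtract(valid)

valid.write.mode("overwrite").parquet(dst)
rejected.write.mode("overwrite").parquet(dst.rstrip("/") + "_rejected/")
```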
Confidential, Valley Forge, PA
Big Data Developer
Responsibilities:
- Implemented a proof of concept (POC) for migrating Ab Initio ETL graphs into Spark using Scala and Python (PySpark).
- Developed data pipelines using Sqoop, Spark, MapReduce and Hive to ingest, transform and analyze customer behavior data.
- Developed a data pipeline using Kafka, Spark Streaming and Hive to ingest data from data lakes into the Hadoop Distributed File System (see the sketch after this list).
- Implemented Spark jobs using Python and Spark SQL for faster data processing and for real-time analysis algorithms in Spark.
- Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL databases handling large data volumes.
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and Python.
- Imported data from different sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and loaded the results back into HDFS.
- Extracted files from an RDBMS (DB2) into the Hadoop file system (HDFS) using Sqoop for workflow processing.
- Implemented partitioning and bucketing for faster query processing in Hive Query Language (HQL).
- Converted Hive/SQL queries into Spark transformations using Spark DataFrames, Datasets and user-defined functions (UDFs).
- Designed Hive queries and Pig scripts to perform data analysis, data transfer and table design.
- Evaluated data between the ETL and Hadoop environments to ensure data quality.
- Created mappings and workflows to extract and load data from relational databases, flat-file sources and legacy systems.
- Tested the Apache Tez and Hadoop MapReduce frameworks for building high-performance batch and interactive data processing applications.
- Reconciled data daily between the ETL and Hive tables using a compare tool implemented on the Spark framework with PySpark.
- Fine-tuned Hadoop applications for high performance and throughput, and troubleshot and debugged Hadoop ecosystem runtime issues.
- Performed data validation between ETL and Apache Hive tables.
- Developed Linux shell scripts for deploying and running the migrated Hadoop applications on production servers.
- Developed workflows for scheduling and orchestrating Hadoop processes.
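
A minimal sketch of the Kafka-to-HDFS ingestion pattern referenced above, written here with Spark Structured Streaming; the broker, topic, schema and paths are assumptions and the actual pipeline may have differed.

```python
# Illustrative sketch: requires the spark-sql-kafka connector on the classpath;
# broker, topic, schema and HDFS paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("kafka-ingest-sketch").getOrCreate()

schema = StructType([
    StructField("customer_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
          .option("subscribe", "customer-events")              # hypothetical topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Land the parsed events on HDFS as Parquet, partitioned by event date.
query = (events
         .withColumn("event_date", col("event_ts").cast("date"))
         .writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/customer_events")               # hypothetical path
         .option("checkpointLocation", "hdfs:///checkpoints/customer_events")
         .partitionBy("event_date")
         .start())
query.awaitTermination()
```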
Confidential
SQL Developer
Responsibilities:
- Researched and recommended a suitable technology stack for Hadoop migration, considering the current enterprise architecture.
- Extensively used the Spark stack to develop preprocessing jobs that use the RDD, Dataset and DataFrame APIs to transform data for upstream consumption (see the sketch after this list).
- Developed real-time data processing applications using Scala and Python and implemented Apache Spark Streaming against streaming sources such as Kafka, Flume and JMS.
- Replaced existing MapReduce programs with Spark applications written in Scala.
- Built on-premises data pipelines using Kafka and Spark Streaming, consuming the feed from an API streaming gateway REST service.
- Developed Hive UDFs to handle data quality and create filtered datasets for further processing.
- Wrote Sqoop scripts to import data into Hive/HDFS from RDBMS sources.
- Gained good knowledge of the Kafka Streams API for data transformation.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Used Talend to create workflows for processing data from multiple source systems.
- Created sample flows in Talend and StreamSets with custom-coded JARs and analyzed the performance of StreamSets and Kafka Streams.
- Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
- Optimized HiveQL/Pig scripts by using execution engines such as Tez and Spark.
- Developed Hive queries to analyze data in HDFS and identify issues and behavioral patterns.
- Wrote optimized Pig scripts and developed and tested Pig Latin scripts.
- Deployed applications using the Jenkins framework integrated with Git version control.
- Participated in production support on a regular basis to support the analytics platform.
- Used Rally for task/bug tracking.
- Used Git for version control.
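
A minimal sketch, under assumed input layout and paths, of the kind of Spark preprocessing job referenced above: RDD-level parsing feeding a DataFrame that is filtered and written out for upstream consumption.

```python
# Illustrative sketch: the pipe-delimited layout (id|name|amount) and paths are assumptions.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("preprocess-sketch").getOrCreate()

raw = spark.sparkContext.textFile("hdfs:///data/raw/transactions")   # hypothetical input

# Parse raw lines with the RDD API, skipping malformed records; assumes numeric amounts.
parsed_rdd = (raw.map(lambda line: line.split("|"))
                 .filter(lambda f: len(f) == 3)
                 .map(lambda f: Row(id=f[0], name=f[1], amount=float(f[2]))))

parsed = spark.createDataFrame(parsed_rdd)

# Simple data-quality filter before handing off to downstream consumers.
clean = parsed.filter("id IS NOT NULL AND amount >= 0")
clean.write.mode("overwrite").parquet("hdfs:///data/curated/transactions")  # hypothetical output
```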