Azure Data Engineer Resume
Tampa, FL
SUMMARY
- Around 8 years of total IT experience, including over 5 years in Big Data/Hadoop development and in the design and development of Java-based enterprise applications.
- Extensive working experience with Hadoop ecosystem components such as HDFS, MapReduce, Hive, Sqoop, Flume, Spark, Kafka, Oozie, and ZooKeeper.
- Implemented performance tuning techniques for Spark-SQL queries.
- Strong knowledge of HDFS architecture and the MapReduce (MRv1) and YARN (MRv2) frameworks.
- Strong hands-on experience publishing messages to various Kafka topics using Apache NiFi and consuming them into HBase using Spark and Python.
- Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation across multiple file formats to uncover insights into customer usage patterns.
- Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
- Experience with MS SQL Server Integration Services (SSIS), T-SQL, stored procedures, and triggers.
- Design and develop Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats (see the illustrative sketch at the end of this summary).
- Design and implement database solutions in Azure SQL Data Warehouse and Azure SQL Database.
- Azure Data Factory (ADF), Integration Runtime (IR), file system data ingestion, relational data ingestion.
- Created Spark jobs that process source files and performed various transformations on the source data using the Spark DataFrame and Spark SQL APIs.
- Developed Sqoop scripts to migrate data from Teradata and Oracle into the big data environment.
- Experience importing and exporting data between HDFS and relational database systems using Sqoop.
- Hands-on experience installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera (CDH3, CDH4, CDH 5.x/YARN) distributions.
- Implemented a real-time data streaming pipeline using AWS Kinesis, Lambda, and DynamoDB, and deployed AWS Lambda code from Amazon S3 buckets.
- Worked on large-scale data transfer across different Hadoop clusters and implemented new technology stacks on Hadoop clusters using Apache Spark.
- Added support for AWS S3 and RDS to host static/media files and the database in the Amazon cloud.
- Experience in project deployment using Heroku/Jenkins and AWS services such as EC2, S3, Auto Scaling, CloudWatch, and SNS.
- Performed data scrubbing and processing with Oozie for workflow automation and coordination.
- Hands-on experience analyzing log files for Hadoop and ecosystem services and finding root causes.
- Hands-on experience handling different file formats such as Avro, Parquet, SequenceFile, MapFile, CSV, XML, log, ORC, and RC.
- Experience with NoSQL databases HBase, Cassandra, and MongoDB.
- Experience with AIX/Linux RHEL, Unix shell scripting, and SQL Server 2008.
- Worked with the Elasticsearch search engine and the Logstash data collection tool.
- Strong knowledge of Hadoop cluster installation, capacity planning, performance tuning, benchmarking, disaster recovery planning, and application deployment in production clusters.
- Experience developing stored procedures and triggers using SQL and PL/SQL in relational databases such as MS SQL Server 2005/2008.
- Exposure to Scrum, Agile, and Waterfall methodologies.
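Illustrative sketch for the PySpark/Spark SQL item above: a minimal example of extraction, transformation, and aggregation across multiple file formats. Paths, columns, and output locations are assumptions for illustration only.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("usage-insights").getOrCreate()

  # Extract from multiple file formats (paths and columns are hypothetical).
  events_csv = spark.read.option("header", "true").csv("/data/raw/events_csv/")
  events_parquet = spark.read.parquet("/data/raw/events_parquet/")

  # Transform: align columns and combine both sources.
  events = events_csv.select("user_id", "event_type", "event_ts") \
      .unionByName(events_parquet.select("user_id", "event_type", "event_ts"))

  # Aggregate with Spark SQL to surface usage patterns.
  events.createOrReplaceTempView("events")
  usage = spark.sql("""
      SELECT user_id, event_type, COUNT(*) AS event_count
      FROM events
      GROUP BY user_id, event_type
  """)
  usage.write.mode("overwrite").parquet("/data/curated/usage_patterns/")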
TECHNICAL SKILLS
Programming Languages: Java, Python, SQL, and C/C++
Big Data Ecosystem: Hadoop, MapReduce, Kafka, Spark, Pig, Hive, YARN, Flume, Sqoop, Oozie, Zookeeper, Talend.
Hadoop Distributions: Cloudera Enterprise, Hortonworks, EMC Pivotal.
Databases: Oracle, SQL Server, PostgreSQL.
Web Technologies: HTML, XML, jQuery, Ajax, CSS, JavaScript, JSON.
Streaming Tools: Kafka
Testing: Hadoop Testing, Hive Testing, MRUnit.
Operating Systems: Linux Red Hat/Ubuntu/CentOS, Windows 10/8.1/7/XP.
Cloud: AWS EMR, Glue, RDS, CloudWatch, S3, Redshift Cluster, Kinesis, DynamoDB.
Technologies and Tools: Servlets, JSP, Spring (Boot, MVC, Batch, Security), Web Services, Hibernate, Maven, GitHub, Bamboo.
Application Servers: Tomcat, JBoss.
IDEs: Eclipse, NetBeans, IntelliJ.
PROFESSIONAL EXPERIENCE
Confidential, Tampa, FL
Azure Data Engineer
Responsibilities:
- Built data pipeline architecture on the Azure cloud platform using NiFi, Azure Data Lake Storage, Azure HDInsight, Airflow, and data engineering tools.
- Designed and developed scalable and cost-effective architecture in Azure Big Data services for data life cycle of collection, ingestion, storage, processing, and visualization.
- Design and implement database solutions in Azure SQL Data Warehouse, Azure SQL.
- Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Design and implement migration strategies for traditional systems on Azure (lift and shift, Azure Migrate, and other third-party tools).
- Engage with business users to gather requirements, design visualizations and provide training to use self-service BI tools.
- Pulled data into Power BI from various sources such as SQL Server, Excel, Oracle, and Azure SQL.
- Propose architectures considering cost/spend in Azure and develop recommendations to right-size data infrastructure.
- Develop conceptual solutions & create proof-of-concepts to demonstrate viability of solutions.
- Technically guide projects through to completion within target timeframes.
- Collaborate with application architects and DevOps.
- Identify and implement best practices, tools and standards.
- Design, set up, maintain, and administer Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
- Build complex distributed systems involving large-scale data handling, metrics collection, data pipeline construction, and analytics.
- Design and implement internal process improvements: automating manual processes, optimizing data delivery, re-designing infrastructure for greater scalability, and performance tuning.
- Implement data quality and content validation using tools such as Spark, Scala, Hive, and NiFi (see the illustrative sketch at the end of this section).
- Involved in creating an end-to-end data pipeline in a distributed environment using big data tools, the Spark framework, and Power BI for data visualization.
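Illustrative sketch for the data quality and content validation item above: a minimal PySpark check, assuming hypothetical Hive table names, columns, and rules.

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("dq-validation").enableHiveSupport().getOrCreate()

  source = spark.table("staging.customer_raw")   # assumed staging table
  target = spark.table("curated.customer")       # assumed curated table

  # Row-count reconciliation between source and target.
  src_count, tgt_count = source.count(), target.count()
  if src_count != tgt_count:
      raise ValueError(f"Row count mismatch: {src_count} vs {tgt_count}")

  # Content validation: key columns must be populated and within expected values.
  null_keys = target.filter(F.col("customer_id").isNull()).count()
  bad_status = target.filter(~F.col("status").isin("ACTIVE", "INACTIVE")).count()
  if null_keys > 0 or bad_status > 0:
      raise ValueError(f"DQ failure: {null_keys} null keys, {bad_status} invalid statuses")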
Confidential, Washington.
Big Data Developer
Responsibilities:
- Implemented a proof of concept (POC) for migrating Ab Initio ETL graphs to Spark using Scala and Python (PySpark).
- Developed data pipelines using Sqoop, Spark, MapReduce, and Hive to ingest, transform, and analyze customer behavior data.
- Developed a data pipeline using Kafka, Spark Streaming and Hive to ingest the data from data lakes to Hadoop distributed file system.
- Implemented Spark jobs using Python and Spark SQL for faster data processing and for real-time analysis algorithms in Spark.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for high data volumes.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
- Handled importing data from different sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and loading the transformed data back into HDFS.
- Extracted files from an RDBMS (DB2) into the Hadoop Distributed File System (HDFS) using Sqoop for workflow processing.
- Implemented partitioning and bucketing for faster query processing in Hive Query Language (HQL).
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames, Datasets, and user-defined functions (UDFs); see the illustrative sketch at the end of this section.
- Designed Hive queries and Pig scripts to perform data analysis, data transfer, and table design.
- Evaluated data between the ETL system and Hadoop to ensure data quality.
- Responsible for creating mappings and workflows to extract and load data from relational databases, flat file sources, and legacy systems.
- Tested the Apache Tez and Hadoop MapReduce frameworks for building high-performance batch and interactive data processing applications.
- Reconciled data daily between the ETL and Hive tables using a compare tool implemented in the Spark framework with PySpark.
- Fine-tuned Hadoop applications for high performance and throughput; troubleshot and debugged Hadoop ecosystem runtime issues.
- Performed data validation between ETL and Apache Hive tables.
- Developed Linux shell scripts for deploying and running the migrated Hadoop applications on production servers.
- Developed workflows for scheduling and orchestrating Hadoop processes.
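Illustrative sketch for the Hive-to-Spark conversion item above: a hypothetical HiveQL aggregation rewritten as equivalent Spark DataFrame transformations (table and column names are assumptions).

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("hive-to-spark").enableHiveSupport().getOrCreate()

  # Original HiveQL, for reference:
  #   SELECT account_id, SUM(amount) AS total_amount
  #   FROM transactions WHERE txn_date >= '2020-01-01'
  #   GROUP BY account_id;
  transactions = spark.table("default.transactions")   # assumed Hive table

  totals = (transactions
            .filter(F.col("txn_date") >= "2020-01-01")
            .groupBy("account_id")
            .agg(F.sum("amount").alias("total_amount")))

  totals.write.mode("overwrite").saveAsTable("analytics.account_totals")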
Confidential, Greenwood Village, CO.
Hadoop Developer
Responsibilities:
- Designed and developed scalable and cost-effective architecture in AWS Big Data services for data life cycle of collection, ingestion, storage, processing, and visualization.
- Involved in creating End-to-End data pipeline within distributed environment using the Big data tools, Spark framework and Tableau for data visualization.
- Ensure that application continues to function normally through software maintenance and testing in production environment.
- Leveraged Spark features such as in-memory processing, distributed cache, broadcast variables, accumulators, and map-side joins to implement data preprocessing pipelines with minimal latency (see the illustrative sketch at the end of this section).
- Implemented real-time solutions for money movement and transactional data using Kafka, Spark Streaming, and HBase.
- The project also included a range of big data tools and programming languages such as Sqoop, Python, and Oozie.
- Worked on scheduling Oozie workflow engine to run multiple jobs.
- Experience creating a Python topology script to generate a CloudFormation template for creating the EMR cluster in AWS.
- Good knowledge of AWS services such as EC2, EMR, S3, Service Catalog, and CloudWatch.
- Experience using Spark SQL to handle structured data from Hive on the AWS EMR platform (m4.xlarge and m5.12xlarge clusters).
- Explored Spark and improved the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Experienced in handling large datasets during the ingestion process itself using partitions, Spark in-memory capabilities, broadcast variables, effective and efficient joins, and transformations.
- Experienced in optimizing Hive queries and joins to handle different data sets.
- Involved in creating Hive tables (managed and external tables), and loading and analyzing data using Hive queries.
- Actively involved in code reviews and bug fixing to improve performance.
- Good experience handling data manipulation using Python scripts.
- Involved in developing, building, testing, and deploying to the Hadoop cluster in distributed mode.
- Created a Splunk dashboard to capture logs for the end-to-end data ingestion process.
- Wrote unit test cases for PySpark code as part of the CI/CD process.
- Good knowledge of configuration management and CI/CD tools such as Bitbucket/GitHub and Bamboo.
Environment: Hadoop, Cloudera, Flume, HBase, HDFS, MapReduce, YARN, Hive, Pig, Sqoop, Oozie, agile methodologies, UNIX
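Illustrative sketch for the broadcast/map-side join item above: broadcasting a small dimension dataset to avoid shuffling the large side (dataset names, join key, and paths are assumptions).

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import broadcast

  spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

  events = spark.read.parquet("/data/events/")          # large fact dataset
  dim_product = spark.read.parquet("/data/products/")   # small dimension dataset

  # Broadcasting the small table turns this into a map-side join (no shuffle of events).
  enriched = events.join(broadcast(dim_product), on="product_id", how="left")
  enriched.write.mode("overwrite").parquet("/data/enriched_events/")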
Confidential, New York.
Hadoop Developer
Responsibilities:
- Developed data pipelines using Sqoop, Spark, MapReduce, and Hive to ingest, transform, and analyze customer behavioral data.
- Implemented Spark jobs using Python and Spark SQL for faster data processing and for real-time analysis algorithms in Spark.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for high data volumes.
- Used the Spark-Cassandra Connector to load data to and from Cassandra, and streamed data in real time using Spark with Kafka (see the illustrative sketch at the end of this section).
- Developed Kafka producers and consumers in Java, integrated them with Apache Storm, and ingested data into HDFS and HBase by implementing the rules in Storm.
- Developed efficient MapReduce programs in Python to run batch processes on huge unstructured datasets.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Python.
- Handled importing data from different sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and loading the transformed data back into HDFS.
- Exported the analyzed data to the relational databases using Sqoop, to further visualize and generate reports for the BI team.
- Analyzed the data by performing Hive queries (Hive QL) and running Pig scripts (Pig Latin) to study customer behavior.
- Created HBase tables and column families to store the user event data and wrote automated HBase test cases for data quality checks using HBase command line tools.
- Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
- Scheduled and executed workflows in Oozie to run Hive and Pig jobs and created UDF's to store specialized data structures in HBase and Cassandra.
- Developed a NiFi workflow to pick up multiple retail files from an FTP location and move them to HDFS on a daily basis.
- Worked with developer teams on a NiFi workflow to pick up data from a REST API server, the data lake, and an SFTP server and send it to the Kafka broker.
- Evaluated Hortonworks NiFi (HDF 2.0) and recommended a solution to ingest data from multiple sources into HDFS and Hive using NiFi, including importing data from Linux servers.
- Developed product profiles using Pig and commodity UDFs, and developed Hive scripts in HiveQL to de-normalize and aggregate the data.
- Optimized existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Tuned Spark/Python code to improve the performance of machine learning algorithms for data analysis.
- Performed data validation on the data ingested using MapReduce by building a custom model to filter all the invalid data and cleanse the data.
- Developed interactive shell scripts for scheduling various data cleansing and data loading process.
- Continuous monitoring and managing the Hadoop cluster using Cloudera Manager.
- Used Teradata and DBMS concepts for the early instance creation.
Environment: Hadoop, MapReduce, YARN, Spark, Hive, Pig, Kafka, HBase, Oozie, Sqoop, Python, Bash/Shell Scripting, Flume, Cassandra, Oracle, Core Java, Storm, HDFS, Unix, Teradata, NiFi, Eclipse
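Illustrative sketch for the Spark-with-Kafka streaming item above: a minimal Structured Streaming read from Kafka landing data in HDFS. Broker, topic, and paths are assumptions; the original work may have used the older DStream API, and the spark-sql-kafka connector package is required.

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import col

  spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

  # Read the Kafka topic as a streaming DataFrame and decode key/value as strings.
  stream = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker1:9092")
            .option("subscribe", "transactions")
            .load()
            .select(col("key").cast("string"), col("value").cast("string")))

  # Continuously land the stream as Parquet files with checkpointing.
  query = (stream.writeStream
           .format("parquet")
           .option("path", "/data/streams/transactions/")
           .option("checkpointLocation", "/data/checkpoints/transactions/")
           .start())
  query.awaitTermination()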
Confidential
SQL Developer
Responsibilities:
- Research and recommend suitable technology stack for Hadoop migration considering current enterprise architecture.
- Extensively used the Spark stack to develop preprocessing jobs that use the RDD, Dataset, and DataFrame APIs to transform data for upstream consumption.
- Developed real-time data processing applications using Scala and Python, and implemented Apache Spark Streaming from streaming sources such as Kafka, Flume, and JMS.
- Worked on extracting and enriching HBase data across multiple tables using joins in Spark.
- Worked on writing APIs to load the processed data to HBase tables.
- Replaced existing MapReduce programs with Spark applications written in Scala.
- Built on-premises data pipelines using Kafka and Spark Streaming, consuming the feed from the API streaming gateway REST service.
- Developed Hive UDFs to handle data quality and create filtered datasets for further processing (see the illustrative sketch at the end of this section).
- Experienced in writing Sqoop scripts to import data into Hive/HDFS from RDBMS.
- Good knowledge of the Kafka Streams API for data transformation.
- Developed Oozie workflows for scheduling and orchestrating the ETL process.
- Used the Talend tool to create workflows for processing data from multiple source systems.
- Created sample flows in Talend and StreamSets with custom-coded JARs and analyzed the performance of StreamSets and Kafka Streams.
- Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
- Optimized HiveQL/Pig scripts by using execution engines such as Tez and Spark.
- Developed Hive Queries to analyze the data in HDFS to identify issues and behavioral patterns.
- Involved in writing, developing, and testing optimized Pig Latin scripts.
- Deployed applications using the Jenkins framework integrated with Git version control.
- Participated in production support on a regular basis for the analytics platform.
- Used Rally for task/bug tracking.
- Used GIT for version control.
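Illustrative sketch for the data-quality UDF item above: the role used Hive UDFs, so this shows an equivalent rule registered as a PySpark UDF for use in Spark SQL (function, table, and column names are assumptions).

  import re
  from pyspark.sql import SparkSession
  from pyspark.sql.types import BooleanType

  spark = SparkSession.builder.appName("dq-udf").enableHiveSupport().getOrCreate()

  def is_valid_record(account_id, email):
      # Basic data-quality rule: non-empty id and a well-formed email address.
      if account_id is None or str(account_id).strip() == "":
          return False
      return bool(email) and re.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", email) is not None

  spark.udf.register("is_valid_record", is_valid_record, BooleanType())

  # Create a filtered dataset for further processing.
  clean = spark.sql("SELECT * FROM staging.customers WHERE is_valid_record(account_id, email)")
  clean.write.mode("overwrite").saveAsTable("curated.customers_clean")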