Azure Data Engineer Resume
Boston, MA
SUMMARY
- 7+ years of professional IT experience across all phases of the Software Development Life Cycle, including hands-on experience in Big Data analytics.
- Hands-on experience with Hadoop ecosystem tools such as HDFS, Hive, Apache Spark, Apache Sqoop, Flume, Oozie, Apache Kafka, Apache Storm, YARN, Impala, ZooKeeper, and Hue.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Azure Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience developing Spark applications with Spark SQL in Databricks for data extraction, transformation, and aggregation across multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the sketch after this list).
- Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.
- Good understanding of Hadoop and YARN architecture, along with the various Hadoop daemons such as JobTracker, TaskTracker, NameNode, DataNode, and Resource/Cluster Manager, and of Kafka as a distributed stream-processing platform.
- Experience in database design and development with Business Intelligence using SQL Server 2014/2016, Integration Services (SSIS), DTS packages, SQL Server Analysis Services (SSAS), DAX, OLAP cubes, and star and snowflake schemas.
- Strong skills in visualization tools: Power BI and Excel (formulas, pivot tables, charts) and DAX commands.
- Experience in analyzing data using HiveQL and MapReduce programs.
- Experienced in ingesting data into HDFS from relational databases such as MySQL, Oracle, DB2, Teradata, and PostgreSQL using Sqoop.
- Experienced in importing real-time streaming logs and aggregating the data into HDFS using Kafka and Flume.
- Well versed in Hadoop distributions including Cloudera (CDH), Hortonworks (HDP), and Azure HDInsight.
- Extended Hive and Pig core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregate Functions (UDAFs).
- Experience working on NoSQL Databases like HBase, Cassandra and MongoDB.
- Experience in Python, Scala, shell scripting, and Spark.
- Experience testing MapReduce programs using MRUnit, JUnit, and EasyMock.
- Experience with ETL methodology supporting data extraction, transformation, and loading using Hadoop.
- Worked with data visualization tools such as Tableau and integrated data using the ETL tool Talend.
- Hands-on development experience with Java, shell scripting, and RDBMSs, including writing complex SQL queries, PL/SQL, views, stored procedures, and triggers.
- Passionate about working on the most cutting-edge Big Data technologies.
- Willing to update my knowledge and learn new skills according to business requirements.
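A minimal sketch of the Databricks-style Spark SQL work described in the Spark applications bullet above; the paths, file layouts, and column names are hypothetical, not taken from the actual projects:

```python
from pyspark.sql import SparkSession, functions as F

# In Databricks a SparkSession is provided as `spark`; building one here keeps the sketch self-contained.
spark = SparkSession.builder.appName("usage-patterns").getOrCreate()

# Hypothetical sources in two different file formats.
events_json = spark.read.json("/mnt/raw/events_json/")
events_parquet = spark.read.parquet("/mnt/raw/events_parquet/")

# Combine the formats into one DataFrame and aggregate usage per customer per day.
events = events_json.unionByName(events_parquet)

usage_by_customer = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .groupBy("customer_id", "event_date")
    .agg(
        F.count("*").alias("event_count"),
        F.countDistinct("session_id").alias("sessions"),
    )
)

# Expose the aggregate to Spark SQL for further analysis.
usage_by_customer.createOrReplaceTempView("customer_usage")
spark.sql("SELECT * FROM customer_usage ORDER BY event_count DESC LIMIT 20").show()
```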
TECHNICAL SKILLS
Hadoop/Big Data: HDFS, MapReduce, Spark, YARN, Kafka, Apache NiFi, Pig, Hive, Sqoop, Storm, Flume, Oozie, Impala, HBase, Hue, ZooKeeper.
Programming Languages: Java, PL/SQL, Python, HiveQL, Scala, SQL, Azure PowerShell.
Development Tools: Eclipse, SVN, Git, Ant, Maven, SOAP UI
Databases: Oracle 11g/10g/9i, Teradata, MS SQL
NoSQL Databases: Apache HBase, MongoDB, Cassandra
Distributed Platforms: Hortonworks, Cloudera, Azure HDInsight
Operating Systems: UNIX, Ubuntu Linux and Windows 2000/XP/Vista/7/8
Other Technologies: Azure Data Lake, Azure Data Factory, Azure Databricks, Azure SQL Database, Azure SQL Data Warehouse
PROFESSIONAL EXPERIENCE
Confidential, Boston, MA
Azure Data Engineer
Responsibilities:
- Analyze, design, and build modern data solutions using Azure PaaS services to support data visualization; understand the current production state of the application and determine the impact of new implementations on existing business processes.
- Extract, transform, and load data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics); ingest data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure SQL Data Warehouse) and process the data in Azure Databricks.
- Created pipelines in ADF using linked services, datasets, and pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, and to write data back to the sources.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation across multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Experienced in performance tuning of Spark applications by setting the right batch interval, choosing the correct level of parallelism, and tuning memory.
- Wrote UDFs in Scala and PySpark to meet specific business requirements (see the sketch after this list).
- Developed JSON definitions for deploying pipelines in Azure Data Factory (ADF) that process the data using the SQL activity.
- Implemented OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
- Hands-on experience developing SQL scripts for automation.
- Created Build and Release for multiple projects (modules) in production environment using Visual Studio Team Services (VSTS).
- Wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS/Blob storage.
- Worked extensively with dimensional modeling, data migration, data cleansing, and ETL processes for data warehouses.
- Worked in an Agile methodology and used JIRA to maintain project stories.
- Involved in requirements gathering, design, development, and testing.
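A minimal sketch of the kind of PySpark UDF referenced above; the column names and business rule are hypothetical, not taken from the actual project:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

# Hypothetical business rule: bucket order amounts into reporting tiers.
def amount_tier(amount):
    if amount is None:
        return "unknown"
    if amount >= 10000:
        return "enterprise"
    if amount >= 1000:
        return "mid"
    return "small"

amount_tier_udf = F.udf(amount_tier, StringType())

# Illustrative input; in the actual pipeline the data would come from Azure Data Lake / Blob storage.
orders = spark.createDataFrame(
    [(1, 250.0), (2, 4200.0), (3, 18000.0), (4, None)],
    ["order_id", "amount"],
)

orders.withColumn("tier", amount_tier_udf("amount")).show()
```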
Environment: Hadoop, Azure Data Factory, Azure Data Lake, Azure Storage, Azure SQL, Azure SQL Data Warehouse, Azure Databricks, Azure PowerShell, MapReduce, Hive, Spark, Python, YARN, Tableau, Kafka, Sqoop, Scala, HBase.
Confidential, Westport, CT
Big Data Engineer
Responsibilities:
- Worked on analyzing the Hadoop cluster and different big data analytics and processing tools, including Sqoop, Hive, Spark, Kafka, and PySpark.
- Worked on the MapR platform team, performance-tuning Hive and Spark jobs for all users.
- Used the Hive Tez execution engine to increase application performance.
- Worked on incidents created by users for the platform team on Hive and Spark issues, monitoring Hive and Spark logs and fixing the issues or raising MapR support cases.
- Analyzed large data sets to determine the optimal way to aggregate and report on them.
- Tested cluster performance using the cassandra-stress tool to measure and improve read/write throughput.
- Worked on a Hadoop data lake, ingesting data from different sources such as Oracle and Teradata through the Infoworks ingestion tool.
- Worked with Arcadia Data to create analytical views on top of tables, so that reporting is unaffected by table locks while a batch is loading, since reports point to the Arcadia view.
- Worked on a Python API for converting group-level permissions to table-level permissions using MapR ACEs, creating a unique role and assigning it through the EDNA UI.
- Queried and analyzed data from Cassandra for quick searching, sorting, and grouping through CQL.
- Migrated various Hive UDFs and queries into Spark SQL for faster execution (see the sketch after this list).
- Configured Kafka Connect to receive real-time data from Apache Kafka and store the streamed data in HDFS.
- Hands-on experience with Spark in Scala and Python, creating RDDs and applying transformations and actions.
- Extensively performed complex data transformations in Spark using Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Scala.
- Used PySpark and Scala to process data.
- Used Bitbucket and Git repositories.
- Used text, Avro, ORC, and Parquet file formats for Hive tables.
- Experienced in scheduling jobs using crontab.
- Used Sqoop to import data from Oracle and Teradata into Hadoop.
- Created master job sequences for integration (ETL control) logic to capture job success, failure, error, and audit information for reporting.
- Used the TES scheduler engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Spark, Kafka, and Sqoop.
- Experienced in creating recursive and replicated joins in Hive.
- Experienced in developing scripts for transformations using Scala.
- Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration.
- Experienced in creating shell scripts to automate jobs.
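A minimal sketch of migrating a HiveQL aggregation to Spark SQL, as referenced above, shown in PySpark with illustrative table and column names (the actual migration also involved Scala):

```python
from pyspark.sql import SparkSession, functions as F

# enableHiveSupport() lets Spark read tables registered in the existing Hive metastore.
spark = (SparkSession.builder
         .appName("hive-to-sparksql")
         .enableHiveSupport()
         .getOrCreate())

# The original HiveQL (illustrative table/column names), now executed by Spark SQL:
daily_totals_sql = spark.sql("""
    SELECT txn_date, SUM(amount) AS total_amount
    FROM sales.transactions
    GROUP BY txn_date
""")

# Equivalent DataFrame API version of the same aggregation.
daily_totals_df = (spark.table("sales.transactions")
                   .groupBy("txn_date")
                   .agg(F.sum("amount").alias("total_amount")))

daily_totals_df.write.mode("overwrite").saveAsTable("sales.daily_totals")
```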
Environment: HDFS, Hadoop, Python, Hive, Sqoop, Flume, Spark, MapReduce, Scala, Oozie, YARN, Tableau, Spark SQL, Spark MLlib, Impala, Nagios, UNIX Shell Scripting, ZooKeeper, Kafka, Agile Methodology, SBT.
Confidential, Eden Prairie, MN
Hadoop Engineer
Responsibilities:
- Involved in importing real-time data into Hadoop using Kafka and implemented Oozie jobs for daily data loads.
- Loaded the data from Teradata to HDFS using Teradata Hadoop connectors.
- Imported data from different sources such as HDFS and HBase into Spark RDDs.
- Developed Spark scripts using the Python (PySpark) shell as per requirements.
- Issued SQL queries via Impala to process the data stored in HDFS and HBase.
- Used the Spark - Cassandra Connector to load data to and from Cassandra.
- Used a RESTful web services API to connect to MapR tables; the connection to the database was developed through this API.
- Involved in developing Hive DDL to create, alter, and drop Hive tables, and worked with Storm and Kafka.
- Worked with application teams to install operating system updates, Hadoop updates, patches, and version upgrades as required.
- Experience in data migration from RDBMS to Cassandra; created data models for customer data using the Cassandra Query Language.
- Responsible for building scalable, distributed data solutions in a Hadoop cluster environment with the Hortonworks distribution.
- Involved in developing Spark scripts for data analysis in both Python and Scala. Designed and developed various modules of the application with J2EE design architecture.
- Implemented modules using core Java APIs and Java collections, and integrated the modules.
- Experienced in transferring data from different data sources into HDFS using Kafka producers, consumers, and brokers.
- Installed Kibana using Salt scripts and built custom dashboards to visualize aspects of important data stored in Elasticsearch.
- Used File System Check (fsck) to check the health of files in HDFS and used Sqoop to import data from SQL Server to Cassandra.
- Streamed transactional data to Cassandra using Spark Streaming/Kafka (see the sketch after this list).
- Implemented a distributed messaging queue integrated with Cassandra using Apache Kafka and ZooKeeper.
- Wrote ConfigMap and DaemonSet manifests to install Filebeat on Kubernetes pods and send log files to Logstash or Elasticsearch to monitor different types of logs in Kibana.
- Created a database in InfluxDB, worked on the interface created for Kafka, and checked the measurements in the databases.
- Installed Kafka Manager for tracking consumer lag and monitoring Kafka metrics; also used it for adding topics, partitions, etc.
- Generated consumer group lags from Kafka using its API.
- Ran log aggregation, website activity tracking, and commit logs for distributed systems using Apache Kafka.
- Involved in creating Hive tables and loading and analyzing data using Hive queries.
- Developed multiple MapReduce jobs in Java for data cleaning and pre-processing; loaded data from different sources (databases and files) into Hive using Talend; used Oozie and ZooKeeper operational services for coordinating the cluster and scheduling workflows.
- Implemented Flume, Spark, and Spark Streaming framework for real time data processing.
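A minimal sketch of streaming transactional data from Kafka into Cassandra, as referenced above. It uses Structured Streaming with foreachBatch and the DataStax Spark Cassandra connector; the broker address, topic, keyspace, and schema are hypothetical, and the original work may have used DStream-based Spark Streaming instead:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Assumes the Kafka source and Cassandra connector packages are on the classpath
# (e.g. spark-sql-kafka and spark-cassandra-connector).
spark = SparkSession.builder.appName("txn-stream").getOrCreate()

# Hypothetical schema of the JSON messages on the topic.
schema = StructType([
    StructField("txn_id", StringType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("txn_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "transactions")
       .load())

# Kafka values arrive as bytes; cast to string and parse the JSON payload.
txns = (raw.selectExpr("CAST(value AS STRING) AS json")
        .select(F.from_json("json", schema).alias("t"))
        .select("t.*"))

def write_to_cassandra(batch_df, batch_id):
    # Write each micro-batch to Cassandra via the connector.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="payments", table="transactions")
     .mode("append")
     .save())

query = (txns.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/tmp/checkpoints/txn-stream")
         .start())
query.awaitTermination()
```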
Environment: Hadoop, Python, HDFS, Hive, Scala, MapReduce, Agile, Cassandra, Kafka, Storm, AWS, YARN, Spark, ETL, Teradata, NoSQL, Oozie, Java, Talend, LINUX, Kibana, HBase
Confidential
Hadoop Developer
Responsibilities:
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs for data cleaning and preprocessing.
- Involved in creating Hive tables, writing complex Hive queries to populate Hive tables.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Used Hive to analyze partitioned and bucketed data and compute various metrics for dashboard reporting (see the sketch after this list).
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Worked on Spark RDD transformations to map business analysis and applied actions on top of the transformations.
- Experienced in working with the Spark ecosystem using Spark SQL and Scala on different formats such as text, Avro, and Parquet files.
- Optimized HiveQL scripts by using the Tez execution engine.
- Wrote complex Hive queries to extract data from heterogeneous sources (data lake) and persist the data into HDFS.
- Implemented Sqoop for large dataset transfers between Hadoop and RDBMSs.
- Used different file formats such as text, Avro, Parquet, and ORC.
- Worked with different file formats such as text and Parquet for Hive querying and processing based on business logic.
- Used JIRA for creating user stories and created branches in the Bitbucket repositories based on the stories.
- Knowledge of creating various repositories and version control using Git.
- Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
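A minimal sketch of the partitioned/bucketed Hive analysis referenced above, run through Spark SQL with Hive support; the web.page_views table, its columns, and the partition value are hypothetical:

```python
from pyspark.sql import SparkSession

# web.page_views is assumed to be an existing Hive table partitioned by view_date
# and bucketed by user_id.
spark = (SparkSession.builder
         .appName("hive-metrics")
         .enableHiveSupport()
         .getOrCreate())

# Filtering on the partition column lets Hive/Spark prune partitions so only
# the requested day is scanned before computing the dashboard metrics.
daily_metrics = spark.sql("""
    SELECT page,
           COUNT(DISTINCT user_id) AS unique_users,
           SUM(visits)             AS total_visits
    FROM web.page_views
    WHERE view_date = '2018-06-01'
    GROUP BY page
""")

daily_metrics.show()
```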
Environment: Spark, Scala, Python, Hadoop, MapReduce, CDH, Cloudera Manager, Control M Scheduler, Shell Scripting, Agile Methodology, JIRA, Git, Tableau.