Azure Data Engineer Resume
Malvern, Pennsylvania
SUMMARY
- Around 5 years of experience in the IT industry, with active work in Big Data technologies including Hadoop and Spark. Driven by curiosity, innovative thinking and the pleasure of learning, developed performance-focused solutions by leveraging emerging technologies and methods. Strong believer in collaboration, teamwork, integrity and the interest of the client.
- IT experience across the Software Development Life Cycle (Analysis, Design, Development, Testing, Deployment and Support) using Waterfall and Agile methodologies.
- Experience in Big Data technologies using Hadoop ecosystem components (Spark, HDFS, MapReduce, Sqoop, Hive) in the retail, healthcare and financial sectors.
- Experienced in working with Hadoop distributions, predominantly Amazon EMR and Cloudera (CDH), with working knowledge of Hortonworks (HDP).
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Good understanding of Spark Architecture including Spark Core, Spark SQL, Data Frames, Spark Streaming, Driver Node, Worker Node, Stages, Executors and Tasks.
- Experienced in working with cloud services such as EMR, S3, EC2, Athena.
- Experience in ETL jobs and developing and managing data pipelines.
- Working knowledge in creating ETL jobs to load huge volumes of data into Hadoop Ecosystem and relational databases.
- Imported Avro files using Apache Kafka and performed analytics using Spark with Scala.
- Experience developing Spark applications using Spark SQL in Databricks for data extraction, transformation and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Experience in running Hive scripts, Unix and Linux shell scripting.
- Created User Defined Functions (UDFs) in Hive.
- Experience using Splunk to monitor log files, databases, web services and other types of monitoring endpoints.
- Comfortable implementing Spark jobs end to end with both the PySpark API and the Spark Scala API.
- Designed Hive queries to perform data analysis, data transfer and table design to load into Hadoop environment.
- Used the Spark-Cassandra Connector to load data to and from Cassandra. Streamed data in real time using Spark with Kafka.
- Developed a data pipeline using Kafka, Spark and Hive to ingest, transform and analyze data.
- Proficient in importing / exporting data from RDBMS to HDFS using Sqoop and TDCH.
- Experience in using Airflow, Oozie schedulers and Unix scripting to implement cron jobs that execute different Hadoop actions.
- Experience in using DB2, Mainframe, SQL Server, MySQL and Teradata.
- Experience with Databricks.
- Familiar with Data Extraction tools and ETL tools like Informatica, Talend, Pentaho.
- Experience in using Splunk for monitoring all the transformations.
- Experience in using HUE for analyzing the data.
- Experience in using Control-M to schedule the jobs.
- Experience with Splunk, network security, system security, and supporting Security Information and Event Management (SIEM).
- Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Hands-on experience with ORC and Parquet file formats.
- Proficient in working with Jira, Bitbucket, Git, Bamboo and Jenkins.
- Knowledge of the NoSQL databases HBase and Cassandra.
- Good analytical, communication and problem-solving skills; enjoy learning new technical and functional skills.
- Integrated Kafka with Flume to send data to the Spark Streaming context and HDFS.
TECHNICAL SKILLS
Big Data Ecosystem: Hadoop, HDFS, MapReduce, Hive, YARN, Apache Spark, Airflow, Oozie, Zookeeper, HUE, Sqoop, TDCH, Splunk, Control-M.
NoSQL Databases: HBase, Cassandra
Hadoop Distributions: Amazon EMR (AWS), Cloudera, Hortonworks
Programming Languages: Python, Scala, HiveQL
Scripting Languages: Shell Scripting, JavaScript
Databases: DB2, Teradata, Snowflake, MySQL
IDE: Eclipse, PyCharm, IntelliJ
BI Tools: Tableau, Talend, Power BI
Version control tools: Bitbucket, Git, GitHub
PROFESSIONAL EXPERIENCE
Confidential, Malvern, Pennsylvania
Azure Data Engineer
Responsibilities:
- Ingested the data from various data sources like DB2 into Hive using Sqoop scripts.
- Loaded data from Hive to AWS S3 and Snowflake using the Spark API.
- Developed ETL frameworks for data using PySpark.
- Tested data daily between the ETL output and the Hive tables using a comparison tool implemented in PySpark.
- Extracted, transformed and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL and U-SQL (Azure Data Lake Analytics). Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Developed Spark applications using Scala and Java, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Implemented Event Sourcing using Akka.
- Worked on a NiFi data pipeline to process large sets of data and configured lookups for data validation and integrity.
- Developed Spark applications using PySpark and Spark SQL for data extraction, transformation and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Migrated services from on-premises to Azure cloud environments. Collaborated with development and QA teams to maintain high-quality deployments.
- Configured Azure Active Directory and managed users and groups
- Fine-tuned Spark applications and jobs to improve the efficiency and overall processing time of the pipelines.
- Deployed Azure Resource Manager JSON templates from PowerShell; worked on the Azure suite: Azure SQL Database, Azure Data Lake, Azure Data Factory, Azure SQL Data Warehouse, Azure Analysis Services.
- Experienced in writing Spark applications in Scala and Python (PySpark).
- Imported Avro files using Apache Kafka and performed analytics using Spark with Scala.
- Developed Spark code using Scala and Spark SQL for faster processing and testing.
- Created custom applications within Splunk.
- Used broadcast variables, efficient joins, transformations and other Spark capabilities for data processing.
- Analyzed the existing Informatica ETL environment, translated it into requirements for the new environment and planned tasks to meet those requirements.
- Wrote transformations and actions on DataFrames in Databricks, and used Spark SQL on DataFrames to read Hive tables into Spark for faster data processing.
- Used Spark-SQL to perform event enrichment and to prepare various levels of user behavioral summaries.
- Provided solutions for data warehousing and Informatica ETL processes to support data integration and reporting requirements.
- Worked with EMR, S3 and EC2 services in the AWS cloud.
- Implemented the workflows using Oozie scheduler and Control-M to automate tasks.
- Forked jobs using Oozie in order to execute them in parallel.
- Integrated code changes into the Bitbucket repository and built them using the Bamboo CI/CD pipeline.
- Wrote shell scripts to display all the transformation records in a Splunk dashboard.
- Experienced working with Spark Core and Spark SQL using Python.
- Implemented the Proof of Concept (POC) for ETL Ab Initio graph concepts that needed to be migrated into Spark using Scala and Python (PySpark).
- Interacted with the infrastructure, network, database, application and BA teams to ensure data quality and availability.
- Worked with the DevOps team to cluster the NiFi pipeline on EC2 nodes, integrated with Spark, Kafka and Postgres running on other instances, using SSL handshakes in QA and production environments.
Environment: AWS EMR, Spark, Python, S3, Yarn, Oozie, Snowflake, ETL, DB2, Hive, PySpark, Unix, Sqoop, Control-M, Splunk.
Confidential, San Mateo, California
Hadoop Developer
Responsibilities:
- Developed TDCH scripts for importing and exporting data into S3 and Hive.
- Developed design documents that considered all possible approaches and identified the best of them.
- Managed data coming from different sources.
- Worked on Azure Active Directory.
- Migrated data from traditional database systems to Azure databases.
- Implemented the Proof of Concept (POC) for ETL Ab Initio graph concepts that needed to be migrated into Spark using Scala and Python (PySpark).
- Experience in using the EMR cluster and various EC2 instance types based on requirements.
- Loaded data from UNIX file systems to HDFS. Installed and configured Hive and wrote Hive UDFs.
- Configured Azure Active Directory and managed users and groups.
- Involved in creating Hive Tables, loading with data and writing Hive queries.
- Used Bucketing and Dynamic Partitioning on Hive tables.
- Imported data from different sources such as HDFS and Hive into Spark RDDs. Wrote transformations and actions on DataFrames in Databricks, and used Spark SQL on DataFrames to read Hive tables into Spark for faster data processing.
- Developed Spark SQL scripts using PySpark to perform transformations and actions on RDDs in Spark for faster data processing.
- Provided solutions for data warehousing and Informatica ETL processes to support data integration and reporting requirements.
- Developed a data pipeline using Kafka, Spark Streaming and Hive to ingest the data from data lakes to Hadoop distributed file system.
- Worked on the CI/CD pipeline, integrating code changes into the Git repository and building with Jenkins.
- Worked with the DevOps team to cluster the NiFi pipeline on EC2 nodes, integrated with Spark, Kafka and Postgres running on other instances, using SSL handshakes in QA and production environments.
- Experienced in working with AWS Athena Serverless Query Services.
- Experience with cloud warehouse tools like Snowflake.
- Experienced in working with Spark Core and Spark SQL using Scala.
- Performed data transformations and analytics on large datasets using Spark.
- Implemented the workflows using Airflow scheduler to automate tasks.
Environment: Hadoop, AWS EMR, S3, Hive, Spark, Airflow, Teradata, Yarn, Unix, TDCH, Python, PySpark, Scala.
Confidential
Software Engineer
Responsibilities:
- Performed data analysis and generated reports that helped improve product quality and decision-making.
- Involved in development of Hadoop System and improving multi-node Hadoop Cluster performance.
- Imported data from relational databases such as MySQL to HDFS and exported data from HDFS to relational databases using Sqoop.
- Worked with Big Data frameworks and tools such as Hadoop, Spark and Hive.
- Used the Spark-Cassandra Connector to load data to and from Cassandra. Streamed data in real time using Spark with Kafka.
- Troubleshot issues and failed jobs in the Hadoop cluster.
- Implemented the workflows using the Apache Oozie framework to automate tasks.
- Communicated with clients to gather requirements.
- Performed SQL querying and performance tuning, and created backup tables.
- Developed Web services component using XML, WSDL and SOAP with DOM parser to transfer and transform data between applications.
- Exposed various capabilities as Web Services using SOAP/WSDL.
- Used SoapUI to test the web services by sending REST and SOAP requests.
- Used AJAX framework for server communication and seamless user experience.
- Mentored and trained new recruits.
Environment: Hadoop, Hive, Sqoop, Spark, Oozie, Cloudera Manager, Tableau
Confidential
Java Developer
Responsibilities:
- Developed the J2EE application based on the Service Oriented Architecture by employing SOAP and other tools for data exchanges and updates.
- Developed the functionalities using Agile Methodology.
- Used Apache Maven for project management and building the application.
- Worked in all the modules of the application which involved front-end presentation logic developed using Spring MVC, JSP, JSTL and JavaScript, Business objects developed using POJOs and data access layer using Hibernate framework.
- Used JAX-RS (REST) for producing web services and involved in writing programs to consume the web services with Apache CXF framework.
- Used Restful API and SOAP web services for internal and external consumption.
- Used Spring ORM module for integration with Hibernate for persistence layer.
- Involved in writing Hibernate Query Language (HQL) for persistence layer.
- Used Spring MVC, Spring AOP, Spring IOC, Spring Transaction and Oracle to create Club Systems Component.
- Wrote backend jobs based on Core Java and Oracle Database to be run daily or weekly.
- Coded the core modules of the application in compliance with Java/J2EE coding standards and design patterns.
- Wrote JavaScript, HTML, CSS, Servlets and JSP for designing the GUI of the application.
- Worked on server-side and middle-tier technologies, including caching strategies and solutions.
- Designed the data access layer using J2EE data access patterns. Implemented the MVC architecture with the Struts framework to handle databases across multiple locations and display information in the presentation layer.
- Used XPath for parsing the XML elements as part of business logic processing.
Environment: Java, Struts 1.2, Hibernate 3.0, JSP, JavaScript, HTML, XML, Oracle, Eclipse, JBoss Application Server, ANT, CVS, and SQL.
