Azure Data Engineer Resume
New Jersey
SUMMARY
- Over 8 years of experience in IT with exceptional expertise in the Big Data/Hadoop ecosystem and data analytics techniques.
- Hands-on experience working with the Big Data/Hadoop ecosystem, including Apache Spark, MapReduce, Spark Streaming, PySpark, Hive, HDFS, Kafka, Sqoop, and Oozie.
- Proficient in Python scripting, including statistical functions with NumPy, visualization with Matplotlib, and data organization with Pandas (see the illustrative sketch at the end of this summary).
- Experience with different Hadoop distributions such as Cloudera and Hortonworks Data Platform (HDP).
- In-depth understanding of Hadoop architecture, including YARN and components such as HDFS, Resource Manager, Node Manager, NameNode, and DataNode.
- Strong experience in core Java, Scala, SQL, PL/SQL, and RESTful web services.
- Hands-on experience importing and exporting data between RDBMS and HDFS using Sqoop.
- Experience working with the Hive data warehouse tool: creating tables, distributing data through static and dynamic partitioning and bucketing, and applying Hive optimization techniques.
- Experience working with NoSQL databases, including Cassandra, MongoDB, and HBase.
- Experience in tuning and debugging Spark applications and applying Spark optimization techniques.
- Experience in building PySpark and Spark-Scala applications for interactive analysis, batch processing, and stream processing.
- Hands-on experience in creating real-time data streaming solutions using Apache Spark Core, Spark SQL, and DataFrames.
- Extensive knowledge in implementing, configuring, and maintaining Amazon Web Services (AWS) such as EC2, S3, Redshift, Glue, and Athena.
- Experience migrating on-premises infrastructure to cloud platforms such as AWS, Azure, GCP, OpenStack, and Pivotal Cloud Foundry (PCF); involved in virtualization using VMware, VMware ESX, and Xen, and in infrastructure orchestration using containerization technologies such as Docker and Kubernetes.
- Ingested data into Azure services, processed it in Azure Databricks, and integrated Azure Databricks with Power BI to create dashboards.
- Experienced in data manipulation using Python and Python libraries such as Pandas, NumPy, SciPy, and Scikit-Learn for data analysis, numerical computations, and machine learning.
- Experience in writing stored procedures and complex SQL queries for relational databases such as Oracle, SQL Server, and MySQL.
- Handled Site Reliability Engineering responsibilities for a Kafka platform that scales to 2 GB/sec and 20 million messages/sec.
- Designed and implemented topic configuration for a new Kafka cluster across all environments.
- Experience in writing SQL queries, data integration, and performance tuning.
- Developed various shell scripts and Python scripts to automate Spark jobs and Hive scripts.
- Actively involved in all phases of the data science project life cycle, including data collection, data pre-processing, exploratory data analysis, feature engineering, feature selection, and building machine learning model pipelines.
- Knowledge of orchestration using Apache NiFi.
- Integrated with the UI layer using HTML, Ajax, and JavaScript.
- Hands-on experience using visualization tools such as Tableau and Power BI.
- Successfully secured the Kafka cluster with Kerberos.
- Experience working with Git and Bitbucket version control systems.
- Extensive experience working in Test-Driven Development and Agile/Scrum environments.
- Involved in daily Scrum meetings to discuss development progress and active in making Scrum meetings more productive.
- Excellent communication, interpersonal, and problem-solving skills; a team player with the ability to quickly adapt to new environments and technologies.
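Illustrative sketch of the Pandas/NumPy/Matplotlib workflow referenced above; the input file (sales.csv) and column names (amount, month) are hypothetical placeholders, not project artifacts.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the (hypothetical) dataset and derive a simple statistic with NumPy.
df = pd.read_csv("sales.csv")
df["amount_zscore"] = (df["amount"] - df["amount"].mean()) / np.std(df["amount"])

# Organize the data with Pandas and visualize it with Matplotlib.
monthly = df.groupby("month")["amount"].sum()
monthly.plot(kind="bar", title="Monthly sales")
plt.tight_layout()
plt.savefig("monthly_sales.png")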
TECHNICAL SKILLS
Big Data Technologies: Spark, Spark SQL, Spark Streaming, Hive, Impala, Hue
Hadoop Ecosystem: Hadoop, MapReduce, YARN, HDFS, Pig, Oozie, Zookeeper
Cloud Services: Azure Data Lake Storage Gen2, Azure Data Factory, Blob Storage, Azure SQL DB, Databricks, Azure Event Hubs, AWS RDS, Amazon SQS, Amazon S3, AWS EMR, Lambda, AWS SNS, BigQuery, Dataproc, Dataflow
Databases: MySQL, SQL Server, Oracle, MS Access, Teradata, and Snowflake
Data Ingestion: Sqoop, Flume, NiFi, Kafka.
Programming Languages: Python, PL/SQL, SQL, Scala, C, C#, C++, T-SQL, PowerShell scripting, JavaScript
NoSQL Databases: MongoDB, Cassandra, HBase
Visualization & ETL Tools: Tableau, Power BI, Informatica, Talend, SSIS, and SSRS
Development Strategies: Agile, Lean Agile, Pair Programming, Waterfall, and Test-Driven Development.
CI/CD & Version Control Tools: Jenkins, Git, and SVN
Operating Systems: Unix, Linux, Windows, macOS
Workflow Orchestration & Monitoring: Apache Airflow
PROFESSIONAL EXPERIENCE
Confidential, New Jersey
Azure Data Engineer
Responsibilities:
- Designed and developed data models, data structures, and ETL jobs for data acquisition and manipulation.
- Developed a deep understanding of the data sources, implemented data standards, maintained data quality, and supported master data management.
- Developed JSON scripts for deploying data-processing pipelines in Azure Data Factory (ADF).
- Used Databricks with Azure Data Factory (ADF) to compute large volumes of data.
- Performed ETL operations in Azure Databricks by connecting to different relational database source systems using JDBC connectors (see the illustrative sketch at the end of this section).
- Developed Python scripts to perform file validations in Databricks and automated the process using ADF.
- Developed an automated process in the Azure cloud that ingests data daily from a web service and loads it into Azure SQL DB.
- Used a Spark cluster and Cloud Dataflow on GCP to compare the efficiency of a POC against the developed pipeline.
- Developed streaming pipelines using Azure Event Hubs and Stream Analytics to analyze dealer efficiency and open-table counts from data coming in from IoT-enabled poker and other pit tables.
- Analyzed data where it lives by mounting Azure Data Lake and Blob storage to Databricks.
- Used Logic Apps to take decision-based actions within the workflow.
- Developed custom alerts using Azure Data Factory, SQL DB, and Logic Apps.
- Developed Databricks ETL pipelines using notebooks, Spark DataFrames, Spark SQL, and Python scripting.
- Monitored end-to-end integration using Azure Monitor.
- Implemented data movements from on-premises systems to the cloud in Azure.
- Developed batch processing solutions using Data Factory and Azure Databricks.
- Implemented Azure Databricks clusters, notebooks, jobs, and autoscaling.
- Designed for data auditing and data masking.
- Created snowflake schemas in the data warehouse for logical arrangement of the data.
- Designed for data encryption at rest and in transit.
- Designed relational and non-relational data stores on Azure.
- Worked on and analyzed data with Adobe Analytics in cloud computing.
- Used Python and shell scripts to automate Teradata ELT and admin activities.
- Performed application-level DBA activities: creating tables and indexes, and monitoring and tuning Teradata BTEQ scripts using the Teradata Visual Explain utility.
- Designed and implemented database solutions in Azure SQL Data Warehouse and Azure SQL.
- Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Designed, set up, maintained, and administered Azure SQL Database, Azure Analysis Services, Azure SQL Data Warehouse, and Azure Data Factory.
- Performed ETL operations using SSIS and loaded the data into Secure DB.
- Hands-on experience with Data Vault concepts and data models; well versed in understanding and implementing data warehousing and Data Vault concepts.
- Designed, reviewed, and created primary objects such as views and indexes based on logical design models, user requirements, and physical constraints.
- Worked with stored procedures to produce data set results for use in Reporting Services, reducing report complexity and optimizing run time.
- Exported reports into various formats (PDF, Excel) and resolved formatting issues.
- Designed packages to extract data from SQL DB and flat files and load it into an Oracle database.
- Performed performance tuning, monitoring, UNIX shell scripting, and physical and logical database design.
- Developed UNIX scripts to automate different tasks involved as part of loading process.
- Worked with Tableau for reporting needs.
- Created Tableau dashboard reports and heat map charts, and supported numerous dashboards, pie charts, and heat map charts built on the Teradata database.
- Implemented Copy activities and custom Azure Data Factory pipeline activities.
- Collaborated with application architects on moving applications from Infrastructure as a Service (IaaS) to Platform as a Service (PaaS).
Environment: Azure Data Factory, Tableau, shell scripting, Teradata, Python scripting, Azure Databricks, Azure Data Lake Storage, Blob Storage, Azure SQL Database, Snowflake, Azure Synapse Analytics, Azure Synapse workspace, ELK Stack, Microsoft Azure, Azure Function Apps, Azure Data Lake, SQL Server.
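Illustrative Databricks notebook sketch of the JDBC-based ETL described in this section; it assumes the notebook-provided spark and dbutils objects, and the server, table, secret scope, and mount point names are hypothetical placeholders.

from pyspark.sql import functions as F

# Hypothetical Azure SQL connection string; credentials come from a Key Vault-backed secret scope.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"

customers = (spark.read.format("jdbc")
             .option("url", jdbc_url)
             .option("dbtable", "dbo.customers")
             .option("user", dbutils.secrets.get("kv-scope", "sql-user"))
             .option("password", dbutils.secrets.get("kv-scope", "sql-pass"))
             .load())

# Light transformation before loading to the lake.
cleaned = (customers
           .dropDuplicates(["customer_id"])
           .withColumn("load_date", F.current_date()))

# Write to an Azure Data Lake Storage path mounted on the workspace (placeholder mount point).
cleaned.write.mode("overwrite").parquet("/mnt/datalake/curated/customers/")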
Confidential, California
Big Data Engineer
Responsibilities:
- Developed Spark applications using Python and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, Pair RDDs, and Spark on YARN.
- Performed tuning of Spark applications to set the batch interval time, the correct level of parallelism, and memory settings.
- Used Spark Streaming APIs to perform on-the-fly transformations and actions for building a common learner data model that receives data from Kafka in real time and persists it to Cassandra (see the illustrative sketch at the end of this section).
- Scheduled Spark/Scala jobs using Oozie workflows in the Hadoop cluster and generated detailed design documentation for the source-to-target transformations.
- Developed Kafka consumer APIs in Python for consuming data from Kafka topics.
- Used Kafka to consume XML messages and Spark Streaming to process the XML files to capture UI updates.
- Valuable experience in the practical implementation of cloud-specific technologies, including IAM and Amazon cloud services such as Elastic Compute Cloud (EC2), ElastiCache, Simple Storage Service (S3), CloudFormation, Virtual Private Cloud (VPC), Route 53, Lambda, Glue, and EMR.
- Migrated an existing on-premises application to AWS and used AWS services like EC2 and S3 for small data sets processing and storage.
- Loaded data into S3 buckets using AWS Lambda Functions, AWS Glue and PySpark and filtered data stored in S3 buckets using Elasticsearch and loaded data into Hive external tables. Maintained and operated Hadoop cluster on AWS EMR.
- Configured Snowpipe to pull data from S3 buckets into Snowflake tables and stored incoming data in the Snowflake staging area.
- Created live real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system.
- Worked on Amazon Redshift to consolidate multiple data warehouses into a single data warehouse.
- Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra as per the business requirements.
- Designed, developed, deployed, and maintained MongoDB.
- Worked extensively on Hadoop components such as HDFS, Job Tracker, Task Tracker, NameNode, DataNode, YARN, Spark, and MapReduce programming.
- Worked extensively with Sqoop for importing and exporting data between HDFS and relational database systems (RDBMS).
- Wrote several MapReduce jobs using PySpark and NumPy and used Jenkins for continuous integration.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL, and a variety of portfolios.
- Optimized the Hive tables using optimization techniques like partitions and bucketing to provide better performance with HiveQL queries.
- Worked on cloud deployments using Maven, Docker, and Jenkins.
- Used Avro, Parquet, RCFile, and JSON file formats and developed UDFs in Hive.
- Worked on custom loaders and storage classes in Pig to handle data formats such as JSON, XML, and CSV, and generated bags for processing in Pig.
- Generated various kinds of reports using Power BI and Tableau based on client specifications.
Environment: Spark, Spark-Streaming, Spark SQL, AWS EMR, S3, EC2, MapR, HDFS, Hive, Apache Kafka, Sqoop, Python, Scala, PySpark, Shell scripting, Linux, MySQL, NoSQL, Jenkins, Eclipse, Git, Oozie, Tableau, Cassandra, and Agile Methodologies.
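Illustrative sketch of the Kafka-to-Cassandra streaming pattern described in this section, written with Spark Structured Streaming; the broker, topic, keyspace/table names, and event schema are hypothetical, and the spark-cassandra-connector is assumed to be installed on the cluster.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("learner-stream").getOrCreate()

# Hypothetical event schema for messages arriving on the Kafka topic.
schema = StructType([
    StructField("learner_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", LongType()),
])

events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "learner-events")               # placeholder topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Persist each micro-batch to Cassandra via the spark-cassandra-connector.
    (batch_df.write.format("org.apache.spark.sql.cassandra")
     .mode("append")
     .options(keyspace="analytics", table="learner_events")    # placeholder keyspace/table
     .save())

(events.writeStream
 .foreachBatch(write_to_cassandra)
 .option("checkpointLocation", "/tmp/checkpoints/learner-stream")
 .start()
 .awaitTermination())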
Confidential, Texas
Data Engineer
Responsibilities:
- Developed a data platform from scratch and took part in the requirement gathering and analysis phases of the project, documenting the business requirements.
- Worked on designing tables in Hive and MySQL using Sqoop and on processing data, including importing and exporting databases to and from HDFS.
- Involved in processing large datasets of different forms, including structured, semi-structured, and unstructured data.
- Developed REST APIs using Python with the Flask and Django frameworks and integrated various data sources, including Java, JDBC, RDBMS, shell scripting, spreadsheets, and text files.
- Worked with the Hadoop architecture and the Hadoop daemons, including NameNode, DataNode, Job Tracker, Task Tracker, and Resource Manager.
- Used AWS Data Pipeline for data extraction, transformation, and loading from homogeneous or heterogeneous data sources, and used the Python Matplotlib library to create various graphs for business decision-making.
- Developed scripts to load data into Hive from HDFS and was involved in ingesting data into the data warehouse using various data loading techniques.
- Scheduled jobs using crontab, Rundeck, and Control-M.
- Built Cassandra queries for performing CRUD operations (create, read, update, and delete) and used Bootstrap to manage and organize the HTML page layout.
- Created the user interface (UI) using JavaScript, Bootstrap, Cassandra with MySQL, and HTML5/CSS, and developed the frontend and backend modules using Python on the Django web framework.
- Ran import and export jobs to copy data to and from HDFS using Sqoop; developed Spark code and Spark SQL/Streaming for faster testing and processing; analyzed SQL scripts and designed the solutions to be implemented using PySpark.
- Used JSON and XML for serialization and de-serialization to load JSON and XML data into Hive tables.
- Used Spark SQL to load JSON data, create a schema RDD, and load it into Hive tables to handle structured data (see the illustrative sketch at the end of this section).
- Used PySpark to create data processing tasks such as reading data from external sources, merging data, performing data enrichment, and loading into target data stores.
- Added support for Amazon S3 and RDS to host static and media content, as well as the database, in the AWS cloud.
- Worked on application development, especially in the Linux environment, and am familiar with its commands.
- Worked on the Jenkins continuous integration tool for project deployment and deployed projects into Jenkins using the Git version control system.
- Managed data imports from various data sources, transformed the data using Hive, Pig, and MapReduce, and loaded it into the target data stores.
- Used the Oozie workflow engine to run multiple Hive and Pig jobs that executed independently based on time and data availability.
- Used Docker coupled with a load-balancing tool to achieve the continuous delivery goal in a highly scalable environment.
- Used MongoDB to store data in JSON format and developed and tested many features of the dashboard using Python, Bootstrap, CSS, and JavaScript.
Environment: Hadoop, Hive, Sqoop, Pig, Java, Django, Flask, XML, MySQL, MS SQL Server, Linux, shell scripting, MongoDB, SQL, Python 3.3, HTML5/CSS, Cassandra, JavaScript, PyCharm, Git, RESTful, Docker, Jenkins, JIRA, jQuery, Bootstrap, AWS, EC2, S3.
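Illustrative PySpark sketch of loading JSON into Hive tables with Spark SQL, as described in this section; the HDFS path, database, table, and column names are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("json-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# Read semi-structured JSON into a DataFrame (schema inferred).
events = spark.read.json("hdfs:///data/raw/events/*.json")

# Expose the DataFrame to SQL and write an aggregated result into a Hive-managed table.
events.createOrReplaceTempView("events_raw")
daily = spark.sql("""
    SELECT to_date(event_ts) AS event_date, event_type, COUNT(*) AS cnt
    FROM events_raw
    GROUP BY to_date(event_ts), event_type
""")
daily.write.mode("overwrite").saveAsTable("analytics.daily_event_counts")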
Confidential
Data Engineering Analyst
Responsibilities:
- Involved in importing data from Microsoft SQL Server, MySQL, and Teradata into HDFS using Sqoop.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS.
- Used Hive to analyze partitioned and bucketed data and compute various reporting metrics.
- Involved in creating Hive tables, loading data, and writing queries that run internally as MapReduce jobs.
- Involved in creating Hive External tables for HDFS data.
- Solved performance issues in Hive and Pig scripts with an understanding of joins, grouping, and aggregation and of how they translate into MapReduce jobs.
- Used Spark for transformations, event joins and some aggregations before storing the data into HDFS.
- Troubleshot and resolved data quality issues and maintained a high level of accuracy in the data being reported.
- Analyzed large data sets to determine the optimal way to aggregate them.
- Worked on the Oozie workflow to run multiple Hive and Pig jobs.
- Worked on creating custom Hive UDFs.
- Developed automated shell script to execute Hive Queries.
- Involved in processing ingested raw data using Apache Pig.
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
- Worked with different file formats such as JSON, Avro, ORC, and Parquet, and compression codecs such as Snappy, zlib, and LZ4.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs (see the illustrative sketch at the end of this section).
- Gained knowledge in creating Tableau dashboards for reporting on analyzed data.
- Expertise with NoSQL databases like HBase.
- Experienced in managing and reviewing the Hadoop log files.
- Used GitHub as a repository for committing code and retrieving it and Jenkins for continuous integration.
Environment: HDFS, MapReduce, Sqoop, Hive, Pig, Spark, Oozie, MySQL, Eclipse, GitHub, Jenkins.
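Illustrative sketch of converting a HiveQL query into equivalent Spark DataFrame transformations, as mentioned in this section; the warehouse.orders table and its columns are hypothetical placeholders.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hive-to-spark").enableHiveSupport().getOrCreate()

# Original Hive-style query executed through Spark SQL.
hive_style = spark.sql("""
    SELECT region, SUM(sales) AS total_sales
    FROM warehouse.orders
    WHERE order_date >= '2020-01-01'
    GROUP BY region
""")

# The same logic expressed as Spark DataFrame transformations.
df_style = (spark.table("warehouse.orders")
            .filter(F.col("order_date") >= "2020-01-01")
            .groupBy("region")
            .agg(F.sum("sales").alias("total_sales")))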
Confidential
Java Developer
Responsibilities:
- Involved in various phases of the Software Development Life Cycle (SDLC) of the application like Requirement gathering, Design, Analysis, and Code development.
- Developed the front end of the application using HTML, CSS, Backbone.js, JavaScript, and jQuery.
- Designed, developed, and implemented an MVC-pattern-based, keyword-driven automation testing framework using Java, JUnit, and Selenium WebDriver.
- Used automated scripts and performed functionality testing during the various phases of application development using Selenium.
- Used the AngularJS framework to store backend data in the model and populate it in the UI.
- Prepared user documentation with screenshots for UAT (User Acceptance Testing).
- Developed and implemented the MVC architectural pattern using the Struts framework, including JSP, Servlets, EJB, Form Bean, and Action classes.
- Helped develop page templates using the Struts Tiles framework.
- Implemented the Struts Validation Framework for server-side validation.
- Developed JSPs with custom tag libraries to control business processes in the middle tier and was involved in their integration.
- Implemented Struts Action classes using Struts controller component.
- Developed Web services (SOAP) through WSDL in Apache Axis to interact with other components.
- Implemented Java/J2EE Design patterns like Business Delegate and Data Transfer Object (DTO), Data Access Object and Service Locator.
Environment: Java 1.5, JSP, JDBC, Struts 1.2, Hibernate 3.0, design patterns, XML, Oracle, PL/SQL Developer, web services, SOAP, XSLT, Jira.