Senior Azure Data Engineer Resume
Arlington, TX
SUMMARY
- 7.5 years of professional software development experience with expertise in Big Data, the Hadoop ecosystem, cloud engineering, and data warehousing.
- Experience in large-scale application development using the Big Data ecosystem - Hadoop (HDFS, MapReduce, YARN), Spark, Hive, Impala, HBase, Airflow, Oozie, Zookeeper, AWS, and Azure.
- Experience in building data pipelines using Azure Data Factory and Azure Databricks, loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, and controlling and granting database access.
- Good experience with Azure services like HDInsight, Stream Analytics, Active Directory, Blob Storage, Cosmos DB, Storage Explorer.
- Extensive experience in Azure cloud services (PaaS & IaaS): Storage, Data Factory, Data Lake (ADLA & ADLS), Active Directory, Synapse, Logic Apps, Azure Monitoring, Key Vault, and SQL Azure.
- Experience with the Azure DevOps process, creating release pipelines and deploying components such as ADF and U-SQL into higher environments.
- Sound experience with AWS services like Amazon EC2, S3, EMR, Amazon RDS, VPC, Amazon Elastic Load Balancing, IAM, Auto Scaling, CloudFront, CloudWatch, and Lambda to trigger resources.
- Strong Hadoop and platform support experience with the entire suite of tools and services in major Hadoop distributions - Cloudera, Amazon EMR, Azure HDInsight, and Hortonworks.
- Proficient in handling and ingesting terabytes of streaming data (Spark Streaming, Storm) and batch data, plus automation and scheduling (Oozie, Airflow).
- Expert in Hadoop and Big data ecosystem including Hive, HDFS, Spark, Kafka, MapReduce, Sqoop, Oozie and Zookeeper
- Experience creating Web-Services with the Python programming language.
- Extensive experience in data ingestion technologies like Flume, Kafka, Sqoop and NiFi
- Utilize Flume, Kafka, and NiFi to ingest real-time and near real-time streaming data into HDFS from different data sources.
- Profound knowledge in developing production-ready Spark applications using Spark Components like Spark SQL, MLlib, GraphX, DataFrames, Datasets, Spark-ML and Spark Streaming.
- Expertise in ingesting and storing streaming data in HDFS and processing it using Spark.
- Strong working experience with SQL and NoSQL databases (Cosmos DB, AWSDB, HBase, Cassandra), data modeling, tuning, disaster recovery, backup, and creating data pipelines.
- Experienced in scripting with Python (PySpark), Scala, and Spark SQL for development and aggregation across various file formats such as XML, JSON, CSV, Avro, Parquet, and ORC.
- Great experience in data analysis using HiveQL, Hive ACID tables, Pig Latin queries, and custom MapReduce programs, achieving improved performance.
- Worked with Spark to consume data from Kafka and convert it to a common format using Scala.
- Experience with the ELK stack to develop search engines on unstructured data within NoSQL databases in HDFS.
- Extensive knowledge in all phases of Data Acquisition, Data Warehousing (gathering requirements, design, development, implementation, testing, and documentation), Data Modeling (analysis using Star Schema and Snowflake for Fact and Dimensions Tables), Data Processing and Data Transformations (Mapping, Cleansing, Monitoring, Debugging, Performance Tuning and Troubleshooting Hadoop clusters).
- Experience in monitoring document growth and estimating storage size for large AWSDB clusters as part of the data life cycle management.
- Designed SQL, SSIS, and Python based batch and real-time ETL pipelines to extract data from transactional and operational databases and load the data into data warehouses.
- Implemented CRUD operations using Cassandra Query Language (CQL) and analyzed data from Cassandra tables for quick searching, sorting, and grouping on top of the Cassandra File System.
- Hands-on experience on Ad-hoc queries, Indexing, Replication, Load balancing, Aggregation in AWSDB.
- Good knowledge in understanding the security requirements like Azure Active Directory, Sentry, Ranger, and Kerberos authentication and authorization infrastructure.
- Expertise in creating Kubernetes clusters with CloudFormation templates and PowerShell scripting to automate deployment in a cloud environment.
- Sound knowledge in developing highly scalable and resilient Restful APIs, ETL solutions, and third-party integrations as part of Enterprise Site platform using Informatica.
- Experience in using bug tracking and ticketing systems such as Jira and Remedy; used Git and SVN for version control.
- Highly involved in all facets of SDLC using Waterfall and Agile Scrum methodologies.
- Experience in designing interactive dashboards, reports, performing ad-hoc analysis, and visualizations using Tableau, Power BI, Arcadia, and Matplotlib.
- Involved in migration of the legacy applications to cloud platform using DevOps tools like GitHub, Jenkins, JIRA, Docker, and Slack.
- Collaborate regularly with business, production support, and engineering teams to dive deep on data, enable effective decision making, and support analytics platforms.
- Coordinate with cross-functional teams to execute short and long-term product delivery strategies, with a successful track record of implementing best business practices.
- Good communication and strong interpersonal and organizational skills with the ability to manage multiple projects; always willing to learn and adopt new technologies.
TECHNICAL SKILLS
Big Data Ecosystem: HDFS, Yarn, MapReduce, Spark, Hive, Airflow, Sqoop, HBase, Oozie, Sentry, Ranger
Hadoop Distributions: Apache Hadoop 2.x/1.x, Cloudera CDP, Hortonworks HDP
Scripting Languages: Python, Java, Scala, R, PowerShell Scripting, Pig Latin, HiveQL.
Cloud Environment: Amazon AWS - EMR, EC2, EBS, RDS, S3, Athena, Glue, Elasticsearch, Lambda, SQS, DynamoDB, Redshift, ECS, QuickSight, Kinesis; Microsoft Azure - Databricks, Data Lake, Blob Storage, Azure Data Factory, SQL Database, SQL Data Warehouse, Cosmos DB, Azure Active Directory
Data Workflow: Sqoop, Flume, Kafka
NoSQL Database: Cassandra, Redis, AWSDB, Neo4j
Database: MySQL, Oracle, Teradata, MS SQL SERVER, PostgreSQL, DB2
ETL/BI: Snowflake, Informatica, Tableau, Power BI
Operating systems: Linux (Ubuntu, Centos, RedHat), Windows (XP/7/8/10/11)
Version Control: Azure DevOps, Git, SVN, Bitbucket
Others: Machine learning, NLP, Spring Boot, Jupyter Notebook, Docker, Kubernetes, Jenkins, Ansible, Splunk, Jira
PROFESSIONAL EXPERIENCE
Senior Azure Data Engineer
Confidential, Arlington, TX
Responsibilities:
- Extract, transform, and load data from various sources to Azure data storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
- Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks with PySpark.
- Migrating data from on-prem SQL server to Cloud databases (Azure Synapse Analytics (DW) & Azure SQL DB).
- Created Azure Data Lake Storage (ADLS) Gen 1 and Gen 2 and ingested data from flat files, CSV files, JSON files, and on-premises database tables using Azure Data Factory V2 (ADF).
- Implemented Spark using Python and SparkSQL for faster testing and processing of data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Responsible for ingesting data from various source systems (RDBMS, Flat files, Big Data) into Azure (Blob Storage) using framework model.
- Moved data from Teradata, Oracle, Snowflake, and SQL Server to ADLS Gen 2.
- Automated Power BI report refreshes using Azure Data Factory (ADF) trigger pipelines that run when source data is updated, based on the change log.
- Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors and tasks, deployment modes, the execution hierarchy, fault tolerance, and collection.
- Developed Talend jobs to populate the claims data to data warehouse using Star schema, Snowflake schema, Hybrid Schema (depending on the use case).
- Used Hadoop scripts for HDFS (Hadoop Distributed File System) data loading and manipulation for a few of the source systems.
- Involved in application design and data architecture using cloud and Big Data solutions on Azure.
- Led the effort to migrate a legacy system to a Microsoft Azure cloud-based solution, re-designing the legacy application solutions with minimal changes to run on the cloud platform.
- Worked on building the data pipeline using Azure services like Data Factory to load data from the legacy SQL Server to the Azure database, using Data Factory pipelines, API gateway services, SSIS packages, Talend jobs, and custom .NET and Python code.
- Followed Databricks platform best practices for securing network access to cloud applications.
- Hands-on experience creating Delta Lake tables and applying partitions for faster querying (a minimal PySpark sketch follows this list).
- Extensive knowledge on performance tuning of streaming jobs using DStreams.
- Expert in performance tuning of Spark jobs by allocating the right memory, executors, and cores without overburdening the cluster.
- Used SQLAlchemy, a Python library, for complete SQL access from Python code.
- Worked with Python OpenStack APIs and used Python scripts to update content in the database and manipulate files.
- Developed PySpark scripts to perform ETL in AWS Glue jobs, extracting data from S3 using crawlers and creating a Data Catalog to store the metadata.
- Designed and developed an entire module in Python and deployed it in AWS Glue using the PySpark library.
- Performed data validation by flattening the files and automating record-wise counts and datatype checks between the source system and the destination landing zone.
- Performed ETL operations using Python, Spark SQL, S3 and Redshift on terabytes of data to obtain customer insights
- Involved in the data support team, handling bug fixes, schedule changes, memory tuning, schema changes, and loading of historical data.
- Developed Oozie workflows for scheduling and orchestrating the ETL process; involved in writing Python scripts to automate the extraction of weblogs using Airflow DAGs (see the DAG sketch after this list).
- Worked on both Agile and Kanban methodologies.
- Primarily responsible for creating new Azure subscriptions, Data Factories, VNets, subnets, SQL Azure instances, SQL Azure DW instances, and HDInsight clusters, and installing DMGs on self-hosted VMs to connect to on-premises servers.
- Worked on integrating Git into the continuous integration (CI) environment along with Jenkins.
- Using JIRA for issues and project tracking, TFS for version control, and Control-M for scheduling the jobs.
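A minimal PySpark sketch of the ADLS-to-Delta-Lake pattern described above; the storage account, container, paths, and table names are hypothetical placeholders, and on Databricks the spark session is provided by the runtime.

```python
# Sketch only: storage account, container, paths, and table names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adls_to_delta").getOrCreate()

raw_path = "abfss://raw@examplestorageacct.dfs.core.windows.net/claims/2023/"

# Read CSV files landed in ADLS Gen2 and stamp each row with the load date.
claims_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
    .withColumn("load_date", F.current_date())
)

# Write a Delta table partitioned by load_date (target database assumed to exist).
(
    claims_df.write
    .format("delta")
    .mode("append")
    .partitionBy("load_date")
    .saveAsTable("curated.claims")
)
```

Partitioning by load_date lets downstream queries prune files instead of scanning the full table.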
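A hedged sketch of an Airflow DAG for the weblog-extraction automation mentioned above, assuming Airflow 2.x; the schedule, task name, and extract_weblogs callable are illustrative assumptions, not the production code.

```python
# Illustrative Airflow 2.x DAG: schedule, task name, and callable are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_weblogs(**context):
    # Placeholder: pull the logical date's weblogs and stage them for the ETL job.
    print(f"Extracting weblogs for {context['ds']}")


with DAG(
    dag_id="weblog_extraction",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_weblogs",
        python_callable=extract_weblogs,
    )
```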
Environment: Azure Data Factory, Azure SQL, Azure Databricks, Azure DW, BLOB storage, Spark, Spark SQL, Python, Control-M scheduler, Kafka, PySpark.
Big Data/Elasticsearch Engineer
Confidential, Plano, TX
Responsibilities:
- Responsible for the design, installation, and maintenance of all ELK and Kafka services.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Scala for data cleaning and preprocessing.
- Good understanding of NoSQL databases and hands-on work experience writing applications against NoSQL databases HBase, Cassandra, and AWSDB.
- Worked with various HDFS file formats like Parquet, IAM, and JSON for serializing and deserializing data.
- Used Spark for interactive queries, processing of streaming data, and integration with popular NoSQL databases for huge volumes of data.
- Created data pipelines using processor groups and multiple processors in Apache NiFi for flat-file and AWS sources.
- Team player on projects and systems migrations, including migrating Elasticsearch and Kafka from old versions to new versions while ensuring the platforms remain stable, reliable, and available; provided production support for Kafka and Elasticsearch and developed templates to manage multiple Elastic indexes, index patterns, users, and roles (see the index-template sketch after this list); excellent customer-facing skills.
- Experienced with Scala and Spark, improving the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, pair RDDs, and Spark on YARN.
- Responsible for estimating the cluster size, monitoring, and troubleshooting of the Spark Databricks cluster.
- Migrated MapReduce jobs to Spark jobs to achieve better performance.
- Explored new Elastic components that can improve the existing system, such as Elastic Cloud Enterprise to centralize all clusters, along with Ansible playbooks and machine learning.
- Installed, configured, administered, and supported multiple Kafka and Elasticsearch clusters; performed maintenance, troubleshooting, capacity planning, and growth projections.
- Experience with analytical reporting and facilitating data for QuickSight and Tableau dashboards.
- Extracted and updated the data into HDFS using Sqoop import and export.
- Developed Hive UDFs to incorporate external business logic into Hive scripts and developed join dataset scripts using Hive join operations.
- Worked with Spark to improve performance and optimize the existing algorithms in Hadoop using Spark Context, Spark SQL, PySpark, Impala, Tealeaf, pair RDDs, DevOps, and Spark on YARN.
- Implemented data quality checks in the ETL tool Talend; good knowledge of data warehousing.
- Wrote and implemented custom UDFs in Pig for data filtering.
- Used the Spark DataFrame API in Scala for analyzing data.
- Good experience using relational databases Oracle, MySQL, SQL Server, and PostgreSQL.
- Worked with NoSQL databases like HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Designed data models used in data-intensive AWS Lambda applications aimed at complex analysis, creating analytical reports for end-to-end traceability, lineage, and definition of key business elements from Aurora.
- Deployed the project on Amazon EMR with S3 connectivity for setting backup storage.
- Conducted ETL Data Integration, Cleansing, and Transformations using AWS Glue Spark script.
- Worked on AWS Lambda functions in Python that invoke Python scripts to perform various transformations and analytics on large data sets in EMR clusters.
- Developed Apache Spark applications for data processing from various streaming sources.
- Responsible for developing a data pipeline using Spark, Scala, and Apache Kafka to ingest data from the CSL source and store it in a protected HDFS folder (a streaming sketch follows this list).
- Implemented many Kafka ingestion jobs for real-time and batch data processing.
- Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in HDFS; worked extensively with Sqoop for importing metadata from Oracle.
- Exposure to Spark, Spark Streaming, Spark MLlib, Snowflake, and Scala; created DataFrames handled in Spark with Scala.
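A minimal PySpark Structured Streaming sketch of the Kafka-to-HDFS ingestion pattern above (the production pipeline was written in Scala); the broker address, topic, and HDFS paths are hypothetical, and the job assumes the spark-sql-kafka connector package is on the classpath.

```python
# Sketch only: broker, topic, and HDFS paths below are placeholders.
# Requires the org.apache.spark:spark-sql-kafka-0-10 package at submit time.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka_to_hdfs").getOrCreate()

# Read the raw event stream from Kafka as strings.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "weblogs")
    .option("startingOffsets", "latest")
    .load()
    .select(F.col("value").cast("string").alias("raw_event"), F.col("timestamp"))
)

# Land the events in a protected HDFS folder as Parquet, with checkpointing
# so the job can recover from failures.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///protected/weblogs/raw")
    .option("checkpointLocation", "hdfs:///protected/weblogs/_checkpoints")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```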
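A hedged sketch of managing Elasticsearch index templates from Python, as referenced in the production-support bullet above; it assumes the elasticsearch-py 8.x client, and the host, template name, and mappings are illustrative placeholders rather than the production configuration.

```python
# Illustrative only (elasticsearch-py 8.x assumed): host, template name, and
# mappings below are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Composable index template so every new daily weblog index gets consistent
# settings and mappings.
es.indices.put_index_template(
    name="weblogs-template",
    index_patterns=["weblogs-*"],
    template={
        "settings": {"number_of_shards": 3, "number_of_replicas": 1},
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},
                "client_ip": {"type": "ip"},
                "url": {"type": "keyword"},
            }
        },
    },
)
```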
Environment: Elastic, Logstash, Kibana, Kafka, Confluent, KSQL, Ambari, Hadoop, MySQL, Couchbase, GitHub, Tableau, Linux, Hortonworks, NiFi, AWS S3.
Hadoop/Python Developer
Confidential, Englewood, CO
Responsibilities:
- Responsible for collecting, cleaning, and storing data for analysis using Sqoop, Kafka, Spark, and HDFS.
- Used the Kafka and Spark frameworks for real-time and batch data processing.
- Ingested large amounts of data from different data sources into HDFS using Kafka.
- Implemented Spark using Scala and performed cleansing of data by applying Transformations and Actions
- Used case classes in Scala to convert RDDs into DataFrames in Spark.
- Designed the front end of the application using Python on the Django web framework, HTML, CSS, JSON, and jQuery.
- Processed and analyzed data stored in HBase and HDFS.
- Developed real time data processing applications by using Scala and Python and implemented Apache Spark Streaming from various streaming sources like Kafka
- Prepared queries for Databricks using Python, Spark, SQL and NoSQL.
- Worked on advanced analytics, data extraction, and manipulation using Python with modules like Pandas, NumPy, and Matplotlib (a short sketch follows this list).
- Developed Spark jobs using Scala on top of Yarn for interactive and Batch Analysis.
- Developed Unix shell scripts to load a large number of files into HDFS from the Linux file system.
- Experience in querying data using Spark SQL for faster processing of the data sets.
- Offloaded data from EDW into Hadoop Cluster using Sqoop.
- Worked with Spark to consume data from Kafka and convert it to a common format using Scala.
- Created Python notebooks on Azure Databricks for processing the datasets and load them into Azure SQL databases.
- Generated property list for every application dynamically using Python
- Developed Sqoop scripts for importing and exporting data into HDFS and Hive
- Created Hive internal and external tables with partitioning and bucketing for further analysis using Hive (a DDL sketch follows this list).
- Used Oozie workflow to automate and schedule jobs
- Reverse engineered and re-implemented legacy back-end software in Python.
- Used Zookeeper for maintaining and monitoring clusters
- Exported the data into AWS using Sqoop for the BI team to perform visualization and generate reports.
- Responsible for collecting, scrubbing, and extracting data from various sources to generate reports, dashboards, and analytical solutions; helped in debugging the Tableau dashboards.
- Continuously monitored and managed the Hadoop Cluster using Cloudera Manager
- Used JIRA for project tracking and participated in daily scrum meetings
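A minimal sketch of the partitioned and bucketed Hive table pattern mentioned above, driven from PySpark; the database, table, column names, and HDFS location are hypothetical.

```python
# Sketch: database, table, column names, and HDFS location are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive_tables")
    .enableHiveSupport()
    .getOrCreate()
)

# External table over files already landed in HDFS, partitioned by load date
# and bucketed by customer_id so joins and sampling stay cheap.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.web_events (
        event_id    STRING,
        customer_id BIGINT,
        event_type  STRING
    )
    PARTITIONED BY (load_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
    LOCATION 'hdfs:///data/analytics/web_events'
""")

# Register partitions that were written outside of Hive/Spark.
spark.sql("MSCK REPAIR TABLE analytics.web_events")
```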
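A short, hedged Pandas/Matplotlib sketch of the ad-hoc analysis described above; the input file and column names are illustrative assumptions.

```python
# Illustrative ad-hoc analysis: file name and column names are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

# Load an extract produced by the upstream pipeline.
df = pd.read_csv("daily_events.csv", parse_dates=["event_date"])

# Aggregate events per day and plot the trend.
daily_counts = df.groupby("event_date")["event_id"].count()

daily_counts.plot(kind="line", title="Events per day")
plt.xlabel("Date")
plt.ylabel("Event count")
plt.tight_layout()
plt.savefig("events_per_day.png")
```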
Environment: Spark, Sqoop, Scala, Hive, Kafka, YARN, Teradata, AWS, HDFS, Oozie, Zookeeper, HBase, Tableau, Hadoop (Cloudera), JIRA
Oracle PL/SQL Developer
Confidential
Responsibilities:
- Participated in Designing databases (schemas) to ensure that the relationship between data is guided by tightly bound Key constraints.
- Involved in Business Requirements, System analysis and Design of the Data warehouse application.
- Data Analysis Primarily Identifying Data Sets, Source Data, Source Meta Data, Data Definitions and Data Formats. Designed Physical and Logical Data model and Data flow diagrams.
- Involved in the creation of database objects like Tables, Views, Stored Procedures, Functions, Packages, DB triggers and Indexes.
- Worked on various tables to create indexes to improve query performance; also worked on partitioning tables using range partitioning, creating index-organized tables, and rollback tablespaces.
- Wrote conversion scripts using SQL, PL/SQL, T-SQL, stored procedures, functions, and packages to migrate data from SQL server database to Oracle database.
- Maintain applications data models and support legacy application data migration using PL/SQL and stored procedures, which includes checking database source code in/out of source control packages.
- Built database objects like Tables and Views. Defined both logical views and physical data structure using star schema.
- Converted all Oracle ETL Packages to Informatica Mappings and created workflows/Sessions.
- Developed Informatica mappings using various transformations and PL/SQL packages to extract, cleanse, transform, and load data.
- Filtered and Loaded data from different formats of data sources into Database Tables.
- Extracted required data from the database tables and exported the data to different sources in different formats; worked with several tools to access and perform operations on the database, and gained experience generating reports.
- Developed Data entry, query and reports request screens and tuned the SQL queries.
- Used joins and indexes effectively in WHERE clauses for query optimization.
- Assisted in gathering requirements by performing system analysis of the requirements with the technology teams.
Environment: Toad, SQL* Plus, SQL* Loader, Oracle 10g, SQL Server, Informatica 8.5, Windows XP/7, UNIX.
Software Engineer Intern
Confidential
Responsibilities:
- Designed and developed a web application using HTML5, CSS, JavaScript, JSP, jQuery.
- Participated in the analysis, design, and development phase of the Software Development Lifecycle (SDLC).
- Developed front-end components using HTML, JavaScript and jQuery, Back End components using Java, Spring, Hibernate, Services Oriented components using Restful, and SOAP based web services, and Rules based components using JBoss Drools.
- Wrote SQL queries, joins, views, stored procedures for multiple databases, Oracle, and SQL Server 2005.
- Wrote Stored Procedures using PL/SQL.
- Implemented query optimization to achieve faster indexing and making the application more scalable.
- Worked with Struts MVC objects like Action Servlet, Controllers, validators, Handler Mapping, Message Resource Bundles, Form Controller, and JNDI for look-up for J2EE components.
- Implemented the connectivity to the database server Using JDBC.
- Developed the RESTful web services using Spring IOC to provide application users a way to run the job and generate daily status reports.
- Created SOAP handler to enable authentication and audit logging during web service calls.
- Used Restful Services to interact with the Client by providing the Restful URL mapping.
- Created Service Layer API and Domain objects using STRUTS.
- Implemented the project using the Agile Scrum methodology; involved in daily stand-up meetings, sprint showcases, and sprint retrospectives.
Environment: Java, Java JDK, JEE, JAX, Web services, REST API, JSON, Java Beans, jQuery, JavaScript, Oracle, Spring Framework, Spring Model View Controller (MVC), Java Server Pages (JSP), Servlets, JDBC, JUnit, HTML5, CSS, Eclipse.