Azure Data Engineer Resume
San Rafael, CA
SUMMARY
- 6+ years of professional experience as a Data Engineer with expertise in Python, Azure, Spark, the Hadoop ecosystem, and cloud services.
- Extensive experience in developing applications that perform data processing tasks using Teradata, Oracle, SQL Server, and MySQL databases.
- Experience in extracting, transforming, and loading (ETL) data from various sources into data warehouses, as well as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, and Power BI.
- Involved in developing roadmaps and deliverables to advance the migration of existing on-premises systems and applications to the Azure cloud.
- Experience with Azure transformation projects, implementing ETL and data movement solutions using Azure Data Factory (ADF) and SSIS.
- Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment.
- Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- Design & implement migration strategies for traditional systems on Azure (Lift and shift/Azure Migrate, other third-party tools) worked on Azure suite: Azure SQL Database, Azure Data Lake(ADLS), Azure Data Factory(ADF) V2, Azure SQL Data Warehouse, Azure Service Bus, Azure key Vault, Azure Analysis Service(AAS), Azure Blob Storage, Azure Search, Azure App Service, Azure data Platform Services.
- Extensive knowledge and experience with relational database management systems: normalization, stored procedures, constraints, joins, indexes, data import/export, and triggers.
- Expert at SSIS data transformations such as Lookup, Derived Column, Conditional Split, Sort, Data Conversion, Multicast, Union All, Merge Join, Merge, Fuzzy Lookup, Fuzzy Grouping, Pivot, Unpivot, and SCD to load data into SQL Server destinations.
- Hadoop distributions: Cloudera, Amazon EMR, Azure HDInsight, and Hortonworks.
- Experience developing iterative algorithms using Spark Streaming in Scala and Python to build near real-time dashboards.
- Experienced in loading data into Hive partitions, creating Hive buckets, and developing MapReduce jobs to automate data transfer from HBase (see the sketch after this list).
- Expertise in working with AWS cloud services such as EMR, S3, Redshift, Lambda, DynamoDB, RDS, SNS, SQS, Glue, Data Pipeline, and Athena for big data development.
- Worked with various file formats such as CSV, JSON, XML, ORC, Avro, and Parquet.
- Worked on data processing, transformations, and actions in Spark using Python (PySpark).
- Expertise in writing DDL and DML scripts in SQL and HQL for analytics applications in RDBMS.
- Experienced in writing Spark scripts in Python, Scala, and SQL to support development and data analysis.
- Involved in all the phases of Software Development Life Cycle (Requirements Analysis, Design, Development, Testing, Deployment, and Support) and Agile methodologies.
- Strong understanding of data modeling and ETL processes in data warehouse environments, including star and snowflake schemas.
- Strong working knowledge across the technology stack including ETL, data analysis, data cleansing, data matching, data quality, audit, and design.
- Experienced with continuous integration and build tools such as Jenkins, and with Git and SVN for version control.
- Utilized Kubernetes and Docker for the runtime environment for the CI/CD system to build, test, and deploy.
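Illustrative sketch (referenced in the Hive bullet above): a minimal PySpark example of creating and loading a partitioned, bucketed Hive table. The table, staging view, and column names are hypothetical, not taken from any specific project.

# Minimal sketch: create a partitioned, bucketed Hive table in ORC and load it
# from a hypothetical staging table using PySpark. Names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partition-bucket-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_orc (
        order_id BIGINT,
        amount   DOUBLE
    )
    PARTITIONED BY (order_date STRING)
    CLUSTERED BY (order_id) INTO 8 BUCKETS
    STORED AS ORC
""")

# Allow Hive to derive partition values from the data being written.
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")

staging = spark.table("staging_sales")  # hypothetical staging table
# insertInto is position-based; the partition column must come last.
# Note: Spark does not strictly enforce Hive bucketing on write, so a Hive-side
# INSERT may be preferred where bucket layout must be guaranteed.
(staging.select("order_id", "amount", "order_date")
        .write.mode("append")
        .insertInto("sales_orc"))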
TECHNICAL SKILLS
Big Data Eco System: HDFS, Spark, MapReduce, Hive, Pig, Sqoop, Flume, HBase, Kafka Connect, Impala, StreamSets, Oozie, Airflow, Zookeeper, Amazon Web Services.
Hadoop Distributions: Apache Hadoop 1.x/2.x, Cloudera CDP, Hortonworks HDP
Languages: Python, Scala, Java, Pig Latin, HiveQL, Shell Scripting.
Software Methodologies: Agile, Waterfall (SDLC).
Databases: MySQL, Oracle, DB2, PostgreSQL, DynamoDB, MS SQL SERVER, Snowflake.
NoSQL: HBase, MongoDB, Cassandra.
ETL/BI: Power BI, Tableau, Informatica.
Version control: GIT, SVN, Bitbucket.
Operating Systems: Windows (XP/7/8/10), Linux (Unix, Ubuntu), Mac OS.
Cloud Technologies: Amazon Web Services (EC2, S3, SQS, SNS, Lambda, EMR, CodeBuild, CloudWatch); Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, Cosmos DB, Azure DevOps, Active Directory).
PROFESSIONAL EXPERIENCE
Confidential, San Rafael, CA
Azure Data Engineer
Responsibilities:
- Design and implement end-to-end data solutions (storage, integration, processing, visualization) in Azure.
- Design and implement database solutions in Azure SQL Data Warehouse and Azure SQL.
- Involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
- Created Hive internal and external tables with partitioning, clustering, and bucketing techniques in ORC and Parquet formats.
- Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Develop dashboards and visualizations to help business users analyze data and provide data insights to upper management, with a focus on Microsoft products such as SQL Server Reporting Services (SSRS) and Power BI.
- Pulled data into Power BI from various sources such as SQL Server, Excel, Oracle, and Azure SQL.
- Develop conceptual solutions & create proof-of-concepts to demonstrate viability of solutions.
- Architect and implement ETL and data movement solutions using Azure Data Factory and SSIS; create and run SSIS packages in ADF V2 using the Azure-SSIS Integration Runtime.
- Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment; gained experience in DWH/BI project implementation using Azure Data Factory.
- Migrate data from traditional database systems to Azure databases.
- Interacted with business analysts, users, and SMEs on requirements.
- Designed logical and physical data models for the staging, DWH, and data mart layers.
- Implement ad-hoc analysis solutions using Azure Data Lake Analytics/Store, HDInsight.
- Design & implement migration strategies for traditional systems on Azure (Lift and shift/Azure Migrate, other third-party tools).
- Engage with business users to gather requirements, design visualizations, and provide training to use self-service BI tools.
- Developed and maintained multiple Power BI dashboards/reports and content packs.
- Created POWER BI Visualizations and Dashboards as per the requirements.
- Installed the SQL Server 2018 Database Engine, SSIS, and SSRS features in the development environment as needed.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns (see the sketch below).
- Responsible for estimating the cluster size, monitoring, and troubleshooting of the Spark Databricks cluster.
- Experienced in performance tuning of Spark Applications for setting right Batch Interval time, correct level of Parallelism and memory tuning.
- Experience in dealing with ambiguities to create rational and functional solutions from imperfect requirements.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Responsible for creating Requirements Documentation for various projects.
- Strong analytical skills, proven ability to work well in a multi-disciplined team environment, and adept at learning new tools and processes with ease.
- Experience with SaaS (Software as a Service), PaaS (Platform as a Service), and IaaS (Infrastructure as a Service) solutions.
- Experience in dealing with Azure IaaS: Virtual Networks, Virtual Machines, Cloud Services, Resource Groups, ExpressRoute, Traffic Manager, VPN, Load Balancing, Application Gateways, and Auto-Scaling.
- Experience in handling Azure Storage, Blob Storage, and File Storage, and setting up Azure CDN and load balancers.
Environment: Azure SQL, Azure Data Factory, Azure Storage Explorer, Azure Blob, Power BI Desktop, PowerShell, C# .NET, Adobe Analytics, Fiddler, SSIS, SSAS, SSRS, Asimov X-Flow, DataGrid, ETL (Extract, Transform, Load), Business Intelligence (BI), Azure Storage, Azure Blob Storage, Azure Backup, Azure Files, Azure Data Lake Storage Gen1/Gen2.
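Illustrative sketch for the Spark SQL work in Databricks noted above: mount paths, schemas, and view names are assumptions, not values from the actual engagement.

# Read two file formats, register temp views, and aggregate usage per customer
# with Spark SQL. Mount paths and columns are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("usage-aggregation-sketch").getOrCreate()

events    = spark.read.json("/mnt/raw/usage_events/")        # hypothetical mount
customers = spark.read.parquet("/mnt/curated/customers/")    # hypothetical mount

events.createOrReplaceTempView("usage_events")
customers.createOrReplaceTempView("customers")

usage_by_customer = spark.sql("""
    SELECT c.customer_id,
           c.segment,
           COUNT(*)            AS event_count,
           SUM(e.duration_sec) AS total_duration_sec
    FROM usage_events e
    JOIN customers c ON e.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
""")

usage_by_customer.write.mode("overwrite").parquet("/mnt/curated/usage_by_customer/")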
Confidential, Parsippany, NJ
Azure Data Engineer
Responsibilities:
- Hands-on experience in installing, configuring, and using Hadoop ecosystem components such as Hadoop MapReduce, HDFS, HBase, Hive, Spark, Sqoop, Pig, Zookeeper, and Flume.
- Designed and developed the data warehouse and business intelligence architecture; designed the ETL process from various sources into Hadoop/HDFS for analysis and further processing of data modules.
- Designed and created Azure Data Factory (ADF) pipelines extensively for ingesting data from different source systems, both relational and non-relational, to meet business functional requirements.
- Extracted, transformed, and loaded data from source systems to Azure data storage services using a combination of Azure Data Factory, Spark SQL, and Azure Data Lake Analytics.
- Ingested data into Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
- Created and provisioned numerous Databricks clusters for batch and continuous streaming data processing and installed the required libraries on the clusters.
- Developed ADF pipelines to load data from on-premises sources to Azure cloud storage and databases.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Worked extensively with SparkContext, Spark SQL, RDD transformations and actions, and DataFrames.
- Developed custom ETL solutions, batch processing, and real-time data ingestion pipelines to move data in and out of Hadoop using PySpark and shell scripting.
- Ingested high volumes and a wide variety of data from disparate source systems into Azure Data Lake Gen2 using Azure Data Factory V2 and Azure cluster services.
- Created Spark RDDs from data files and then performed transformations and actions to other RDDs.
- Created Hive tables with dynamic and static partitioning, including buckets, for efficiency; also created external tables in Hive for staging purposes.
- Loaded Hive tables with data, wrote Hive queries that run on MapReduce, and created a customized BI tool for management teams that performs query analytics using HiveQL.
- Wrote UDFs in Scala and PySpark to meet specific business requirements.
- Developed Spark applications using Spark SQL in EMR for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Utilized Spark in-memory capabilities to handle large datasets.
- Used broadcast variables in Spark for effective and efficient joins, transformations, and other data processing capabilities (see the sketch below).
- Experienced in working with EMR cluster and S3 in AWS cloud.
- Created Hive tables and loaded and analyzed data using Hive scripts.
- Implemented partitioning (both dynamic and static partitions) and bucketing in Hive.
- Involved in continuous integration of the application using Jenkins.
- Led the installation, integration, and configuration of Jenkins CI/CD, including installation of Jenkins plugins.
- Implemented a CI/CD pipeline with Docker, Jenkins, and GitHub, virtualizing the Dev and Test environment servers with Docker and configuring automation through containerization.
- Installing, configuring, and administering Jenkins CI tool using Chef on AWS EC2 instances.
- Performed Code Reviews and responsible for Design, Code, and Test signoff.
- Worked on designing and developing a real-time tax computation engine using Oracle, StreamSets, Kafka, and Spark Structured Streaming.
- Validated data transformations and performed End-to-End data validations for ETL workflows loading data from XMLs to EDW.
- Extensively utilized Informatica to create complete ETL process and load data into database which was to be used by Reporting Services.
- Created Tidal Job events to schedule the ETL extract workflows and to modify the tier point notifications.
Environment: Python, SQL, Oracle, Hive, Scala, Power BI, Azure Data Factory, Data Lake, Docker, MongoDB, Kubernetes, PySpark, SNS, Kafka, Data Warehouse, Sqoop, Pig, Zookeeper, Flume, Hadoop, Airflow, Spark, EMR, EC2, S3, Git, GCP, Lambda, Glue, ETL, Databricks, Snowflake, AWS Data Pipeline.
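Sketch of the broadcast-join pattern referenced above: the small dimension table is broadcast to every executor so the large fact table is joined without a shuffle. Paths, table names, and the join key are illustrative placeholders.

# Broadcast join in PySpark: ship the small lookup table to all executors.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

transactions = spark.read.parquet("s3a://example-lake/transactions/")  # large fact table
merchants    = spark.read.parquet("s3a://example-lake/merchants/")     # small dimension table

# broadcast() hints Spark to replicate the small side instead of shuffling both sides.
enriched = transactions.join(broadcast(merchants), on="merchant_id", how="left")

enriched.write.mode("overwrite").parquet("s3a://example-lake/enriched_transactions/")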
Confidential, Atlanta, GA
Big Data Developer
Responsibilities:
- Developed MapReduce jobs in both Pig and Hive for data cleaning and pre-processing.
- Imported Legacy data from SQL Server and Teradata into Amazon S3.
- Created consumption views on top of metrics to reduce the running time for complex queries.
- Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
- Compared data at the leaf level across various databases when data transformation or loading took place.
- Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
- Closely involved in scheduling Daily, Monthly jobs with Precondition/Post condition based on the requirement.
- Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
- Worked on analyzing Hadoop Cluster and different big data analytic tools.
- Worked with various HDFS file formats such as Avro, SequenceFile, and JSON.
- Working experience with data streaming processes using Kafka, Apache Spark, and Hive.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Developed Spark scripts using Scala shell commands as per requirements.
- Imported data from Cassandra databases and stored it in AWS.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Used the AWS CLI for data transfers to and from Amazon S3 buckets.
- Executed Hadoop/Spark jobs on AWS EMR, with data stored in S3 buckets.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Implemented Spark RDD transformations and actions to support business analysis.
- Interacted with the infrastructure, network, database, application, and BA teams to ensure data quality and availability.
- Worked on ingesting high volumes of tuning events generated by client set-top boxes, from Elasticsearch in batch mode and from Amazon Kinesis streams in real time via Kafka brokers, into the enterprise data lake using Python and NiFi.
- Devised PL/SQL Stored Procedures, Functions, Triggers, Views, and packages. Made use of Indexing, Aggregation and Materialized views to optimize query performance.
- Developed data pipelines with NiFi that can consume real-time data in any format from a Kafka topic and push this data into the enterprise Hadoop environment.
- Used Spark Streaming APIs to perform the necessary transformations and actions on the data received from Kafka (see the sketch below).
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
Environment: HDFS, MapReduce, Snowflake, Pig, NiFi, Hive, Kafka, Spark, PL/SQL, AWS, S3 buckets, EMR, Scala, SQL Server, Cassandra, Oozie.
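A hedged sketch of the Kafka-to-Spark consumption described above. It uses the Structured Streaming API rather than the original DStream-based jobs, and the broker address, topic, schema, and S3 paths are assumptions; the spark-sql-kafka package must be on the classpath.

# Consume a Kafka topic with Spark Structured Streaming, parse the JSON payload,
# and append the parsed events to a data lake path. All names are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("event_ts", LongType()),
    StructField("status", StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
       .option("subscribe", "tuning-events")                # placeholder topic
       .load())

events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "s3a://example-lake/tuning_events/")
         .option("checkpointLocation", "s3a://example-lake/checkpoints/tuning_events/")
         .outputMode("append")
         .start())

query.awaitTermination()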
Confidential
Big Data Engineer
Responsibilities:
- Developed a Spark Streaming model that takes transactional data as input from multiple sources, creates batches, and processes them through an already trained fraud detection model while capturing error records.
- Extensive knowledge in Data transformations, Mapping, Cleansing, Monitoring, Debugging, performance tuning and troubleshooting Hadoop clusters.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Developed DDL and DML scripts in SQL and HQL to create tables and analyze data in RDBMS and Hive.
- Used Sqoop to import and export data between HDFS and RDBMS.
- Created Hive tables and involved in data loading and writing Hive UDFs.
- Exported the analyzed data to the relational database MySQL using Sqoop for visualization and report generation.
- Loaded the flat files data using Informatica to the staging area.
- Researched and recommended suitable technology stack for Hadoop migration considering current enterprise architecture.
- Worked on ETL process to clean and load large data extracted from several websites (JSON/ CSV files) to the SQL server.
- Performed Data Profiling, Data pipelining, and Data Mining, validating, and analyzing data (Exploratory analysis / Statistical analysis) and generating reports.
- Responsible for building scalable distributed data solutions using Hadoop.
- Selected and generated data into csv files and stored them into AWS S3 by using AWS EC2 and then structured and stored in AWS Redshift.
- Collecting and aggregating large amounts of log data and staging data in HDFS for further analysis. Used Sqoop to transfer data between relational databases and Hadoop.
- Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Queried both managed and external tables created in Hive using Impala. Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
- Handled importing data from different data sources into HDFS using Sqoop and performing transformations using Hive before loading the data into HDFS.
- Analyzed data stored in S3 buckets using SQL and PySpark, stored the processed data in Redshift, and validated data sets by implementing Spark components (see the sketch below).
- Worked as an ETL and Tableau developer, heavily involved in designing, developing, and debugging ETL mappings using the Informatica Designer tool, and created advanced chart types, visualizations, and complex calculations to manipulate data using Tableau Desktop.
Environment: Spark, Hive, Python, HDFS, Sqoop, Tableau, HBase, Scala, MySQL, Impala, AWS, S3, EC2, Redshift, Informatica
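Rough sketch of the S3-to-Redshift flow described above, written as a plain JDBC load; bucket, table, and connection values are placeholders, and a production job might instead stage Parquet in S3 and issue a Redshift COPY or use a dedicated Spark-Redshift connector.

# Read raw CSV from S3, aggregate with PySpark, and append the result to Redshift
# over JDBC. All identifiers and credentials are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.appName("s3-to-redshift-sketch").getOrCreate()

orders = spark.read.csv("s3a://example-raw-bucket/orders/", header=True, inferSchema=True)

daily_totals = (orders.filter(col("status") == "COMPLETE")
                      .groupBy("order_date")
                      .agg(sum_("amount").alias("total_amount")))

(daily_totals.write
    .format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/dev")  # placeholder endpoint
    .option("dbtable", "analytics.daily_totals")
    .option("user", "etl_user")             # placeholder credentials
    .option("password", "REPLACE_ME")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")  # depends on the driver jar used
    .mode("append")
    .save())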