Azure Data Engineer Resume
Texas
SUMMARY
- Senior Data Engineer with around 8 years of experience in the analysis, design, development, integration testing, and maintenance of BI solutions.
- Strong experience and understanding of implementing large-scale data warehousing programs and end-to-end (E2E) data integration solutions using Informatica PowerCenter, Azure Data Factory (ADF), Databricks, AWS Redshift, Hadoop distributions, and Teradata.
- Involved in developing deliverables and roadmaps to advance the migration of on-premises traditional database systems to cloud data warehouses (Azure/AWS).
- Gained expertise in implementing migration strategies for on-premises traditional systems onto Azure (lift-and-shift, Azure Migrate, and other third-party tools) and worked across the Azure suite: Azure SQL Database, Azure Data Lake Storage (ADLS), Azure Data Factory (ADF) V2, Azure SQL Data Warehouse, Azure Service Bus, Azure Key Vault, Azure Analysis Services (AAS), Azure Blob Storage, Azure Search, Azure App Service, and Azure data platform services.
- Deep insights and experience with AWS services: Redshift, Redshift Spectrum, S3, Glue, Databricks, Athena, Lambda, CloudWatch, and EMR applications such as Hive and Presto.
- Proficient in Hadoop-based development of enterprise-level solutions utilizing Hadoop components such as Apache Spark, MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, Kafka, ZooKeeper, and YARN.
- Competent in building Spark applications using PySpark and Spark SQL in Azure Databricks for various ETL operations, including extraction and data transformation.
- Experienced in data manipulation using Python for loading and extraction, as well as with Python libraries such as Pandas and NumPy for data analysis and numerical computations.
- Recreated existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment.
- Implemented ad hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- Hands-on experience setting up workflows using Apache Airflow for managing and scheduling jobs.
- Experience with container-based deployments using Docker, working with Docker images and Docker registries.
- Worked with version control and CI tools such as SVN, GitHub, and Jenkins.
- Extensively worked on extracting, transforming, and loading data from various sources such as Oracle, SQL Server, Teradata, DB2, flat files, and stored procedures (PL/SQL).
- Working knowledge of data architecture principles, terminology, and data modeling techniques.
- Experience in transforming business information into conceptual, logical, and physical data models using Erwin.
- Extensive experience in developing ETL mappings to load data into OLAP/data warehouse targets. Worked with various complex transformations: Lookup, Union, Rank, Update Strategy, Transaction Control, Aggregator, SQL Transformation, Normalizer, Sequence Generator, Expression, CDC, etc.
- Hands-on experience in tuning mappings and resolving performance bottlenecks on various levels like sources, targets, mappings, and sessions.
- Strong understanding of Relational and Dimensional Data Modeling with concepts of Star Schema, Snowflake Schema, Facts, Dimension tables, Slowly Changing Dimensions, Surrogate keys.
- Experience in Oracle PL/SQL (stored procedures, triggers, and packages) and in Oracle performance tuning using SQL trace, SQL plans, SQL hints, Oracle partitioning, and various index and join types.
- Experience using Teradata utilities (MultiLoad, BTEQ, FastExport, TPT, and FastLoad) to design and develop dataflow paths for loading, transforming, and maintaining the data warehouse.
- Hands-on experience leading teams across all stages of the Software Development Life Cycle (SDLC), including business requirements, analysis, data mapping, build, unit testing, systems integration, and UAT.
- Known for integrity, an exemplary work ethic, and a strong commitment to providing innovative, strategic solutions aligned with diverse client needs and interests.
TECHNICAL SKILLS
ETL: Informatica PowerCenter, Informatica Data Quality (IDQ), Azure Data Factory, AWS Glue
DBMS: MS SQL Server, Oracle Exadata, Spark SQL, PostgreSQL
DWH: Snowflake, Teradata, Azure SQL Data Warehouse (Azure Synapse), AWS Redshift
BI Reporting: OBIEE, Tableau, Microsoft Power BI
Languages: PL/SQL, C, Data Structures, UNIX Shell Script, Scala, Python
Data Modeling: Dimensional and Relational - Star/Snowflake Schema, Fact and Dimension Tables
Hadoop Distributions: Apache Hadoop 1.x/2.x, Cloudera CDP, Hortonworks HDP
NoSQL Databases: MongoDB and Cassandra
DevOps: Azure DevOps, Jenkins, Docker, GitHub
PROFESSIONAL EXPERIENCE
Confidential, Texas
Azure Data Engineer
Responsibilities:
- Designing and developing data pipelines by integrating multiple cloud data sources using ADF and orchestrating the data processing.
- Formulating data ingestion pipelines on Azure HDInsight Spark clusters using Azure Data Factory and Spark SQL.
- Working extensively with Azure Data Factory data transformations, integration runtimes, Azure Key Vault, and triggers, and migrating Data Factory pipelines to higher environments using Azure Resource Manager (ARM) templates.
- Developing PySpark code to extract data from on-premises source systems and to transform and aggregate data from multiple file formats (a sketch appears after this section).
- Utilizing Erwin to develop logical data models from functional specifications, data requirements, and business rules provided by clients.
- Creating pipelines in ADF using linked services, datasets, and pipeline activities to extract data from various sources.
- Building data pipelines and applying data transformations for batch and real-time messaging systems using PySpark and Spark SQL, then loading the processed data into Azure Synapse.
- Creating Python scripts and UDFs using DataFrames/SQL and RDDs in Spark for data aggregation.
- Responsible for estimating cluster size and for monitoring and troubleshooting Spark Databricks clusters.
- Formulating JSON definitions for deploying ADF pipelines that process data through SQL activities.
- Recreating existing application logic and functionality in the Azure Data Lake, Data Factory, SQL Database, and SQL Data Warehouse environment.
- Writing procedures, functions, views, and triggers for data manipulation in Azure SQL DWH.
- Performance-tuning Spark applications to achieve optimal batch execution time by correcting the level of parallelism and tuning memory.
- Utilizing Azure DevOps and ARM templates to deploy data pipelines and to automate CI/CD and test-driven development pipelines.
- Knowledge of configuring, automating, and deploying Ansible for configuration management of existing infrastructure.
- Automating data pipelines using event-based and tumbling-window triggers and scheduling them for daily/weekly execution in the production environment.
Environment: Azure Databricks, Azure Data Factory, Spark SQL, WinSCP, UNIX, GitHub, Parquet files, JSON, Azure SQL, Azure Synapse.
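A minimal PySpark sketch of the extract-and-aggregate pattern described above; the mount paths, column names, and aggregation logic are hypothetical placeholders, not the actual project code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# All paths and column names below are hypothetical placeholders.
spark = SparkSession.builder.appName("adf-databricks-etl").getOrCreate()

# Extract two file formats landed in the data lake by an ADF copy activity.
orders = spark.read.parquet("/mnt/raw/orders")
customers = spark.read.json("/mnt/raw/customers")

# Join the sources and aggregate daily order totals per customer.
daily_totals = (
    orders.join(customers, "customer_id")
          .groupBy("customer_id", F.to_date("order_ts").alias("order_date"))
          .agg(F.sum("amount").alias("total_amount"))
)

# Land the curated output for downstream loading into Azure Synapse.
daily_totals.write.mode("overwrite").parquet("/mnt/curated/daily_totals")
```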
Confidential, Texas
AWS Data Engineer
Responsibilities:
- Designed and set up an enterprise data lake to support various use cases, including analytics, processing, storage, and reporting of voluminous, rapidly changing data.
- Responsible for maintaining quality reference data in source systems by performing operations such as cleansing and transformation, and for ensuring integrity in a relational environment by working closely with stakeholders and the solution architect.
- Designed and developed Security Framework to provide fine grained access to objects in AWS S3 using AWS Lambda, DynamoDB.
- Set up and worked on Kerberos authentication principals to establish secure network communication on the cluster, and tested HDFS, Hive, Pig, and MapReduce cluster access for new users.
- Scheduled, managed, and automated Spark jobs in the production environment using the Apache Oozie workflow engine.
- Responsible for building pipelines that load data from web servers using Kafka and the Spark Streaming API.
- Performed end-to-end architecture and implementation assessments of various AWS services, including Amazon EMR, Redshift, and S3.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
- Imported data from sources such as HDFS/HBase into Spark RDDs, performed computations using PySpark to generate output responses, and configured Oozie workflows to generate analytical reports.
- Created Lambda functions with Boto3 to deregister unused AMIs in all application regions, reducing EC2 resource costs (a sketch appears after this section).
- Developed reusable framework to be leveraged for future migrations that automates ETL from RDBMS systems to the Data Lake utilizing Spark Data Sources and Hive data objects.
- Conducted data blending and preparation using Alteryx and SQL for Tableau consumption, and published data sources to Tableau Server.
Environment: AWS EMR, S3, RDS, Redshift, Lambda, Boto3, DynamoDB, Amazon SageMaker, Apache Spark, HBase, Hive, Sqoop, Python, Snowflake, Tableau.
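A hedged Boto3 sketch of the AMI-cleanup Lambda described above; the definition of "unused" (not referenced by any instance) and the single-region, unpaginated scope are simplifying assumptions:

```python
import boto3

def handler(event, context):
    """Deregister self-owned AMIs not referenced by any instance (sketch)."""
    ec2 = boto3.client("ec2")

    # Collect AMI IDs still referenced by instances in this region.
    # (A production version would paginate and cover every application region.)
    in_use = set()
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            in_use.add(instance["ImageId"])

    # Deregister self-owned images that no instance references.
    for image in ec2.describe_images(Owners=["self"])["Images"]:
        if image["ImageId"] not in in_use:
            ec2.deregister_image(ImageId=image["ImageId"])
```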
Confidential, San Francisco, CA
Data Engineer
Responsibilities:
- Designed and implemented migration strategies to move traditional on-premises databases onto the AWS cloud.
- Formulated and developed scalable solutions for storing and processing large amounts of data across multiple regions.
- Developed and implemented data pipelines using AWS services such as Kinesis, S3, EMR, Athena, Glue, and Redshift to process petabyte-scale data.
- Involved in building logical and physical data models and defining roles and required privileges for Redshift DB objects.
- Configured and scheduled jobs using Airflow scripts written in Python, adding tasks to DAGs with appropriate dependencies between them (a sketch appears after this section).
- Validated data migrated from SQL Server to AWS Redshift during the historical data migration.
- Analyzed business requirements and translated them into technical specifications that developers could use to implement new features or enhancements.
- Participated in cross-functional teams (e.g., infrastructure engineering) when required to ensure effective communication between groups with overlapping functionality or shared resources.
- Implemented a data warehouse using Redshift to store and analyze terabytes of raw data.
- Created custom dashboards for real-time monitoring of key business metrics.
- Conducted data analysis to support business decision-making by extracting, cleansing, and manipulating data from various sources.
- Created data visualizations to communicate complex data sets in an easily understandable format for business users.
- Developed and maintained reporting dashboards to track KPIs and other business metrics.
Environment: AWS EMR, S3, RDS, Redshift, Apache Airflow, AWS Glue, AWS Athena, AWS Kinesis, Tableau
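A minimal Airflow sketch of the DAG-and-dependency setup described above; the DAG ID, schedule, and task commands are hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# DAG ID, schedule, and task commands are hypothetical placeholders.
with DAG(
    dag_id="daily_redshift_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract.py")
    load = BashOperator(task_id="load", bash_command="python load_redshift.py")

    # The load task runs only after extraction succeeds.
    extract >> load
```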
Confidential, Irvine, CA
Big Data Engineer
Responsibilities:
- Collaborated with Business Analysts and product owners to understand requirements and build scalable distributed data solutions using the Hadoop ecosystem.
- Used Sqoop to load data from various RDBMS sources (Teradata) into Hadoop (Hive) and vice versa.
- Developed Python scripts to extract data from web server output files and load it into HDFS.
- Used Flume to collect, aggregate, and store web log data from sources such as web servers, pushing it to HDFS.
- Developed Spark Streaming programs to process near real-time data from Kafka, using both stateless and stateful transformations.
- Worked with the Hive data warehouse infrastructure: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HQL queries.
- Developed a Python script to load CSV files into S3 buckets and performed folder, log, and object management within each bucket.
- Developed data governance strategy and controls to ensure consistency of data between various systems and components.
- Created workflows in Airflow to automate the tasks of loading data into HDFS and pre-processing it with Pig and Hive.
- Supported file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
- Built and implemented automated procedures to split large files into smaller batches of data to facilitate FTP transfer, reducing execution time by 60%.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS (a sketch appears after this section).
- Developed Spark scripts by writing custom RDD transformations in Scala and performing actions on RDDs.
- Wrote multiple MapReduce jobs using the Java API, Pig, and Hive for data extraction, transformation, and aggregation from multiple file formats, including Parquet, Avro, XML, JSON, CSV, and ORC, and compressed formats such as gzip, Snappy, and LZO.
- Applied a strong understanding of partitioning and bucketing concepts in Hive, designing both managed and external tables to optimize performance.
- Developed Pig UDFs to manipulate data according to business requirements and worked on developing custom Pig loaders.
- Worked on cluster installation, commissioning and decommissioning of data nodes, name node recovery, capacity planning, and slot configuration.
- Developed data pipeline programs with Spark APIs, data aggregations with Hive, and data formatting (JSON) for visualization.
Environment: AWS, Cassandra, PySpark, Apache Spark, HBase, Apache Kafka, Hive, Sqoop, Flume, Apache Airflow, ZooKeeper, ETL, UDF, MapReduce, Apache Pig, Python
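A minimal sketch of the Kafka-to-Parquet flow described above, written against the Structured Streaming API (a close modern equivalent of the RDD-based DStream flow); the broker address, topic name, and HDFS paths are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Broker address, topic name, and HDFS paths are hypothetical placeholders.
# Requires the spark-sql-kafka connector package on the cluster.
spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()

# Read the real-time feed from Kafka as a streaming DataFrame.
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "weblogs")
         .load()
         .select(F.col("value").cast("string").alias("line"))
)

# Persist the processed stream to HDFS in Parquet format.
query = (
    events.writeStream.format("parquet")
          .option("path", "hdfs:///data/weblogs/parquet")
          .option("checkpointLocation", "hdfs:///checkpoints/weblogs")
          .start()
)
query.awaitTermination()
```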
Confidential, Kansas City, MO
ETL Developer
Responsibilities:
- Requirements gathering: involved in all discussions with the client during the requirements phase.
- Analysis and architecture: analyzed the existing legacy source systems and defined the solution architecture for the given requirements.
- Engaged with clients and project teams for data architecture.
- Identified the integration points between the new and existing source systems and designed the new source system to integrate with the existing ones.
- Experience in enterprise data modeling standards and tools such as Erwin.
- Worked on 175+ complex business rules, with extensive testing, to deliver a solution within budget and stringent timelines.
- Designed and built 50+ technical components comprising complex ETL, database objects, and batch scheduling components.
- Built ETL mappings to load data from four source systems (Oracle, Hive, SQL Server, flat files) into the global data warehouse (Teradata), incorporating business rules using the objects and functions the tool supports.
- Involved in building the ETL architecture/framework for efficient processing of source data (flat files, XML, and SQL Server) into ODS/staging (Oracle) and finally into the Teradata data warehouse.
- Created ETL mappings to read data from Hive and load it into the staging area within the data mart.
- Developed a Python script to download files from an external URL and place them in a shared location used in data processing (a sketch appears after this section).
- Developed PL/SQL packages, Triggers, Stored Procedures and Views as per business requirement for data processing.
- Devised Teradata scripts to load data into work tables using the FastLoad, MultiLoad, and BTEQ utilities.
- Performed application-level DBA activities such as collecting statistics, monitoring spool space, creating tables and indexes, and monitoring and tuning Teradata BTEQ scripts.
- Created Semantic views and Teradata objects like Databases, Users, Profiles, Roles, Tables, Views, and Macros.
- Worked on Teradata Parallel Transporter (TPT) to load data from databases and files into the Teradata warehouse.
- Implemented partitioning and parallel processing in Informatica, and database partitioning and hints on tables across the various hops between the operational data store and the data warehouse, to improve system performance.
- Prepared test data for unit testing and data validation tests to confirm the transformation logic.
- Created and scheduled sessions and batch processes using CA Workload Automation AutoSys and UNIX shell scripts.
- Involved in 24x7 ETL production support, BCP failover exercises, maintenance, troubleshooting, problem fixing, and ongoing enhancements to the data warehouse.
Environment: Informatica PowerCenter, Informatica Data Quality (IDQ), Oracle Exadata, WinSCP, UNIX, MS Office, Autosys, SQL Server, Teradata, Python
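A minimal sketch of the download-to-shared-location script described above; the URL and share path are hypothetical placeholders:

```python
import urllib.request
from pathlib import Path

# URL and shared-directory path are hypothetical placeholders.
SOURCE_URL = "https://example.com/exports/daily_feed.csv"
SHARED_DIR = Path("/mnt/shared/incoming")

def download_feed() -> Path:
    """Download the external file and place it in the shared location."""
    SHARED_DIR.mkdir(parents=True, exist_ok=True)
    target = SHARED_DIR / SOURCE_URL.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(SOURCE_URL, target)
    return target

if __name__ == "__main__":
    print(f"Downloaded to {download_feed()}")
```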
Confidential
ETL Engineer
Responsibilities:
- Designed and built 100+ technical components comprising complex ETL, database objects, and batch scheduling components.
- Extracted data from flat files, applied business logic, and loaded the data into a SQL Server database.
- Effectively utilized transformations such as Normalizer, Filter, Router, Expression, Aggregator, Joiner, Lookup (connected and unconnected), Union, and Update Strategy.
- Designed and maintained logical and physical database models with Erwin.
- Created multiple ETL design patterns that enabled code reuse (saving approximately 80 hours of development time), increased reliability and scalability, and eased support transition for database, queue, XML, and file sources.
- Implemented complex ETLs with optimal performance to load the data warehouse, incorporating CDC, SCD Types 1 and 2, error handling, and exception data processing.
- Created two reusable mapplets for use in ETLs with conformed dimensions, saving approximately 40% of the man-hours behind each ETL estimate.
- Implemented stored procedures to transform the data and worked extensively in T-SQL on the various transformations needed while loading the data.
- Created database objects (procedures, functions, triggers, indexes, and views) using T-SQL in the development environment for SQL Server 2008 R2.
- Applied performance techniques such as pushdown optimization (PDO) and partitioning, and best practices for each transformation, to avoid performance issues.
- Worked on causal analysis, resolution, and performance tuning to ensure quality code for technical components during the system test and quality assurance phases.
- Used Tidal extensively to schedule all the newly built interfaces for production, based on the impact analysis on the existing system.
Environment: Informatica PowerCenter, MS SQL Server, WinSCP, HP Quality Center, UNIX, MS Office, Tidal, SVN
