- Around 6 years of experience in Big Data technologies across domains such as Insurance and Finance.
- 2+ years of experience as an Azure Cloud Data Engineer working with Microsoft Azure technologies including Azure Data Factory (ADF), Azure Data Lake Storage (ADLS), Azure Synapse Analytics (SQL Data Warehouse), Azure SQL Database, Azure Analysis Services, PolyBase, Azure Cosmos DB (NoSQL), Azure Key Vault, Azure DevOps, Azure HDInsight Big Data technologies such as Hadoop and Apache Spark, and Azure Databricks.
- Big Data: Hadoop (MapReduce and Hive), Spark (SQL, Streaming), Azure Cosmos DB, SQL Data Warehouse, Azure DMS, Azure Data Factory, AWS Redshift, Athena, Lambda, Step Functions, and SQL.
- Strong knowledge of the Spark ecosystem, including Spark Core, Spark SQL, and Spark Streaming.
- Very good experience working with Azure Cloud, Azure DevOps, Azure Data Factory, Azure Data Lake Storage, Azure Synapse Analytics, Azure Analysis Services, Azure Cosmos DB (NoSQL), Azure HDInsight Big Data technologies (Hadoop and Apache Spark), and Databricks.
- Experience in designing Azure Cloud architecture and implementation plans for hosting complex application workloads on MS Azure.
- Experience reading continuous JSON data from different source systems through Kafka into Databricks Delta, processing it with Apache Spark Structured Streaming and PySpark, and writing the output as Parquet files.
- Created manual test cases to verify that each deliverable meets the user's requirements.
- Good knowledge of Apache Hadoop ecosystem components such as Spark, Cassandra, HDFS, Hive, Sqoop, and Airflow.
- Experienced in working with different data formats such as CSV, JSON, and Parquet.
- Strong in data warehousing concepts, star schema and snowflake schema methodologies, and understanding business processes and requirements.
- Expert in building hierarchical and analytical SQL queries that support reporting.
- Expert in implementing business rules by creating reusable transformations such as mapplets and mappings.
- Expert in using the debugger in the Informatica Designer tool to test and fix errors in mappings.
- Supported ad hoc reporting and analytics requests with an eye toward creating scalable self-service or automated solutions
- Developed and worked on Machine Learning algorithms for predictive modelling
- Architected complete, scalable data pipelines and data warehouse for optimized data ingestion
- Collaborated with data scientists and architects on several projects to create data marts as per requirements
- Conducted complex data analysis and reported on results
- Constructed data staging layers and fast real-time systems to feed BI applications and machine learning algorithms
- Understanding of AWS and Azure web services, with hands-on project experience. Knowledge of the software development life cycle, agile methodologies, and test-driven development.
- Developed scalable and reliable data solutions to move data across systems from multiple sources in real time (Kafka) as well as in batch mode (Sqoop)
- Built an enterprise Spark ingestion framework to ingest data from different sources (S3, Salesforce, Excel, SFTP, FTP, and JDBC databases); it is 100% metadata-driven with 100% code reuse, which lets junior developers concentrate on core business logic rather than Spark/Scala coding
Hadoop Ecosystem: Spark, Hive, Sqoop, Oozie, Pig
Azure Cloud Platform: ADFv2, Blob Storage, ADLS, Azure SQL DB, SQL Server, Azure Synapse, Azure Analysis Services, Databricks, Mapping Data Flow (MDF), Azure Data Lake (Gen1/Gen2), Azure Cosmos DB, Azure Stream Analytics, Azure Event Hub, Azure Machine Learning, App Services, Logic Apps, Event Grid, Service Bus, Azure DevOps, Git Repository Management, ARM Templates
Programming Languages: Python, Scala, R, C, C++, Java, Shell Scripting
Databases and Query Languages: Azure SQL Data Warehouse, Azure SQL DB, Azure Cosmos DB (NoSQL), Teradata, Vertica, RDBMS, MySQL, Oracle, PostgreSQL, Microsoft SQL Server
Streaming Frameworks: Kinesis, Kafka, Flume
Tools: RStudio, PyCharm, Jupyter Notebook, IntelliJ, Eclipse, NetBeans
Platforms: Linux, Windows and OS X
Confidential - Framingham, MA
- Used Azure Data Factory extensively for ingesting data from disparate source systems.
- Used Azure Data Factory as an orchestration tool for integrating data from upstream to downstream systems.
- Automated jobs using different trigger types (event-based, schedule, and tumbling window) in ADF.
- Used Cosmos DB for storing catalog data and for event sourcing in order processing pipelines.
- Designed and developed user-defined functions, stored procedures, and triggers for Cosmos DB.
- Analyzed the data flow from different sources to targets to provide the corresponding design architecture in the Azure environment.
- Took initiative and ownership to deliver business solutions on time.
- Created high-level technical design documents and application design documents per the requirements, delivering clear, well-communicated, and complete documentation.
- Created DA specs and mapping data flows and provided the details to developers along with HLDs.
- Created build and release definitions for continuous integration and continuous deployment (CI/CD).
- Created an Application Interface Document for downstream systems to create a new interface for transferring and receiving files through Azure Data Share.
- Created pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.
- Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to run streaming analytics in Databricks.
- Created and provisioned the different Databricks clusters needed for batch and continuous streaming data processing and installed the required libraries on the clusters.
- Integrated Azure Active Directory authentication into every Cosmos DB request and demoed the feature to stakeholders.
- Improved performance by reducing the compute time needed to process streaming data, and saved the company cost by optimizing cluster run time.
- Performed ongoing monitoring, automation, and refinement of data engineering solutions; prepared complex SQL views and stored procedures in Azure SQL DW and Hyperscale.
- Designed and developed a new solution to process near-real-time (NRT) data using Azure Stream Analytics, Azure Event Hub, and Service Bus Queue.
- Created a linked service to land data from an SFTP location in Azure Data Lake.
- Created numerous pipelines in Azure Data Factory v2 to get data from disparate source systems using activities such as Move & Transform, Copy, Filter, ForEach, and Databricks.
- Created several Databricks Spark jobs with PySpark to perform table-to-table operations.
- Extensively used SQL Server Import and Export Data tool.
- Set up database users, logins, and permissions.
- Worked with complex SQL, stored procedures, triggers, and packages in large databases across various servers.
- Helped team members resolve technical issues; handled troubleshooting and project risk and issue identification and management.
- Addressed resource issues and held monthly one-on-ones and weekly meetings.
Environment: Azure Cloud, Azure Data Factory (ADF v2), Azure Function Apps, Azure Data Lake, Blob Storage, SQL Server, Teradata Utilities, Windows Remote Desktop, UNIX Shell Scripting, Azure PowerShell, Databricks, Python, Erwin Data Modeling Tool, Azure Cosmos DB, Azure Stream Analytics, Azure Event Hub, Azure Machine Learning.
- Used custom-developed PySpark scripts to pre-process and transform data and map it to tables inside the CIF (Inmon Corporate Information Factory) data warehouse
- Developed shell scripts for Sqoop jobs to load periodic incremental imports of structured data from various RDBMSs to S3, and used Kafka to ingest real-time website traffic data into HDFS
- As part of reverse engineering, discussed issues and complex code to be resolved, translated them into Informatica logic, and prepared ETL design documents.
- Worked with the team and lead developers, interfaced with business analysts, coordinated with management, and gained an understanding of the end-user experience
- Used Informatica Designer to create complex mappings using different transformations to move data to a Data Warehouse.
- Developed mappings in Informatica to load data from various sources into the Data Warehouse using transformations such as Source Qualifier, Expression, Lookup, Aggregator, Update Strategy, and Joiner.
- Optimized the performance of the mappings by running various tests on sources, targets, and transformations.
- Scheduled sessions to extract, transform, and load data into the warehouse database according to business requirements using a scheduling tool.
- Extracted data (flat files, mainframe files), transformed it, and loaded it into the landing area, then into the staging area, followed by the integration and semantic layers of the Data Warehouse (Teradata), using Informatica mappings and complex transformations (Aggregator, Joiner, Lookup, Update Strategy, Source Qualifier, Filter, Router, and Expression).
- Optimized the existing ETL pipelines by tuning SQL queries and data partitioning techniques
- Created independent data marts from the existing data warehouse as per the application requirements and updated them on a bi-weekly basis
- Decreased the Azure billing by pivoting from Redshift storage to Hive tables for unpaid services, and implemented techniques such as partitioning and bucketing on Hive tables to improve query performance
- Used the Presto distributed query engine over Hive tables for its high performance and low cost
- Automated and validated data pipelines using Apache Airflow
Environment: Sqoop, Informatica, Amazon EMR/Redshift, Presto, Apache Airflow, Hive