Project- and results-oriented IT professional with years of experience in the analysis, development, and deployment of data-centric solutions for industries including retail, banking, telecom, insurance, energy, and government, involving large-scale data warehouses and real-time analytics in both on-premises and multi-cloud environments.
Big Data/Hadoop: HDFS, YARN, MapReduce, Hive, Spark, Impala, Sqoop, Flume, Cloudera, Hortonworks HDP, Ambari, Ranger
Programming: PySpark, SQL, PL/SQL, HiveQL, Python
Schedulers: Airflow, Autosys, Tidal
Methodology: SDLC, Waterfall, Agile, JIRA
SQL/NoSQL: Hive, Spark SQL, Oracle; File Formats: ORC, Parquet
Cloud: AWS, Azure, Databricks, S3, VPC, EC2, Glue, SQS, SNS, Lambda, ADLS, ADF
Databases: AWS Redshift, RDS, DynamoDB, Databricks Delta, Azure SQL
Tools: PyCharm, VS Code, Zeppelin, Jupyter, Subversion, Git, Docker
- As part of the data practice, responsible for assessing applications, identifying data lineage and requirements, and defining the strategy, architecture, implementation plan, and delivery of data-centric applications, establishing the long-term strategy and short-term scope for multi-phased Big Data applications and cloud data warehouses (DWaaS).
- Conducted architectural design reviews on AWS; analyzed and reviewed business, functional, and high-level technical requirements.
- Designed and deployed scalable, highly available, and fault-tolerant systems on AWS.
- Designed and developed serverless analytics pipelines on AWS using AWS Lambda, S3, AWS Glue, PySpark, and Athena.
- Worked on data lake architecture and data models on Azure Databricks/Databricks Delta using Azure Data Factory, Azure Data Lake Store, and Azure SQL.
- Worked on the data architecture team designing data solutions and modeling systems for data analytics and machine learning applications.
- Worked on cloud data POCs on Azure and AWS platforms for multiple high-profile clients and conducted demos.
- Created, assigned, and managed JIRA board tickets with low-level implementation details and wrote technical design documents.
- As part of corporate ETL process modernization, contributed to the analysis, design, and development of a new solution integrating Python, Spark, Hive, Parquet, JSON, Airflow, YAML, Oracle, Cloudera, and AWS to deliver high-throughput data transformation from a wide variety of applications for advertisement outreach and revenue optimization.
- Worked with internal business and technology staff to accurately capture requirements and specifications for the design of database models serving reporting needs.
- Analyzed business rules, business logic, and use cases to be implemented, and architected a unified code base that is extensible and scalable.
- Created ETL pipelines using Airflow, AWS Glue, PySpark, and AWS Lambda.
- Developed Hive/PySpark modules for viewership analytics.
- Developed data pipelines (ingest, clean, munge, transform) for feature extraction in support of predictive analytics and ad revenue optimization.
Solutions Consultant (Big Data)
- Worked on the design, architecture, and implementation of big data pipelines and HDFS ingestion from various sources for efficient processing, supporting real-time queries and analysis for DSS.
- Development of Big Data projects using Hadoop, HDFS and data lakes.
- Designed, built, and supported cloud and open-source systems to process large amounts of data; imported and exported data between HDFS and RDBMS using Sqoop/Hive to implement transactional support and incremental loads.
- Created RDDs and DataFrames from the input data and performed transformations using PySpark to ingest, catalog, store, and analyze new datasets into final analytics.
- Used Spark APIs to perform the necessary transformations and actions on the fly, building data pipelines that ingest data in batch and real time in a Lambda architecture.
- Performed Big Data analytics, ETL, data analysis, and visualization on Hortonworks and Azure platforms.
- Developed and applied experimental design approaches to validate findings and test hypotheses, investigating and proposing new analytic capabilities.
- Collaborated with Solutions Architects to define the best strategy to load data to, extract data from, and interact with Hadoop (CDH distribution) for the client reporting solution, in accordance with the defined Enterprise Data Pattern.
- Implemented and modified scripts to load and extract from Hadoop using Sqoop, Hive and Impala for required client reporting and ad-hoc query solutions.
- Worked on Spark to explore, cleanse, and analyze data by manipulating RDDs and DataFrames, using Spark SQL to provide a SQL interface and interoperability with Hive and other data sources.
- Performed data cleansing, profiling, and transformations on flat-file data to store files in Parquet format.