- Over 15 years of experience in Data Warehousing, Analytics, and ETL processes across business domains such as retail, manufacturing, insurance, and banking.
- Proficient in the Apache Hadoop ecosystem: YARN, Spark, Pig, Hive, Flume, Sqoop, HBase, ZooKeeper, and Impala, with a strong understanding of HDFS and MapReduce architecture on Cloudera and Hortonworks distributions.
- Strong Data Warehousing ETL experience using Informatica Power Center 9.x/8.x/7.x.
- Experience in configuring and using AWS cloud components to push, pull, and process data across different cloud storage services.
- Strong knowledge of ER modeling and dimensional data modeling methodologies such as star schema and snowflake schema.
Big Data Ecosystems: MapReduce, HBase, Pig, Hive, Sqoop, Spark, YARN, Storm, Flume, Kafka, Oozie, ZooKeeper, EC2, EMR, S3, Kinesis, CloudWatch.
Programming Languages: Java, C, SQL, Scala, Pig Latin, HiveQL, Shell Scripting, Python.
Database and Tools: MySQL, SQLite, Oracle, Teradata, MS SQL, MongoDB, Cassandra, DBeaver, DataStax DevCenter, SQL Developer, MySQL Workbench.
ETL Tools: Informatica Power Center 7.x/8.x/9.x, Informatica Big Data Edition, SSIS, DTS.
Scheduling Tools: Control-M, AutoSys, IBM TWS, Apache Airflow.
Visualization/Reporting: Tableau, Pentaho.
Dev and Build Tools: Maven, Ant, Eclipse, Scala IDE, Jira, Bitbucket, Git, Jenkins, Docker.
Methodologies and Tools: Waterfall, Agile (Scrum and Kanban), MS Project.
Confidential, Dallas, TX
Technology Lead - Hadoop
- Worked on AA to store and join customer-centric data such as clickstream, sales, and email campaigns, generating UCIDs and personalization outputs consumed by CXP through APIs (see the join sketch after this section).
- Developed data models for personalization and product recommendations using Storm, Kafka, Hive, and Pig, providing insights into the percentage of sales penetration by associates and stores.
- Developed scripts to migrate enterprise data from on-premises infrastructure to the AWS cloud.
Environment: Hadoop 2.6 (CDH 5.13, 40-node), AWS cloud, Hive 1.2.1, Storm, Cassandra, Solr, CouchDB, EC2, S3, Airflow.
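A minimal sketch of the kind of Spark join that unifies such feeds into a UCID, assuming Spark 2.x with Hive support; the table names, join key, and SHA-256 keying scheme are illustrative assumptions rather than the production design:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object UcidJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ucid-join").enableHiveSupport().getOrCreate()

    // Hypothetical Hive tables holding the three customer-centric feeds.
    val clicks = spark.table("clickstream_events")
    val sales  = spark.table("pos_sales")
    val emails = spark.table("email_campaign_responses")

    // Join the feeds on a shared customer key; hashing the stable key keeps
    // the generated UCID deterministic across reruns.
    val unified = clicks
      .join(sales,  Seq("customer_key"), "outer")
      .join(emails, Seq("customer_key"), "outer")
      .withColumn("ucid", sha2(col("customer_key").cast("string"), 256))

    // Persist for the downstream personalization APIs to consume.
    unified.write.mode("overwrite").saveAsTable("customer_360.ucid_profiles")
    spark.stop()
  }
}
```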
Confidential, Philadelphia, PA
- Built a Hadoop and Informatica based ETL and analytics system providing insights into customers' usage of Confidential products across different product lines, driving future enhancements and improvements in business and services.
- Developed a data pipeline using Spark, Kafka, Hive, Pig, and HBase to ingest customer system-usage data and financial histories into the Hadoop cluster for analysis.
- Developed Scala scripts and UDFs using both DataFrames/Spark SQL and RDDs/MapReduce in Spark for data aggregation, writing data back into S3 through Sqoop (see the aggregation sketch after this section).
- Extensively used Informatica to create data ingestion jobs into HDFS using complex data file objects such as Avro and Parquet, and to evaluate dynamic mapping capabilities.
- Implemented data quality rules using Informatica Data Quality (IDQ) to check the correctness of source files and perform data cleansing and enrichment.
- Analyzed daily log record data and produced hourly and daily aggregate reports using Tableau.
Environment: Hadoop 2.7, Informatica 9.x, Hive 1.2.1, Spark 1.6, Teradata, Oracle, EC2, S3.
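A sketch of the DataFrame/UDF aggregation pattern described above, written against the Spark 2.x API for brevity (the Spark 1.6 code in this role would have used HiveContext); table names, columns, and the S3 path are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object UsageAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("usage-aggregation").enableHiveSupport().getOrCreate()
    import spark.implicits._

    // Hypothetical Hive table of raw usage events ingested from Kafka.
    val usage = spark.table("raw.customer_usage")

    // Small UDF normalizing free-form product-line labels before grouping.
    val normalize = udf((s: String) => if (s == null) "unknown" else s.trim.toLowerCase)

    val daily = usage
      .withColumn("product_line", normalize($"product_line"))
      .groupBy($"customer_id", $"product_line", to_date($"event_ts").as("event_date"))
      .agg(count("*").as("events"), sum($"duration_sec").as("total_duration_sec"))

    // Write the aggregates as Parquet; the s3a path assumes AWS credentials
    // and the s3a connector are configured on the cluster.
    daily.write.mode("overwrite").parquet("s3a://example-bucket/usage/daily/")
    spark.stop()
  }
}
```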
Confidential, San Jose, CA
- Worked with highly unstructured and semi-structured data sets of 100+ TB in size.
- Developed Pig and Hive scripts for ad hoc analysis to meet the requirements of end users, analysts, and product managers.
- Used Informatica to validate and test the business logic implemented in the mappings and fix bugs; developed reusable Mapplets and Transformations.
- Managed external tables in Hive, loaded through Sqoop jobs, for optimized performance.
- Solved performance issues in Hive and Pig scripts by understanding how joins, grouping, and aggregation translate into MapReduce jobs.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN (see the sketch after this section).
- Worked in a Kerberos-secured Hadoop environment supported by the Cloudera team.
Environment: 32-node Hadoop 2.6 cluster, Informatica 9.x, HDFS, Flume 1.5, Sqoop 1.4.3, Hive 1.0.1, Spark 1.4, HBase, XML, JSON, Teradata, Oracle, MongoDB, Cassandra.
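One concrete instance of that pair-RDD tuning: replacing a groupByKey-style aggregation with reduceByKey so values are combined map-side before the shuffle. The input path and record layout are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairRddTuning {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pair-rdd-tuning"))

    // Hypothetical input: tab-separated (store_id, sale_amount) records.
    val sales = sc.textFile("hdfs:///data/sales/*.tsv")
      .map(_.split("\t"))
      .collect { case Array(store, amount) => (store, amount.toDouble) }

    // reduceByKey aggregates within each partition before shuffling,
    // which is the usual fix for slow groupByKey-based totals.
    val totals = sales.reduceByKey(_ + _)

    totals.saveAsTextFile("hdfs:///out/store_totals")
    sc.stop()
  }
}
```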
- Migrated 100+ TB of data from different databases (e.g., Oracle, SQL Server) to Hadoop.
- Wrote code for different applications across the Hadoop and Informatica ecosystems.
- Extensively involved in performance tuning of Informatica ETL mappings using caches, SQL query overrides, and parameter files.
- Worked with various file formats (Avro, Parquet, Text) and Hive SerDes, using Snappy compression.
- Used custom Pig loaders to load different data file formats such as XML, JSON, and CSV.
- Designed a dynamic partitioning mechanism in Hive for optimal query performance, reducing report generation time to meet SLA requirements (see the sketch after this section).
Environment: Hadoop 2.2, Informatica Power Center 9.x, HDFS, HBase, Flume 1.4, Sqoop 1.4.3, Hive 0.13.1, Avro 1.7.4, Parquet 1.4, XML, JSON, Oracle 11g, Amazon EC2, S3.
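A minimal sketch of the dynamic-partitioning pattern referenced above, issued through Spark SQL with Hive support so it stays in Scala like the other examples; the database and table names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object DynamicPartitionLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dynamic-partition-load").enableHiveSupport().getOrCreate()

    // Let Hive derive partition values from the data itself.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // A report table partitioned by load date: queries filtering on
    // load_date touch only the matching partitions, which is what keeps
    // report generation inside the SLA window.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS reports.daily_metrics
        |(metric STRING, value DOUBLE)
        |PARTITIONED BY (load_date STRING)
        |STORED AS PARQUET""".stripMargin)

    // The dynamic partition column must come last in the SELECT list.
    spark.sql(
      """INSERT OVERWRITE TABLE reports.daily_metrics PARTITION (load_date)
        |SELECT metric, value, load_date FROM staging.daily_metrics_raw""".stripMargin)

    spark.stop()
  }
}
```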
- Developed mappings and sessions in Informatica Power Center to import, transform, and load data into target tables and flat files.
- Automated Informatica ETL jobs for different ETL design patterns.
- Extensively used transformations such as Router, Aggregator, Source Qualifier, Joiner, Expression, and Sequence Generator through Source Analyzer, Warehouse Designer, Mapping Designer, Mapplet Designer, and Transformation Developer.
Environment: Informatica Power Center 9.x (Repository Manager, Designer, Workflow Manager, and Workflow Monitor), Oracle 11g, SeaQuest, HPDM, SQL Server, Teradata, Toad, Control-M.
- Extensively used the Slowly Changing Dimension technique for updating dimensional schemas (see the Type 2 sketch after this section).
- Processed data using various transformations such as Aggregator, Router, Expression, Source Qualifier, Filter, Lookup, Joiner, Sorter, XML Source Qualifier, and Web Services Consumer for WSDL sources.
- Used Informatica user-defined functions to reduce code dependencies.
Environment: Informatica Power Center 8.x, Informatica Power Connect, Power Exchange, Power Analyzer, Toad, Erwin, Oracle 11g/10g, Teradata V2R5, PL/SQL, ODI, Trillium 11.
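The SCD work above was done in Informatica mappings; purely to illustrate the Type 2 logic, here is a minimal Spark SQL sketch that expires changed rows and appends fresh current versions for a single tracked attribute (address). Table and column names are hypothetical, and inserts for brand-new customers are omitted for brevity:

```scala
import org.apache.spark.sql.SparkSession

object ScdType2Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("scd-type2").enableHiveSupport().getOrCreate()

    // Expire the current version of customers whose address changed,
    // then append a new current row for each of them.
    val snapshot = spark.sql(
      """SELECT d.customer_id, d.address, d.eff_date,
        |       CASE WHEN d.is_current AND s.address IS NOT NULL AND s.address <> d.address
        |            THEN current_date() ELSE d.end_date END AS end_date,
        |       CASE WHEN d.is_current AND s.address IS NOT NULL AND s.address <> d.address
        |            THEN false ELSE d.is_current END AS is_current
        |FROM dw.customer_dim d
        |LEFT JOIN staging.customer_updates s ON d.customer_id = s.customer_id
        |UNION ALL
        |SELECT s.customer_id, s.address, current_date() AS eff_date,
        |       CAST(NULL AS DATE) AS end_date, true AS is_current
        |FROM staging.customer_updates s
        |JOIN dw.customer_dim d
        |  ON s.customer_id = d.customer_id AND d.is_current AND s.address <> d.address""".stripMargin)

    // Writing to a side table avoids reading and overwriting
    // dw.customer_dim in the same statement.
    snapshot.write.mode("overwrite").saveAsTable("dw.customer_dim_next")
    spark.stop()
  }
}
```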
- Used SSIS, SQL Server's Extract, Transform, Load (ETL) tool, to populate data from various sources, creating packages for different data-loading operations.
- Made extensive use of Transact-SQL stored procedures and trigger scripts for creating database objects.
- Generated various reports using features such as grouping, drill-downs, drill-through, sub-reports, and parameterized reports.
- Deployed new strategies for checksum calculation and exception population using mapplets and Normalizer transformations (see the checksum sketch after this section).
Environment: SQL Server 2005, T-SQL, SSIS/DTS Designer and Reporting tools, Control-M.
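The checksum strategy above was built with Informatica mapplets; the idea itself is small enough to sketch. Below is a hypothetical Spark version (kept in Scala for consistency with the earlier examples) that derives a row-level MD5 checksum over all source columns so changed rows can be detected cheaply:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object RowChecksum {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-checksum").getOrCreate()

    // Hypothetical extract; concatenating the columns with a delimiter
    // and hashing them gives a stable per-row fingerprint.
    val src = spark.read.option("header", "true").csv("/data/extract/orders.csv")
    val withSum = src.withColumn("row_checksum", md5(concat_ws("|", src.columns.map(col): _*)))

    withSum.write.mode("overwrite").parquet("/data/curated/orders_checksummed")
    spark.stop()
  }
}
```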
- Developed web applications using the Spring MVC framework, including writing actions, classes, forms, custom tag libraries, and JSP pages.
- Worked on integrating the Spring and Hibernate frameworks using the Spring ORM module.
- Implemented caching techniques, wrote POJO classes for storing data and DAOs for retrieving it, and performed database configuration (see the DAO sketch after this section).
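The original implementation used Java POJOs and Hibernate-backed DAOs; as an illustration of the DAO-plus-cache shape only (written in Scala for consistency with the earlier sketches, with hypothetical types), consider:

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical domain object, the analogue of the POJOs described above.
case class Customer(id: Long, name: String)

// The DAO contract: retrieval is separated from storage details.
trait CustomerDao {
  def findById(id: Long): Option[Customer]
}

// A caching decorator over any underlying DAO (e.g. a Hibernate-backed one):
// results are memoized in a thread-safe map, mirroring the caching layer
// mentioned in the bullet above.
class CachingCustomerDao(underlying: CustomerDao) extends CustomerDao {
  private val cache = TrieMap.empty[Long, Customer]

  override def findById(id: Long): Option[Customer] =
    cache.get(id).orElse {
      val fetched = underlying.findById(id)
      fetched.foreach(c => cache.put(id, c))
      fetched
    }
}
```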