- 8+ years of IT experience in the Big Data domain using various Hadoop ecosystem tools and Spark APIs.
- Solid understanding of the architecture and workings of the Hadoop framework, including the Hadoop Distributed File System (HDFS) and its ecosystem components: MapReduce, Pig, Hive, HBase, Flume, Sqoop, Hue, Ambari, ZooKeeper, Oozie, Storm, Spark, and Kafka.
- Experience in building highly reliable, scalable Big Data solutions on the Hadoop distributions Cloudera, Hortonworks, and AWS EMR.
- Good experience in working with different ETL tool environments like SSIS, Informatica and reporting tool environments like SQL Server Reporting Services (SSRS), Cognos and Business Objects.
- Experienced in Data Modeling and Data Analysis; proficient in gathering business requirements and handling requirements management.
- Hands-on experience in Normalization (1NF, 2NF, 3NF and BCNF) and Denormalization techniques for effective and optimum performance in OLTP and OLAP environments.
- Experience in transferring data from AWS S3 to AWS Redshift using Informatica.
- Extensive experience in performing ETL on structured and semi-structured data using Pig Latin scripts.
- Managed ELDM Logical and Physical Data Models in ER Studio Repository based on the different subject area requests for integrated model.
- Expertise in moving structured schema data between Pig and Hive using HCatalog.
- Experience creating data models (ERD, logical) with robust data definitions, including entity-relationship-attribute, star, and snowflake models.
- Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
- Solid knowledge of Data Marts, Operational Data Store (ODS), OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema and Snowflake Modeling for Fact and Dimension tables) using Analysis Services.
- Expertise in Data Architecture, Data Modeling, Data Migration, Data Profiling, Data Cleansing, Transformation, Integration, Data Import, and Data Export using multiple ETL tools such as Informatica PowerCenter.
- Good understanding and exposure to Python programming.
- Experience in migrating the data using Sqoop from HDFS and Hive to Relational Database System and vice-versa according to client's requirement.
- Experience with RDBMS like SQL Server, MySQL, Oracle and data warehouses like Teradata and Netezza.
- Proficient knowledge of and hands-on experience in writing shell scripts in Linux.
- Experience in developing MapReduce jobs for data cleaning and data manipulation as required by the business.
- Good experience in importing and exporting data between HDFS/Hive and relational database systems like MySQL using Sqoop.
- Good knowledge of NoSQL databases including HBase, MongoDB, and MapR-DB.
- Installation, configuration and administration experience with Big Data platform tools such as Cloudera Manager (Cloudera) and MCS (MapR).
- Strong experience and knowledge of NoSQL databases such as MongoDB and Cassandra.
- Familiar with Amazon Web Services, including provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, RDS and others.
- Strong knowledge of Data Warehouse Architecture, Star Schema, Snowflake Schema, and Fact and Dimension tables.
- Experience in SQL and good knowledge of PL/SQL programming, including developing Stored Procedures and Triggers; also worked with DataStage, DB2, Unix, Cognos, MDM, Hadoop, and Pig.
Big Data & Hadoop Ecosystem: MapReduce, Spark 2.3, HBase 1.2, Hive 2.3, Pig 0.17, Solr 7.2, Flume 1.8, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hue, Cloudera Manager, StreamSets, Neo4j, Hadoop 3.0, Apache NiFi 1.6, Cassandra 3.11
Data Modeling Tools: Erwin R9.7/9.6, ER Studio V17
BI Tools: Tableau 10, Tableau Server 10, Tableau Reader 10, SAP Business Objects, Crystal Reports
Databases: Oracle, DB2, SQL Server.
RDBMS: Microsoft SQL Server 2017, Teradata 15.0, Oracle 12c, and MS Access
Operating Systems: Microsoft Windows Vista, 7, 8 and 10, UNIX, and Linux.
Packages: Microsoft Office 2016, Microsoft Project 2016, SAP, Microsoft Visio, SharePoint Portal Server
Project Execution Methodologies: Agile, Ralph Kimball and Bill Inmon’s data warehousing methodologies, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD)
Confidential - Durham, NC
Sr. Big Data Engineer
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Set up the AWS cloud environment, including S3 storage and EC2 instances.
- Loaded and transformed large sets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts.
- Implemented MapReduce programs to retrieve results from unstructured data set.
- Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Designed and worked on a Big Data analytics platform for processing customer interface preferences and comments using Hadoop, Hive, Pig and Cloudera.
- Imported data from Oracle into HDFS and Hive using Sqoop, and exported it back to Oracle.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Provided thought leadership for the architecture and design of Big Data analytics solutions for customers, actively drove Proof of Concept (POC) and Proof of Technology (POT) evaluations, and implemented a Big Data solution.
- Developed numerous MapReduce jobs in Scala for data cleansing and analyzed data in Impala.
- Created a data pipeline using Processor Groups and multiple processors in Apache NiFi for flat file and RDBMS sources as part of a POC using Amazon EC2.
- Worked on reading multiple data formats on HDFS using Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Installed and configured Pig and wrote Pig Latin scripts.
- Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
- Analyzed the SQL scripts and designed the solution to implement using Scala.
- Built data platforms, pipelines, and storage systems using Apache Kafka, Apache Storm and search technologies such as Elasticsearch.
- Implemented POCs to migrate iterative MapReduce programs into Spark transformations using Scala.
- Developed Spark scripts by using Python and Scala shell commands as per the requirement.
- Involved in batch processing of data sources using Apache Spark and Elasticsearch.
- Developed Spark jobs using Scala in test environment for faster data processing and used Spark SQL for querying.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
- Designed and implemented SOLR indexes for the metadata that enabled internal applications to reference Scopus content.
- Extensively worked on Shell scripts for running SAS programs in batch mode on UNIX.
- Wrote Python scripts to parse XML documents and load the data into the database.
- Used Python to extract weekly information from XML files.
- Developed Python scripts to clean the raw data.
- Used Spark with Scala for parallel data processing and better performance.
- Extensively used Pig for data cleansing and to extract data from web server output files for loading into HDFS.
- Developed a data pipeline using Kafka and Storm to store data into HDFS.
- Implemented Kafka producers with custom partitioning, configured brokers, and implemented high-level consumers for the data platform.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Analyzed large data sets to determine the optimal way to aggregate and report on them using MapReduce programs.
- Developed simple to complex MapReduce streaming jobs using Python.
Environment: Pig 0.17, Hive 2.3, HBase 1.2, Sqoop 1.4, Flume 1.8, Cassandra 3.11, ZooKeeper, AWS, MapReduce, HDFS, Oracle, Cloudera, Scala, Spark 2.3, SQL, Apache Kafka 1.0.1, Apache Storm, Python, Unix and Solr 7.2
Confidential - Newport Beach, CA
Sr. Data Engineer
- Architected, Designed and Developed Business applications and Data marts for reporting.
- Developed Big Data solutions focused on pattern matching and predictive modeling.
- The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark.
- Installed and configured a multi-node cluster in the cloud on Amazon Web Services (AWS) EC2.
- Created Hive external tables to stage data and then moved the data from staging to the main tables.
- Worked on exporting data from Hive tables into a Netezza database.
- Implemented the Big Data solution using Hadoop, Hive and Informatica to pull/load the data into HDFS.
- Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into RDBMS through Sqoop.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Created a data pipeline using Processor Groups and multiple processors in Apache NiFi for flat file and RDBMS sources as part of a POC using Amazon EC2.
- Built Hadoop solutions for big data problems using MR1 and MR2 in YARN.
- Loaded data from different sources such as HDFS or HBase into Spark RDDs and implemented in-memory data computation to generate the output response.
- Developed complete end-to-end Big Data processing in the Hadoop ecosystem.
- Used AWS Cloud with Infrastructure Provisioning / Configuration.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Involved in PL/SQL query optimization to reduce the overall run time of stored procedures.
- Worked on configuring and managing disaster recovery and backup on Cassandra Data.
- Utilized Oozie workflows to run Pig and Hive jobs.
- Extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
- Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
- Implemented partitioning, dynamic partitions and buckets in Hive.
- Used Flume to collect, aggregate, and store the web log data from different sources like web servers, mobile and network devices and pushed to HDFS.
- Supported setting up the QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.
Environment: Apache Spark, Hive 2.3, Informatica, HDFS, MapReduce, Scala, Apache NiFi 1.6, YARN, HBase, PL/SQL, MongoDB, Pig 0.16, Sqoop 1.2, Flume 1.8
Confidential - Greensboro
- Involved in converting Hive/SQL queries into Spark transformations using Spark SQL, Python and Scala.
- Worked extensively with NoSQL databases such as MongoDB and Cassandra.
- Moved relational database data using Sqoop into Hive dynamic partition tables via staging tables.
- Provided technical support during delivery of MDM (Master Data Management) components.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Extensively worked on the core and SparkSQL modules of Spark.
- Used Spark API over Hadoop YARN to perform analytics on data in Hive.
- Worked with the Data Governance, Data Quality and Metadata Management teams to understand the project.
- Implemented optimized joins across different data sets to get top claims by state using MapReduce.
- Created HBase tables to store data in various formats coming from different sources.
- Responsible for importing log files from various sources into HDFS using Flume.
- Worked on analyzing Hadoop stack and different big data analytic tools including Pig, Hive, HBase database and Sqoop.
- Completed a Proof of Concept using an Apache NiFi workflow in place of Oozie to automate loading data into HDFS and pre-processing with Pig.
- Designed NiFi flows to pull data from various sources and push it into HDFS and Cassandra.
- Integrated bulk data into the Cassandra file system using MapReduce programs.
- Worked with NiFi for managing the flow of data from source to HDFS.
- Created a customized BI tool for the management team that performs query analytics using HiveQL.
- Used Hive and Pig to generate BI reports.
- Created partitions and buckets based on state for further processing using bucket-based Hive joins.
- Worked on custom Pig Loaders and storage classes to work with a variety of data formats, including XML file formats.
- Used Apache Kafka for tracking data ingestion to Hadoop cluster.
- Integrated Apache Kafka with Apache Storm and created Storm data pipelines for real-time processing.
- Used Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Hive, Pig, and Sqoop.
- Used Oozie Operational Services for batch processing and scheduling workflows dynamically.
- Used Impala for data analysis.
- Monitored the cluster using Cloudera Manager.
Environment: Hadoop, HDFS, HBase, MongoDB, MapReduce, Hive, Pig, Sqoop, Flume, Spark, Oozie, Kafka, SQL, ETL, Cloudera Manager, MySQL
Confidential - Dallas, TX
Data Modeler/Data Architect
- Responsible for the data architecture design delivery, data model development, review, approval and Data warehouse implementation.
- Designed and developed the conceptual then logical and physical data models to meet the needs of reporting.
- Involved in designing and developing Data Models and Data Marts that support the Business Intelligence Data Warehouse.
- Implemented logical and physical relational databases and maintained database objects in the data model using Erwin 9.5.
- Responsible for Big data initiatives and engagement including analysis, brainstorming, POC, and architecture.
- Used SDLC Methodology of Data Warehouse development using Kanbanize.
- Worked with Hadoop eco system covering HDFS, HBase, YARN and MapReduce.
- Performed data mapping and data design (data modeling) to integrate data across multiple databases into the EDW.
- Designed both 3NF data models and dimensional data models using Star and Snowflake schemas.
- Involved in Normalization/Denormalization techniques for optimum performance in relational and dimensional database environments.
- Developed Master data management strategies for storing reference data.
- Worked with Data Stewards and Business analysts to gather requirements for MDM Project.
- Involved in Testing like Unit testing, System integration and regression testing.
- Worked with SQL Server Analysis Services (SSAS) and SQL Server Reporting Service (SSRS).
- Worked on Data modeling, Advanced SQL with Columnar Databases using AWS.
- Performed reverse engineering of the dashboard requirements to model the required data marts.
- Developed Source to Target Matrix with ETL transformation logic for ETL team.
- Cleansed, extracted and analyzed business data on a daily basis and prepared ad-hoc analytical reports using Excel and T-SQL.
- Created Data Migration and Cleansing rules for the Integration Architecture (OLTP, ODS, DW).
- Handled performance requirements for databases in OLTP and OLAP models.
- Conducted meetings with business and development teams for data validation and end-to-end data mapping.
- Responsible for Metadata Management, keeping up to date centralized metadata repositories using Erwin modeling tools.
- Involved in debugging and tuning PL/SQL code, tuning queries, and optimizing the SQL database.
- Led data migration from legacy systems into modern data integration frameworks from conception to completion.
- Generated ad-hoc SQL queries using joins, database connections and transformation rules to fetch data from legacy DB2 and SQL Server 2014 database systems.
- Managed the meta-data for the Subject Area models for the Data Warehouse environment.
- Generated DDL and created the tables and views in the corresponding architectural layers.
- Handled importing of data from various data sources, performed transformations using MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
- Involved in performing extensive Back-End testing by writing SQL queries and PL/SQL stored procedures to extract the data from SQL Database.
- Participated in code/design reviews and provided input into best practices for reports and universe development.
- Involved in Netezza administration activities like backup/restore, performance tuning, and security configuration.
- Involved in the validation of the OLAP, Unit testing and System Testing of the OLAP Report Functionality and data displayed in the reports.
- Created a high-level, industry-standard, generalized data model and converted it into logical and physical models at later stages of the project using Erwin and Visio.
- Participated in Performance Tuning using Explain Plan and TKPROF.
- Involved in translating business needs into long-term architecture solutions and reviewing object models, data models and metadata.
Environment: Erwin 9.0, HDFS, HBase, Hadoop, Metadata, MS Visio, SQL Server 2016, SDLC, PL/SQL, ODS, OLAP, OLTP, flat files.
- Interacted with business users to identify and understand business requirements and identified the scope of the projects.
- Identified and designed business Entities and attributes and relationships between the Entities to develop a logical model and later translated the model into physical model.
- Developed normalized Logical and Physical database models for designing an OLTP application.
- Enforced Referential Integrity (R.I) for consistent relationship between parent and child tables.
- Worked with users to identify the most appropriate source of record and profiled the data required for sales and service.
- Involved in defining the business/transformation rules applied for ICP data.
- Defined the list codes and code conversions between the source systems and the data mart.
- Developed the financial reporting requirements by analyzing the existing Business Objects reports.
- Utilized Informatica toolset (Informatica Data Explorer, and Informatica Data Quality) to analyze legacy data for data profiling.
- Reverse engineered the data models, identified the data elements in the source systems, and added new data elements to the existing data models.
- Created XSDs for applications to connect the interface and the database.
- Compared data with original source documents and validated data accuracy.
- Used reverse engineering to create Graphical Representation (E-R diagram) and to connect to existing database.
- Generated weekly and monthly asset inventory reports.
- Evaluated data profiling, cleansing, integration and extraction tools (e.g. Informatica).
- Coordinated with business users to design new reports effectively and efficiently, based on user needs and existing functionality.
- Analyzed the impact of low-quality and/or missing data on the performance of the data warehouse client.
- Worked with NZLoad to load flat file data into Netezza tables; gained a good understanding of Netezza architecture.
- Identified design flaws in the data warehouse and executed DDL to create databases, tables and views.
- Generated comprehensive analytical reports by running SQL queries against current databases to conduct data analysis.
- Involved in Data Mapping activities for the data warehouse.
- Created and configured Workflows, Worklets, and Sessions to transport data to target Netezza warehouse tables using Informatica Workflow Manager.
- Extensively worked on Performance Tuning and understanding Joins and Data distribution.
- Coordinated with DBAs and generated SQL code from data models.
- Generated reports using Crystal Reports for better communication between business teams.
Environment: SQL Server, Oracle 9i, MS Office, Embarcadero, Crystal Reports, Netezza, Teradata, Enterprise Architect, Toad, Informatica, ER Studio, XML, OBIEE