- 7+ years of IT experience in the Big Data domain using various Hadoop ecosystem tools and Spark APIs.
- Solid understanding of the architecture and workings of the Hadoop framework, including the Hadoop Distributed File System and its ecosystem components: MapReduce, Pig, Hive, HBase, Flume, Sqoop, Hue, Ambari, ZooKeeper, Oozie, Storm, Spark, and Kafka.
- Experience in building highly reliable, scalable Big Data solutions on Hadoop distributions: Cloudera, Hortonworks, and AWS EMR.
- Good experience in working with different ETL tool environments like SSIS, Informatica and reporting tool environments like SQL Server Reporting Services (SSRS), Cognos and Business Objects.
- Good experience in Data Modeling and Data Analysis; proficient in gathering business requirements and handling requirements management.
- Hands-on experience in Normalization (1NF, 2NF, 3NF and BCNF) and Denormalization techniques for effective and optimum performance in OLTP and OLAP environments.
- Experience in transferring data from AWS S3 to AWS Redshift using Informatica.
- Extensive experience in performing ETL on structured and semi-structured data using Pig Latin scripts.
- Managed ELDM Logical and Physical Data Models in the ER Studio Repository based on the different subject area requests for an integrated model.
- Expertise in moving structured schema data between Pig and Hive using HCatalog.
- Experience creating data models (ERD, logical) including robust data definitions, which may be entity-relationship-attribute models, star, and snowflake models.
- Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
- Solid knowledge of Data Marts, Operational Data Store (ODS), OLAP, and Dimensional Data Modeling with the Ralph Kimball methodology (Star Schema Modeling, Snowflake Modeling for Fact and Dimension Tables) using Analysis Services.
- Expertise in Data Architecture, Data Modeling, Data Migration, Data Profiling, Data Cleansing, Transformation, Integration, Data Import, and Data Export through the use of multiple ETL tools such as Informatica PowerCenter.
- Good understanding and exposure to Python programming.
- Experience in migrating the data using Sqoop from HDFS and Hive to Relational Database System and vice-versa according to client's requirement.
- Experience with RDBMS like SQL Server, MySQL, Oracle and data warehouses like Teradata and Netezza.
- Proficient knowledge of and hands-on experience in writing shell scripts in Linux.
- Experience in developing MapReduce jobs for data cleaning and data manipulation as required for the business.
- Good experience in importing and exporting data between HDFS/Hive and Relational Database Systems like MySQL using Sqoop.
- Good knowledge of NoSQL databases including HBase, MongoDB, and MapR-DB.
- Installation, configuration and administration experience on Big Data platforms using Cloudera Manager (Cloudera) and MCS (MapR).
- Strong experience and knowledge of NoSQL databases such as MongoDB and Cassandra.
- Familiar with Amazon Web Services, including provisioning and maintaining AWS resources such as EMR, S3 buckets, EC2 instances, RDS and others.
- Strong knowledge of Data Warehouse architecture, Star Schema, Snowflake Schema, and Fact and Dimension Tables.
- Experience in SQL and good knowledge of PL/SQL programming, including developing Stored Procedures and Triggers; also worked with DataStage, DB2, Unix, Cognos, MDM, Hadoop, and Pig.
Big Data & Hadoop Ecosystem: MapReduce, Spark 2.3, HBase 1.2, Hive 2.3, Pig 0.17, Solr 7.2, Flume 1.8, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hue, Cloudera Manager, StreamSets, Neo4j, Hadoop 3.0, Apache NiFi 1.6, Cassandra 3.11
Data Modeling Tools: Erwin R9.7/9.6, ER Studio V17
BI Tools: Tableau 10, Tableau Server 10, Tableau Reader 10, SAP Business Objects, Crystal Reports
Databases: Oracle, DB2, SQL Server.
RDBMS: Microsoft SQL Server 2017, Teradata 15.0, Oracle 12c, and MS Access
Operating Systems: Microsoft Windows Vista, 7, 8 and 10, UNIX, and Linux.
Packages: Microsoft Office 2016, Microsoft Project 2016, SAP, Microsoft Visio, SharePoint Portal Server
Project Execution Methodologies: Agile, Ralph Kimball and Bill Inmon’s data warehousing methodologies, Rational Unified Process (RUP), Rapid Application Development (RAD), Joint Application Development (JAD)
Confidential - Durham, NC
Sr. Big Data Engineer
- Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
- Set up the AWS cloud environment on S3 storage and EC2 instances.
- Loaded and transformed large sets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts.
- Implemented MapReduce programs to retrieve results from unstructured data sets.
- Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
- Designed and worked on a Big Data analytics platform for processing customer interface preferences and comments using Hadoop, Hive, Pig, and Cloudera.
- Imported data from Oracle into HDFS and Hive using Sqoop, and exported it back to Oracle.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Provided thought leadership for the architecture and design of Big Data Analytics solutions for customers, actively drove Proof of Concept (POC) and Proof of Technology (POT) evaluations, and implemented a Big Data solution.
- Developed numerous MapReduce jobs in Scala for data cleansing and analyzed data in Impala.
- Created a data pipeline using Processor Groups and multiple processors in Apache NiFi for flat file and RDBMS sources as part of a POC on Amazon EC2.
- Worked on reading multiple data formats on HDFS using Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Installed and configured Pig and wrote Pig Latin scripts.
- Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
- Analyzed the SQL scripts and designed the solution to implement using Scala.
- Built data platforms, pipelines, and storage systems using Apache Kafka, Apache Storm, and search technologies such as Elasticsearch.
- Implemented POCs to migrate iterative MapReduce programs into Spark transformations using Scala.
- Developed Spark scripts using Python and Scala shell commands as per the requirement.
- Involved in batch processing of data sources using Apache Spark and Elasticsearch.
- Developed Spark jobs using Scala in a test environment for faster data processing and used Spark SQL for querying.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
- Designed and implemented SOLR indexes for the metadata that enabled internal applications to reference Scopus content.
- Extensively worked on Shell scripts for running SAS programs in batch mode on UNIX.
- Wrote Python scripts to parse XML documents and load the data into a database.
- Used Python to extract weekly information from XML files.
- Developed Python scripts to clean the raw data.
- Used Spark with Scala for parallel data processing and better performance.
- Extensively used Pig for data cleansing and to extract data from web server output files for loading into HDFS.
- Developed a data pipeline using Kafka and Storm to store data in HDFS.
- Implemented Kafka producers with custom partitions, configured brokers, and implemented high-level consumers for the data platform.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Analyzed large data sets to determine the optimal way to aggregate and report on them using MapReduce programs.
- Developed simple to complex MapReduce streaming jobs using Python.
Environment: Pig 0.17, Hive 2.3, HBase 1.2, Sqoop 1.4, Flume 1.8, Cassandra 3.11, ZooKeeper, AWS, MapReduce, HDFS, Oracle, Cloudera, Scala, Spark 2.3, SQL, Apache Kafka 1.0.1, Apache Storm, Python, Unix and Solr 7.2
Confidential - Newport Beach, CA
Sr. Data Engineer
- Architected, designed and developed business applications and data marts for reporting.
- Developed Big Data solutions focused on pattern matching and predictive modeling.
- The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark.
- Installed and configured a multi-node cluster in the cloud using Amazon Web Services (AWS) EC2.
- Created Hive external tables to stage data and then moved the data from staging to main tables.
- Worked on exporting data from Hive tables into the Netezza database.
- Implemented the Big Data solution using Hadoop, Hive and Informatica to pull/load the data into HDFS.
- Pulled data from the data lake (HDFS) and massaged it with various RDD transformations.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into RDBMS through Sqoop.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster processing of data.
- Created a data pipeline using Processor Groups and multiple processors in Apache NiFi for flat file and RDBMS sources as part of a POC on Amazon EC2.
- Built Hadoop solutions for big data problems using MR1 and MR2 in YARN.
- Loaded data from different sources such as HDFS or HBase into Spark RDDs and implemented in-memory data computation to generate the output response.
- Developed complete end-to-end Big Data processing in the Hadoop ecosystem.
- Used AWS Cloud for infrastructure provisioning and configuration.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Involved in PL/SQL query optimization to reduce the overall run time of stored procedures.
- Worked on configuring and managing disaster recovery and backup for Cassandra data.
- Utilized Oozie workflows to run Pig and Hive jobs; extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
- Continuously tuned Hive UDFs for faster queries by employing partitioning and bucketing.
- Implemented partitioning, dynamic partitions and buckets in Hive.
- Used Flume to collect, aggregate, and store the web log data from different sources like web servers, mobile and network devices and pushed to HDFS.
- Supported setting up the QA environment and updating configurations for implementing scripts with Pig, Hive and Sqoop.
Environment: Apache Spark, Hive 2.3, Informatica, HDFS, MapReduce, Scala, Apache NiFi 1.6, YARN, HBase, PL/SQL, MongoDB, Pig 0.16, Sqoop 1.2, Flume 1.8
Confidential - Greensboro, NV
- Involved in converting Hive/SQL queries into Spark transformations using Spark SQL, Python and Scala.
- Worked extensively with NoSQL databases such as MongoDB and Cassandra.
- Moved relational database data into Hive dynamic partition tables using Sqoop and staging tables.
- Provided technical support during delivery of MDM (Master Data Management) components.
- Developed Spark scripts using Scala shell commands as per the requirement.
- Extensively worked on the Core and Spark SQL modules of Spark.
- Used the Spark API over Hadoop YARN to perform analytics on data in Hive.
- Worked with the Data Governance, Data Quality and Metadata Management teams to understand the project.
- Implemented optimized joins across different data sets to get top claims by state using MapReduce.
- Created HBase tables to store data in various formats coming from different sources.
- Responsible for importing log files from various sources into HDFS using Flume.
- Analyzed the Hadoop stack and different big data analytic tools including Pig, Hive, HBase, and Sqoop.
- Completed a Proof of Concept using an Apache NiFi workflow in place of Oozie to automate loading data into HDFS and pre-processing it with Pig.
- Designed NiFi flows to pull data from various sources and push it into HDFS and Cassandra.
- Integrated bulk data into the Cassandra file system using MapReduce programs.
- Worked with NiFi for managing the flow of data from source to HDFS.
- Created a customized BI tool for the management team that performs query analytics using HiveQL.
- Used Hive and Pig to generate BI reports.
- Created partitions and buckets based on state for further processing using bucket-based Hive joins.
- Worked on custom Pig Loaders and Storage classes to work with a variety of data formats, including XML file formats.
- Used Apache Kafka for tracking data ingestion into the Hadoop cluster.
- Integrated Apache Kafka with Apache Storm and created Storm data pipelines for real-time processing.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Hive, Pig, and Sqoop.
- Used Oozie Operational Services for batch processing and scheduling workflows dynamically.
- Used Impala for data analysis.
- Experienced in monitoring the cluster using Cloudera Manager.
Environment: Hadoop, HDFS, HBase, MongoDB, MapReduce, Hive, Pig, Sqoop, Flume, Spark, Oozie, Kafka, SQL, ETL, Cloudera Manager, MySQL
Confidential - Stillwater, OK
Data Modeler/Data Architect
- Responsible for the data architecture design delivery, data model development, review, approval and Data warehouse implementation.
- Designed and developed conceptual, then logical and physical data models to meet the needs of reporting.
- Familiarity with a NoSQL database such as MongoDB.
- Involved in designing and developing Data Models and Data Marts that support the Business Intelligence Data Warehouse.
- Implemented logical and physical relational databases and maintained database objects in the data model using Erwin 9.5.
- Responsible for Big Data initiatives and engagement including analysis, brainstorming, POC, and architecture.
- Followed an SDLC methodology for Data Warehouse development using Kanbanize.
- Worked with the Hadoop ecosystem covering HDFS, HBase, YARN and MapReduce.
- Performed Data Mapping and Data Design (Data Modeling) to integrate the data across multiple databases into the EDW.
- Designed both 3NF data models and dimensional data models using Star and Snowflake schemas.
- Involved in Normalization/Denormalization techniques for optimum performance in relational and dimensional database environments.
- Developed Master Data Management strategies for storing reference data.
- Worked with Data Stewards and Business Analysts to gather requirements for the MDM project.
- Involved in unit, system integration, and regression testing.
- Worked with SQL Server Analysis Services (SSAS) and SQL Server Reporting Services (SSRS).
- Worked on data modeling and advanced SQL with columnar databases using AWS.
- Performed reverse engineering of the dashboard requirements to model the required data marts.
- Developed a Source to Target Matrix with ETL transformation logic for the ETL team.
- Cleansed, extracted and analyzed business data on a daily basis and prepared ad-hoc analytical reports using Excel and T-SQL.
- Created Data Migration and Cleansing rules for the Integration Architecture (OLTP, ODS, DW).
- Handled performance requirements for databases in OLTP and OLAP models.
- Conducted meetings with business and development teams for data validation and end-to-end data mapping.
- Responsible for Metadata Management, keeping centralized metadata repositories up to date using Erwin modeling tools.
- Involved in debugging and tuning PL/SQL code, tuning queries, and optimizing the SQL database.
- Led data migration from legacy systems into modern data integration frameworks from conception to completion.
- Generated ad-hoc SQL queries using joins, database connections and transformation rules to fetch data from legacy DB2 and SQL Server 2014 database systems.
- Managed the metadata for the Subject Area models for the Data Warehouse environment.
- Generated DDL and created the tables and views in the corresponding architectural layers.
- Handled importing of data from various data sources, performed transformations using MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
- Involved in performing extensive back-end testing by writing SQL queries and PL/SQL stored procedures to extract the data from the SQL database.
- Participated in code/design reviews and provided input into best practices for reports and universe development.
- Involved in Netezza administration activities like backup/restore, performance tuning, and security configuration.
- Involved in OLAP validation, including unit testing and system testing of the OLAP report functionality and the data displayed in the reports.
- Created a high-level, industry-standard, generalized data model to be converted into logical and physical models at later stages of the project using Erwin and Visio.
- Participated in performance tuning using Explain Plan and TKPROF.
- Involved in translating business needs into long-term architecture solutions and reviewing object models, data models and metadata.
Environment: Erwin 9.0, HDFS, HBase, Hadoop, Metadata, MS Visio, SQL Server 2016, SDLC, PL/SQL, ODS, OLAP, OLTP, flat files.
- Interacted with business users to identify and understand business requirements and identified the scope of the projects.
- Identified and designed business entities, attributes, and relationships between the entities to develop a logical model, and later translated the model into a physical model.
- Developed normalized logical and physical database models for designing an OLTP application.
- Enforced Referential Integrity (R.I) for consistent relationship between parent and child tables.
- Worked with users to identify the most appropriate source of record and profile the data required for sales and service.
- Involved in defining the business/transformation rules applied for ICP data.
- Defined the list codes and code conversions between the source systems and the data mart.
- Developed the financing reporting requirements by analyzing the existing Business Objects reports.
- Utilized the Informatica toolset (Informatica Data Explorer and Informatica Data Quality) to analyze legacy data for data profiling.
- Reverse engineered the data models, identified the data elements in the source systems, and added new data elements to the existing data models.
- Created XSDs for applications to connect the interface and the database.
- Compared data with original source documents and validated data accuracy.
- Used reverse engineering to create Graphical Representation (E-R diagram) and to connect to existing database.
- Generated weekly and monthly asset inventory reports.
- Evaluated data profiling, cleansing, integration and extraction tools (e.g. Informatica).
- Coordinated with business users to provide an appropriate, effective, and efficient way to design new reporting needs based on user requirements and the existing functionality.
- Worked on assessing the impact of low-quality and/or missing data on the performance of the data warehouse client.
- Worked with NZLoad to load flat file data into Netezza tables; good understanding of Netezza architecture.
- Identified design flaws in the data warehouse and executed DDL to create databases, tables and views.
- Generated comprehensive analytical reports by running SQL queries against current databases to conduct data analysis.
- Involved in Data Mapping activities for the data warehouse.
- Created and configured Workflows, Worklets, and Sessions to transport the data to target warehouse Netezza tables using Informatica Workflow Manager.
- Extensively worked on Performance Tuning and understanding Joins and Data distribution.
- Coordinated with DBAs and generated SQL code from data models.
- Generated reports using Crystal Reports for better communication between business teams.
Environment: SQL Server, Oracle 9i, MS Office, Embarcadero, Crystal Reports, Netezza, Teradata, Enterprise Architect, Toad, Informatica, ER Studio, XML, OBIEE