- Over 5 years of work experience, including 3+ years in Hadoop development and 2+ years as a Data Analyst.
- Worked in various domains, including luxury retail and telecommunications.
- Excellent understanding of Hadoop architecture and various components such as HDFS, YARN, High Availability, and MapReduce programming paradigm.
- Hands-on experience installing, configuring, and using Hadoop ecosystem components such as Hadoop 2.x, MapReduce 2.x, HDFS, HBase, Hive, Kafka, Oozie, Zookeeper, Spark, Storm, Sqoop, and Flume.
- Experience in analyzing data using HiveQL, HBase and custom MapReduce programs in Java.
- Configured and implemented applications with messaging systems such as Kafka and RabbitMQ to guarantee data quality in high-speed processing.
- Extended Pig and Hive core functionality by writing custom UDFs.
- Wrote ad-hoc queries for analyzing the data using HiveQL.
- Developed real-time read/write access to very large datasets via HBase.
- Experience integrating various RDBMS data sources such as Oracle and SQL Server.
- Used NoSQL databases including HBase, MongoDB, and Cassandra.
- Implemented Sqoop jobs for large sets of structured and semi-structured data migration between HDFS and/or other data storage like Hive or RDBMS.
- Extracted data from log files and pushed it into HDFS using Flume.
- Scheduled workflow using Oozie workflow Engine.
- Consolidated MapReduce jobs by implementing Spark to decrease data processing time.
- Fluent in programming languages such as Java and Scala.
- Used Maven as the source-build framework.
- Experienced in Agile and Waterfall methodologies.
- Fluent in data mining and machine learning techniques such as classification, clustering, regression, and anomaly detection.
- Knowledge of social network analysis and graph theory.
- Work successfully in fast-paced settings, both independently and in collaborative teams.
Distributed File System: HDFS 2.6.0
Distributed Programming: MapReduce 2.6.x, Pig 0.12, Spark 1.3
Hadoop Libraries: Mahout, MLlib
NoSQL: HBase 0.98, MongoDB, Cassandra
Relational Databases: Oracle 11g/10g/9i, MySQL 5.0, SQL Server
Hadoop Distribution: Cloudera Distribution (CDH4, CM)
SQL on Hadoop: Hive 0.12, Cloudera Impala 2.0.x
Messaging & Ingestion: Kafka 0.8.x, RabbitMQ 3.4.x, Flume 1.3.x, Sqoop 1.4.4, Storm 0.9
Scheduling: Oozie 4.0.x, Falcon
Languages: Java, Python, Scala, UNIX Shell Scripting, SQL, C, C++
Service Programming: Zookeeper 3.3.6
Tools: Eclipse, Git, Maven, Tableau
Operating Systems: Linux (CentOS, Ubuntu), Mac OS, Windows
Methodologies: Agile, Waterfall
Confidential, New York, NY
Big Data / Hadoop Developer
- Installed, configured, monitored and maintained Hadoop clusters on Big Data platform.
- Configured Zookeeper and worked on Hadoop High Availability with the Zookeeper failover controller, adding support for scalable, fault-tolerant data solutions.
- Upgraded the system to a Spark environment.
- Wrote multiple MapReduce programs in Scala for data cleaning from multiple file formats.
- Monitored a Kafka-centered messaging system, including its architecture and consumer development; explored the RabbitMQ mechanism for delivering high-quality data.
- Tested Spark Streaming to optimize streaming process and guarantee data quality.
- Created multiple Hive tables, implemented partitioning, dynamic partitioning and buckets in Hive for efficient data access.
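A minimal HiveQL sketch of this partitioning-and-bucketing pattern; the table, column, and partition names are illustrative, not from the original project:

```sql
-- Hypothetical table: daily-partitioned, bucketed by user for efficient access
CREATE TABLE page_views (
  user_id  BIGINT,
  url      STRING,
  referrer STRING
)
PARTITIONED BY (view_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- Dynamic partitioning: Hive derives view_date from the last SELECT column
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE page_views PARTITION (view_date)
SELECT user_id, url, referrer, view_date FROM raw_page_views;
```

Partition pruning then lets queries filtered on `view_date` scan only the matching HDFS directories instead of the whole table.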
- Used Flume to collect, aggregate, and store dynamic web log data from different sources like web servers, mobile devices and pushed to HDFS.
- Stored and rapidly updated data in HBase, providing key-based access to specific data.
- Configured Spark to optimize data processing.
Environment: Hadoop 2.4.x, HDFS, MapReduce 2.4.0, YARN 2.6.2, Kafka 0.8.1, RabbitMQ 3.4.3, Spark 1.1.1, Hive 0.13.0, HBase 0.94.0, Sqoop 1.99.2, Flume 1.5.0, Oozie 4.0.0, Zookeeper 3.4.2.
Confidential, New York, NY
- Responsible for loading customer's data and event logs into HBase using Java API.
- Created HBase tables to store variable data formats of input data coming from different portfolios.
- Loaded huge volumes of data, across rows and columns, into HBase.
- Responsible for architecting Hadoop clusters with CDH4 on CentOS, managing with Cloudera Manager.
- Involved in initiating and successfully completing a Proof of Concept on Flume for pre-processing.
- Extracted data from Oracle through Sqoop, placed it in HDFS, and processed it.
- End-to-end performance tuning of Hadoop clusters and Hadoop MapReduce routines against very large data sets.
- Created and maintained Technical documentation for launching Hadoop Clusters and for executing Hive queries.
- Created user accounts and granted users access to the Hadoop cluster.
- Developed Hive UDFs to pre-process the data for analysis.
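A sketch of how such a custom Java UDF is typically registered and applied in HiveQL; the jar path, function name, class, and table are hypothetical placeholders:

```sql
-- Hypothetical: ship the UDF jar and expose the Java class as a Hive function
ADD JAR /tmp/udfs/normalize-udf.jar;
CREATE TEMPORARY FUNCTION normalize_phone AS 'com.example.hive.udf.NormalizePhone';

-- Apply the UDF while pre-processing a staging table
SELECT customer_id, normalize_phone(phone_raw) AS phone
FROM   staging_customers;
```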
- Responsible for using Oozie to control workflow.
Environment: Hadoop 2.0, HDFS, Pig 0.11, Hive 0.12.0, MapReduce 2.5.2, Sqoop, Linux, Flume 1.94, Kafka 0.8.1, HBase 0.94.6, CDH4, Oozie 3.3.0.
Big Data Analyst
- Involved in review of functional and nonfunctional requirements.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Implemented business logic by writing UDFs in Java and used various UDFs from Piggybank and other sources to derive results from the data.
- Imported and exported data into HDFS and Hive using Sqoop.
- Involved in defining job flows.
- Involved in managing and reviewing Hadoop log files collected by Flume.
- Involved in running Hadoop Streaming jobs to process terabytes of XML-format data.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Responsible for managing data coming from different sources.
- Gained proficient working experience with the NoSQL database HBase.
- Supported MapReduce programs running on the Hadoop cluster.
Environment: Java 7, Eclipse, Oracle 10g, Hadoop 2.4.x, Hive 0.12, HBase 0.92, Linux, MapReduce 2.x, HDFS, Kafka 0.8.1.
Big Data Analyst
- Worked on analyzing the Hadoop cluster and various big data analytic tools, including HiveQL.
- Imported and exported data in HDFS and Hive using Sqoop.
- Extracted BSON files from MongoDB, placed them in HDFS, and processed them.
- Designed and developed MapReduce jobs to process data coming in BSON format.
- Worked on the POC to bring data to HDFS and Hive.
- Wrote Hive UDFs to extract data from staging tables.
- Involved in creating Hive tables and loading them with data.
- Hands-on experience writing MapReduce code to convert unstructured data into structured data for loading.
- Experience in creating integration between Hive and HBase.
- Familiar with job scheduling using the Fair Scheduler so that CPU time is well distributed among all jobs.
- Used Oozie scheduler to submit workflows.
- Reviewed QA test cases with the QA team.
Environment: Hadoop 1.2.1, Java JDK 1.6, MapReduce 1.x, HBase 0.70, MySQL, MongoDB, Oozie 3.x.
- Gathered requirements to be incorporated into the system.
- Extensively worked on the analysis of tables in both the legacy data store and the new data store.
- Extensively worked on the analysis of columns in mapping tables for both the legacy data store and the new data store.
- Initiated the use of data warehouse ETL software during conversion of data to the Oracle database.
- Developed complete project documentation based on the analysis of tables and columns.
- Created DDL scripts to create, alter, drop tables, views, synonyms and sequences.
- Worked on SQL tables, records, and collections.
- Wrote SQL procedures, functions, and triggers for insert, update, and delete transactions, optimized for maximum performance.
- Extensively worked on the Database Triggers, Stored Procedures, Functions and Database Constraints.
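A minimal PL/SQL sketch of the trigger pattern described above, assuming a hypothetical `orders` table audited into `orders_audit`; all names are illustrative:

```sql
-- Hypothetical audit trigger: record every insert/update/delete on orders
CREATE OR REPLACE TRIGGER trg_orders_audit
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
  INSERT INTO orders_audit (order_id, action, changed_at)
  VALUES (COALESCE(:NEW.order_id, :OLD.order_id),  -- :OLD holds the key on DELETE
          CASE WHEN INSERTING THEN 'I'
               WHEN UPDATING  THEN 'U'
               ELSE 'D' END,
          SYSDATE);
END;
/
```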
- Developed SQL queries to fetch complex data from different tables in remote databases using database links.
- Used the ETL process to identify new or changed data in order to make better project decisions.
- Participated in performance tuning of SQL queries using Explain Plan to improve application performance.
- Exported source data residing in Excel format to flat files and accessed it via Oracle external tables to load the staging schema, at which point all source data could be efficiently transformed and migrated to the target schema.
- Extracted data from flat files using SQL*Loader.
- Developed UNIX shell scripts for loading data into the database using SQL*Loader.
- Created partitions on the tables to improve the performance.
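A sketch of the Oracle range-partitioning approach this refers to; the table, columns, and date boundaries are hypothetical:

```sql
-- Hypothetical fact table partitioned by date range so queries
-- filtered on sale_date scan only the relevant partitions
CREATE TABLE sales_fact (
  sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER(12,2)
)
PARTITION BY RANGE (sale_date) (
  PARTITION p2010 VALUES LESS THAN (TO_DATE('2011-01-01', 'YYYY-MM-DD')),
  PARTITION p2011 VALUES LESS THAN (TO_DATE('2012-01-01', 'YYYY-MM-DD')),
  PARTITION pmax  VALUES LESS THAN (MAXVALUE)
);
```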
- Participated in application planning and design activities by interacting with end users and collecting their requirements.
Environment: Oracle 11g, SQL Developer, SQL Tuning, SQL*Loader, UNIX Shell Scripting