- Over 5 years of overall IT experience, including 2+ years in Hadoop (Cloudera Distribution CDH 4 and 5) on a 30-node cluster.
- Worked with datasets of over 60 TB.
- Extensive experience in HDFS, Sqoop, Flume, Hive, Pig, Spark, Oozie, Impala.
- Good understanding and working experience on Hadoop Distributions like Cloudera and Hortonworks.
- Experience in importing and exporting multi-terabyte volumes of data with Sqoop between relational database management systems and HDFS.
- Experience in using HiveQL to query and analyze large datasets.
- Experience in writing simple to complex Pig scripts for processing and analyzing large volumes of data.
- Querying both managed and external tables created in Hive using Impala.
- Extensive experience with big data ETL and query tools such as HiveQL and Pig Latin.
- Experience in loading logs from multiple sources into HDFS using Flume.
- Experience with the Oozie workflow engine, running workflows with Sqoop, Pig, and Hive actions.
- Experience in using the Spark API over MapReduce to perform faster analytics on data.
- Experience in creating Resilient Distributed Datasets (RDDs) from input data and writing data transformations using PySpark.
- Experienced with Spark processing components such as Spark SQL.
- Experience in Data Warehousing and ETL processes.
- Experience in processing large datasets in different forms: structured, semi-structured, and unstructured data.
- Experience in working with different file formats such as Avro, Parquet, ORC, SequenceFile, and JSON.
- Background with traditional relational databases such as MySQL and SQL Server.
- Good analytical, interpersonal, communication, and problem-solving skills, with the ability to quickly master new concepts and to work in a group as well as independently.
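The Sqoop import/export work above can be sketched as follows; the connection string, credentials, table names, and paths are hypothetical placeholders, not actual project details.

```shell
# Import a MySQL table into HDFS as Parquet (hypothetical host/db/table)
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --as-parquetfile \
  --num-mappers 4

# Export analyzed results from HDFS back to MySQL
sqoop export \
  --connect jdbc:mysql://db-host:3306/sales \
  --username etl_user -P \
  --table order_summary \
  --export-dir /data/out/order_summary
```

`--num-mappers` controls the number of parallel map tasks, which is the usual lever for moving multi-terabyte volumes efficiently.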
Hadoop Distribution: Cloudera, Hortonworks
Big Data Ecosystem: HDFS, Sqoop, Flume, Hive, Pig, Impala, Oozie, Spark
Databases: MySQL, MS SQL Server
NoSQL / Storage: HBase, AWS Redshift, S3, EMR
Languages: Java, Python
Operating System: Windows XP/7/8/10, Linux, Mac OS
Confidential, Walnut Creek, CA
- Worked on Cloudera CDH 5.4 distribution of Hadoop.
- Worked extensively with MySQL to identify required tables and views to export into HDFS.
- Responsible for moving data from MySQL into HDFS on the development cluster for validation and cleansing.
- Responsible for creating Hive tables on top of HDFS data and developing Hive queries to analyze it.
- Developed Hive tables on data using different SerDes, storage formats, and compression techniques.
- Optimized the data sets by creating Dynamic Partition and Bucketing in Hive.
- Used Pig Latin to analyze datasets and perform transformation according to requirements.
- Implemented custom Hive UDFs for comprehensive data analysis.
- Involved in loading data from local file systems to Hadoop Distributed File System.
- Experience working with Spark SQL and creating RDDs using PySpark.
- Extensive experience performing ETL on large HDFS datasets using PySpark.
- Developed ETL workflow which pushes web server logs to an Amazon S3 bucket.
- Developed Oozie workflows to automate loading data into HDFS and pre-processing it with Sqoop scripts, Pig scripts, and Hive queries.
- Exported data from HDFS into an RDBMS using Sqoop.
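The dynamic partitioning and bucketing mentioned above follow the standard Hive pattern; a minimal sketch, with table and column names that are illustrative only:

```sql
-- Enable dynamic-partition inserts
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Partitioned, bucketed table stored as Parquet with Snappy compression
CREATE TABLE orders_part (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DOUBLE
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS PARQUET
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');

-- Dynamic-partition load from a staging table; the partition column
-- must come last in the SELECT list
INSERT OVERWRITE TABLE orders_part PARTITION (order_date)
SELECT order_id, customer_id, amount, order_date
FROM orders_staging;
```

Partitioning prunes whole directories at query time, while bucketing on a join key enables bucketed map joins and more efficient sampling.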
Confidential, Denver, CO
- Worked on a live 30-node Hadoop cluster running CDH 4.4.
- Worked with 20 TB of highly unstructured and semi-structured data.
- Responsible for building scalable distributed data solutions using Hadoop.
- Moved data from various file systems into HDFS using UNIX command-line utilities.
- Involved in importing and exporting data between RDBMS and HDFS using Sqoop.
- Created Hive tables on top of the loaded data and wrote Hive queries for ad hoc analysis.
- Implemented Partitioning, Dynamic Partition, and Bucketing in Hive for efficient data access.
- Performed querying of both managed and external tables created by Hive using Impala.
- Developed Pig scripts for data analysis and transformation.
- Implemented Pig as an ETL tool to perform transformations, event joins, and pre-aggregations before storing the data in HDFS.
- Developed Spark code and Spark SQL for faster testing and processing of data.
- Involved in converting Hive SQL queries into Spark transformations using Spark RDDs in Python.
- Developed UNIX shell scripts to load a large number of files into HDFS from the Linux file system.
- Implemented Oozie workflow for Sqoop, Pig and Hive actions.
- Exported the analyzed data to the relational databases using Sqoop.
- Debugged results to identify any missing data in the output.
- Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.
- Involved in performance tuning and fixing bugs.
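The Pig-based ETL described above (filter, event join, pre-aggregation before storing to HDFS) can be sketched as follows; the paths, schemas, and field names are hypothetical:

```pig
-- Load raw click events and user records (hypothetical paths/schemas)
clicks = LOAD '/data/raw/clicks' USING PigStorage('\t')
         AS (user_id:chararray, url:chararray, ts:long);
users  = LOAD '/data/raw/users' USING PigStorage('\t')
         AS (user_id:chararray, region:chararray);

-- Filter, event join, and pre-aggregation
valid  = FILTER clicks BY user_id IS NOT NULL;
joined = JOIN valid BY user_id, users BY user_id;
by_reg = GROUP joined BY users::region;
counts = FOREACH by_reg GENERATE group AS region, COUNT(joined) AS clicks;

STORE counts INTO '/data/out/clicks_by_region' USING PigStorage('\t');
```

Pre-aggregating in Pig before the data lands in HDFS keeps downstream Hive queries over the result small and cheap.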
- Involved in database development and creating SQL scripts.
- Involved in Requirement Study, UI Design, Development, Implementation, Code Review, Validation, Testing.
- Managed database related activities.
- Designed tables and indexes.
- Wrote SQL queries to fetch business data.
- Developed views, sequences, and indexes.
- Created joins and subqueries involving multiple tables.
- Analyzed SQL data, identified issues, and modified SQL scripts to fix them.
- Involved in troubleshooting and fine-tuning databases for performance and concurrency.
- Involved in fixing bugs and various forms of testing, including black-box and white-box testing.
- Handling issues regarding database, its connectivity and maintenance.
- Managed the priorities, deadlines, and deliverables of individual projects and related issues.
- Effectively prioritized work considering business need and urgency.
- Worked effectively and efficiently on multiple tasks and deadlines, producing high-quality results.
- Involved in performance improvement of the web application for a user-friendly experience and in resolving a critical issue in the production environment.
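The join and subquery work described above can be illustrated with a minimal, self-contained example; SQLite (via Python's standard library) and the table names stand in for the actual MySQL / SQL Server schemas, which are not part of the original text.

```python
import sqlite3

# In-memory database standing in for the production RDBMS
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO orders VALUES (101, 1, 50.0), (102, 1, 75.0), (103, 2, 20.0);
""")

# Join plus a subquery: customers whose total order amount
# exceeds the overall average order amount
cur.execute("""
SELECT c.name, SUM(o.amount) AS total
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.name
HAVING SUM(o.amount) > (SELECT AVG(amount) FROM orders)
""")
rows = cur.fetchall()
print(rows)  # Acme totals 125.0, above the 48.33 average; Globex (20.0) is filtered out
```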