- 7+ years of IT work experience in Big Data Analytics, including analysis, design, development, deployment, and maintenance of projects using Apache Hadoop - HDFS, Amazon S3, MapReduce, YARN, Hive, Impala, Hue, Sqoop, Flume, Spark, and Hadoop APIs.
- Experience delivering on Hadoop clusters using Cloudera and MapR under Agile Scrum methodologies.
- Experience in implementation and integration using Big Data Hadoop ecosystem components in Cloudera and MapR environments working with various file formats like Avro, Parquet, JSON and ORC.
- Experience in data ingestion, processing and analysis using Spark, Flume, Sqoop and Shell Script.
- Proficient in developing Sqoop jobs for migrating data from RDBMS to Hive/HDFS and vice versa.
- Experience in developing NoSQL applications using MongoDB, HBase, and Cassandra.
- Thorough knowledge of Hadoop architecture and core components: NameNode, DataNodes, JobTracker, TaskTracker, Oozie, Hue, Flume, HBase, etc.
- Strong experience with partitioning and bucketing concepts in Hive; designed both managed and external Hive tables to optimize performance.
- Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data warehouse tools for data analysis.
- Experience in extending Hive and Pig core functionality using custom user-defined functions (UDFs).
- Working knowledge of Oozie, a workflow scheduler system used to manage jobs that run on Pig, Hive, and Sqoop.
- Experience developing Spark applications in Python (PySpark) for easier Hadoop transitions.
- Knowledge of utilizing Flume for real-time data streaming and ingestion.
- Excellent knowledge in Data Analysis, Data Validation, Data Cleansing, Data Verification and identifying data mismatch.
- Wrote multiple customized MapReduce programs for various input file formats.
- Developed multiple internal and external Hive Tables using dynamic partitioning & bucketing.
- Involved in converting SQL queries into HiveQL.
- Designed and created Hive external tables using a shared metastore instead of the embedded Derby database, with partitioning, dynamic partitioning, and buckets.
- Experience using the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
- Experience in integrating Hive and HBase for effective operations.
- Designed and developed a full-text search feature with multi-tenant Elasticsearch after collecting real-time data through Spark.
- Experienced in working with the Apache Spark ecosystem, using Spark SQL and Python queries on different data file formats such as .txt, .csv, etc.
- Developed data pipelines for real-time use cases using Flume and PySpark; able to develop MapReduce and Spark applications using Scala and Python.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala; hands-on experience developing DataFrames and SQL queries in Spark SQL.
- Experience in analyzing large scale data to identify new analytics, insights, trends, and relationships with a strong focus on data clustering.
- An excellent team player with good organizational, interpersonal, communication, and leadership skills; a quick learner with a positive attitude and flexibility toward the ever-changing industry.
- Technically strong, with the ability to work with business users, project managers, team leads, architects, and peers, maintaining a healthy project environment.
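As an illustration of the Hive partitioning and bucketing work summarized above, a minimal sketch, held here as HiveQL strings in Python; the table, column, and path names are hypothetical:

```python
# Hypothetical HiveQL illustrating a partitioned, bucketed external table
# of the kind described above; all names are illustrative only.
CREATE_TXNS = """
CREATE EXTERNAL TABLE IF NOT EXISTS txns (
    txn_id   BIGINT,
    amount   DOUBLE,
    customer STRING
)
PARTITIONED BY (txn_date STRING)
CLUSTERED BY (customer) INTO 32 BUCKETS
STORED AS PARQUET
LOCATION '/data/warehouse/txns';
"""

# Dynamic-partition insert: Hive routes each row to its txn_date partition.
INSERT_DYNAMIC = """
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE txns PARTITION (txn_date)
SELECT txn_id, amount, customer, txn_date FROM staging_txns;
"""
```

An external table keeps the raw files intact when the table is dropped, while partitioning plus bucketing lets Hive prune partitions and sample buckets at query time.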
Distributed Computing: - Apache Hadoop 2.x, HDFS, YARN, MapReduce, Hive, Pig, HBase, Sqoop, Flume, ZooKeeper, Hue, Impala, Oozie, Spark (Core & SQL)
SDLC Methodologies: - Agile/Scrum, Waterfall
Databases: - MySQL, MSSQL, Oracle, NoSQL (HBase, Cassandra, MongoDB), Teradata, Netezza.
Distributed Filesystems: - HDFS, Amazon S3
Distributed Query Engines: - Hive, Presto
Distributed Computing Environment: - Cloudera, MapR
Operating Systems: - Windows, Mac OS, Unix, Ubuntu
Programming Languages: - Java, Python, UNIX, Pig Latin, HiveQL
Scripting: - Shell Scripting
Version Control: - GitHub
IDE: - PyCharm, Jupyter Notebook, Eclipse (PyDev)
Confidential, Charlotte NC
Big Data Hadoop Developer
- Involved in designing the data pipeline using various Hadoop components such as Spark, Hive, Impala, and Sqoop with the Cloudera distribution.
- Developed Spark (Python API) applications in PyCharm using Spark core libraries (RDD, Spark SQL) for performing ETL transformations, thereby eliminating the need for an ETL tool (SSIS/ODI).
- Developed Spark applications for batch processing.
- Implemented partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables for more efficient data access.
- Using open-source packages, designed a POC to demonstrate integration of Flume with Spark SQL for real-time data ingestion and processing.
- Designed Sqoop application for migration of sensitive PHI data residing on Netezza (RDBMS) to HDFS / Hive tables.
- Designed Sqoop Application for CDC (Change Data Capture) process and integrated it with Spark and Oozie.
- Performed data validation using Sqoop on the exported data.
- Performed transformations and analysis by writing complex HQL queries in Hive and exported the results to HDFS in various file formats (JSON, Avro, Parquet, ORC).
- Implemented data ingestion and transformation through automated Oozie workflows.
- Created audit reports with Cloudera Navigator that flag security threats and track all user and tool activity across the Hadoop components in use.
- Developed a strategy for the Hadoop components used to track data lineage and metadata extracted from the pipeline using Cloudera Navigator.
- Designed various plots showing HDFS analytics and other operations performed on the environment.
- Worked with the infrastructure team on testing the environment after patches, upgrades, and migrations.
- Developed multiple Python scripts delivering end-to-end support and common routines while maintaining product integrity.
- Performed analytical queries using Impala (Cloudera) for swifter responses and exported result to Tableau for data visualization.
- Documented all applications worked on and presented them to senior leadership.
Technologies Used - Cloudera, LINUX, Hadoop, HDFS, HBase, Hive, Spark, MapReduce, Sqoop, Flume, Python, Netezza, Oozie, PyCharm, Cloudera Impala, Tableau 10.0, GitHub.
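The Sqoop CDC (Change Data Capture) process described above could be sketched roughly as follows; the host, database, table, and column names are hypothetical:

```python
import shlex

# Rough sketch of a Sqoop incremental (CDC) import of the kind described
# above; the connection string, table, and column names are hypothetical.
def sqoop_incremental_import(table, check_column, last_value):
    cmd = [
        "sqoop", "import",
        "--connect", "jdbc:netezza://nz-host:5480/edw",  # hypothetical source
        "--table", table,
        "--incremental", "lastmodified",   # capture rows changed since last run
        "--check-column", check_column,    # timestamp column driving the CDC
        "--last-value", last_value,        # high-water mark from the prior run
        "--target-dir", f"/data/raw/{table}",
        "--as-parquetfile",
    ]
    return shlex.join(cmd)

print(sqoop_incremental_import("claims", "updated_at", "2016-01-01 00:00:00"))
```

In the real pipeline, the returned high-water mark would be persisted (e.g. by an Oozie coordinator property) so the next run only pulls newly changed rows.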
Confidential, Charlotte NC
Big Data Research Lab Intern
- Deployed a 5-node Hadoop cluster running the big data daemons (HDFS, Spark, Hive, Sqoop) with support for Avro and Parquet formats, with 60 cores, 80 GB RAM, and 25 TB of storage.
- Configured the NameNode, Secondary NameNode, ResourceManager, NodeManagers, and DataNodes.
- Used YARN as the resource manager for MapReduce and Spark applications.
- Implemented Kafka for streaming data; filtered and processed the data.
- Executed various analytical and windowing functions using Spark SQL.
- Worked on the “Framework for Social Network Sentiment Analysis Using Big Data Analytics” project.
- Worked as a Graduate Teaching Assistant.
- Related coursework: Bigdata Solutions for Business.
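The Kafka filter-and-process step above can be mimicked in plain Python; the real pipeline consumed from Kafka topics, and the record shape here is hypothetical:

```python
# Pure-Python stand-in for the Kafka filter-and-process step described above.
# In production, `records` would be messages consumed from a Kafka topic;
# the field names ("user", "score") are illustrative only.
def filter_stream(records, min_score=0.5):
    """Drop malformed or low-score records, normalize the rest."""
    for rec in records:
        if "user" not in rec or rec.get("score", 0.0) < min_score:
            continue  # malformed or below threshold: filtered out
        yield {"user": rec["user"].lower(), "score": rec["score"]}

sample = [
    {"user": "Alice", "score": 0.9},
    {"score": 0.2},                    # malformed: no user field
    {"user": "Bob", "score": 0.4},     # below threshold
]
print(list(filter_stream(sample)))
```

A generator keeps the filter streaming-friendly: records are processed one at a time rather than materialized in memory.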
Confidential, California, CA
- Imported data from different relational data sources, such as Teradata and other RDBMSs, to HDFS using Sqoop.
- Analyzed large data sets by running Hive queries and exported the results as views, flat files, etc.
- Worked with teh Data Science team to gather requirements for various data mining projects and conducted POC.
- Involved in creating Hive managed / external tables while maintaining raw files integrity and analyzed data using hive queries.
- Worked on Spark core Libraries like RDD, Spark SQL extensively by handling structured and unstructured data.
- Designed Spark applications performing ETL transformations using Python API.
- Developed simple to complex MapReduce jobs using Hive.
- Wrote Hive queries to build a consolidated view of the mortgage and retail data.
- Orchestrated hundreds of Sqoop scripts, Pig scripts, and Hive queries using Oozie workflows and sub-workflows.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Developed a Python utility to validate Hive tables ingested with Sqoop against the source RDBMS tables.
- Involved in running Hadoop jobs for processing millions of records of text data.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Responsible for managing data from multiple sources using Flume.
- Loaded and transformed large data sets consisting of structured, semi-structured, and unstructured data.
- Created and maintained technical documentation for launching Hadoop clusters and executing Hive queries.
- Performed statistical analysis on datasets using Python libraries such as Pandas, NumPy, and SciPy; discovered patterns by evaluating parameters and recommended the top parameters to focus on during design.
- Integrated Tableau with Impala as a source to create interactive BI dashboard.
Technologies Used - Cloudera, LINUX, Hadoop, HDFS, Hive, Spark, MapReduce, Sqoop, Flume, Teradata, Python, MySQL, Oozie, Impala, Tableau 10.0.
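The Sqoop-ingest validation utility described above could work along these lines; in production the rows would come from the RDBMS and Hive, while here they are plain tuples for illustration:

```python
import hashlib

# Sketch of the ingest-validation utility described above: compare row counts
# and an order-independent checksum between source and target extracts.
def table_fingerprint(rows):
    """Row count plus XOR of per-row hashes (order-independent)."""
    digest = 0
    for row in rows:
        h = hashlib.md5("|".join(map(str, row)).encode()).hexdigest()
        digest ^= int(h, 16)  # XOR makes the combined digest order-insensitive
    return len(rows), digest

def validate(source_rows, target_rows):
    """True when both sides hold the same rows, regardless of order."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)

src = [(1, "a"), (2, "b")]
tgt = [(2, "b"), (1, "a")]   # same data, different order
print(validate(src, tgt))    # prints True
```

Comparing counts alongside the XOR digest guards against the degenerate case where duplicated rows cancel each other out of the checksum.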
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Importing and exporting data into HDFS and Hive using Sqoop.
- Experience in defining job flows using Oozie and shell scripts.
- Experienced in implementing various customizations in MapReduce by building custom input formats, custom record readers, partitioners, and data types in Java.
- Experience in ingesting data using Flume from web server logs and telnet sources.
- Installed and configured Cloudera Manager, Hive, Pig, Sqoop, and Oozie on CDH5 cluster.
- Experienced in managing disaster recovery cluster and responsible for data migration and backup.
- Performed an upgrade in development environment from CDH 4.x to CDH 5.x.
- Implemented encryption and masking of customer-sensitive data in Flume by building a custom interceptor that masks and encrypts the data as required, based on rules stored in MySQL.
- Experience in managing and reviewing Hadoop log files.
- Extracted files from an RDBMS through Sqoop, placed them in HDFS, and processed them.
- Experience in running Hadoop streaming jobs to process terabytes of XML-format data.
- Supported MapReduce programs running on the cluster.
- Involved in loading data from UNIX file system to HDFS.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Experienced in implementing Hive-HBase integration by creating Hive external tables and using the HBase storage handler.
- Executed queries using Hive and developed MapReduce jobs to analyze data.
- Developed Pig Latin scripts to extract the data from the web server output files and load it into HDFS.
- Developed Hive queries for the analysts.
- Involved in loading data from LINUX and UNIX filesystem to HDFS.
- Designed and implemented MapReduce based large scale parallel relation learning system.
- Developed master tables in Hive using a JSON SerDe or Hive's get_json_object / json_tuple functions.
- Designed the entire HDFS data flow so it is orchestrated through Oozie workflows.
Technologies Used - Cloudera, Eclipse, Hadoop, Hive, HBase, MapReduce, Flume, HDFS, Pig, Sqoop, Oozie, Cassandra, Java (JDK 1.6), MySQL, UNIX Shell Scripting.
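The masking logic in the custom Flume interceptor described above could look roughly like this; the real interceptor was in the Java Flume event path, and the field names and masking rules here are hypothetical:

```python
import re

# Illustrative stand-in for the Flume masking interceptor described above.
# The field names and rules are hypothetical; in production the rules were
# driven by a table in MySQL rather than hard-coded.
MASK_RULES = {
    "ssn":   lambda v: "***-**-" + v[-4:],            # keep last 4 digits
    "email": lambda v: re.sub(r"^[^@]+", "****", v),  # hide the local part
}

def mask_event(event):
    """Apply the masking rule for each sensitive field, pass others through."""
    return {k: MASK_RULES[k](v) if k in MASK_RULES else v
            for k, v in event.items()}

print(mask_event({"ssn": "123-45-6789", "email": "jane@corp.com", "city": "NYC"}))
```

Keeping the rules in a lookup table mirrors the rules-in-MySQL design: new fields can be masked by adding a rule, without touching the interceptor logic.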
Jr. Hadoop Developer
- Involved in creating Hive tables, loading them with data, and writing Hive queries to process the data.
- Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS to Hive.
- Implemented partitioning and bucketing in Hive for better organization of the data.
- Worked with the team fetching live stream data from DB2 into HBase tables using Flume.
- Developed solutions to pre-process large sets of structured, semi-structured data, with different file formats (Text file, Avro data files, Sequence files, XML and JSON files, ORC and Parquet).
- Involved in the design phase for getting live event data from the database to the front-end application.
- Imported data from Hive tables and ran SQL queries over the imported data and existing RDDs.
- Responsible for loading and transforming large sets of structured, semi structured and unstructured data.
- Collected the log data from web servers and integrated it into HDFS using Flume.
- Responsible for managing data coming from different sources.
- Extracted files from CouchDB, placed them into HDFS using Sqoop, and pre-processed the data for analysis.
- Developed subqueries in Hive.
- Partitioned and bucketed the imported data using HiveQL.
- Partitioned dynamically using the dynamic partition insert feature.
- Moved this partitioned data into different tables per business requirements.
Technologies Used - Eclipse, Hadoop, HDFS, MapReduce, Pig, Hive, Flume, HBase, CouchDB, Apache Maven.
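An Oozie workflow of the kind used above for RDBMS-to-Hive imports can be sketched as a minimal workflow definition, held here as an XML string in Python; the action, table, and property names are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal Oozie workflow of the kind described above: one Sqoop
# action that imports into Hive, with kill-on-error; names are illustrative.
WORKFLOW_XML = """
<workflow-app name="rdbms-to-hive" xmlns="uri:oozie:workflow:0.5">
  <start to="sqoop-import"/>
  <action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.4">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <command>import --connect ${jdbcUrl} --table orders --hive-import</command>
    </sqoop>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Sqoop import failed</message></kill>
  <end name="end"/>
</workflow-app>
"""

# Sanity-check that the definition is well-formed XML.
root = ET.fromstring(WORKFLOW_XML)
print(root.tag)
```

Each action names its `ok`/`error` transitions, which is how Oozie builds the DAG of control flow between actions.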