- 5 years of IT experience in Big Data analytics, including analysis, design, development, deployment and maintenance of projects using Apache Hadoop: HDFS, Amazon S3, MapReduce, YARN, Hive, Pig Latin, Impala, Hue, Sqoop, Kafka, Flume, Spark, Scala, Oozie and Hadoop APIs.
- Experience working with Hadoop clusters on Cloudera and MapR distributions in Agile Scrum teams.
- Experience in implementation and integration using Big Data Hadoop ecosystem components in Cloudera and MapR environments working with various file formats like Avro, Parquet, JSON and ORC.
- Experience in data ingestion, processing and analysis using Spark with Scala, Spark Streaming, Kafka, Flume, Sqoop and Shell Script.
- Efficient in developing Sqoop jobs for migrating data from RDBMS to Hive / HDFS and vice versa.
- Experience in developing NoSQL applications using MongoDB, HBase and Cassandra.
- Thorough knowledge of Hadoop architecture and core components: NameNode, DataNode, JobTracker, TaskTracker, Oozie, Hue, Flume, HBase, etc.
- Strong experience with partitioning and bucketing in Hive; designed both managed and external tables to optimize performance.
- Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data warehouse tools for data analysis.
- Experience in extending Hive and Pig core functionality with custom user-defined functions (UDFs).
- Working knowledge of Oozie, a workflow scheduler system, to manage jobs that run on Pig, Hive and Sqoop.
- Experience developing Spark applications in Scala for easy Hadoop transitions.
- Knowledge of utilizing Kafka / Flume technologies for real time data streaming and ingestion.
- Excellent knowledge in Data Analysis, Data Validation, Data Cleansing, Data Verification and identifying data mismatch.
- Wrote multiple customized MapReduce Programs for various Input file formats.
- Developed multiple internal and external Hive Tables using dynamic partitioning & bucketing.
- Involved in converting SQL queries into HiveQL.
- Designed and created Hive external tables using a shared metastore instead of the embedded Derby database, with partitioning, dynamic partitioning and bucketing.
- Experience using the Oozie workflow scheduler to manage Hadoop jobs as a Directed Acyclic Graph (DAG) of actions with control flows.
- Experience in integrating Hive and HBase for effective operations.
- Designed and developed a full-text search feature with multi-tenant Elasticsearch, fed by real-time data collected through Spark Streaming.
- Experienced in working with the Apache Spark ecosystem, using Spark SQL and Scala queries on different data file formats such as .txt and .csv.
- Developed data pipeline for real time use cases using Kafka, Flume and Spark Streaming.
- Experience in analyzing large scale data to identify new analytics, insights, trends, and relationships with a strong focus on data clustering.
- An excellent team player with good organizational, interpersonal, communication and leadership skills; a quick learner with a positive attitude and flexibility towards the ever-changing industry.
- Technically strong, with the ability to work with business users, project managers, team leads, architects and peers while maintaining a healthy project environment.
Distributed Computing: Apache Hadoop 2.x, HDFS, YARN, MapReduce, Hive, Pig, HBase, Sqoop, Flume, Zookeeper, Hue, Impala, Oozie, Kafka, Spark
SDLC Methodologies: Agile Scrum
Relational Databases: Teradata, Netezza, Oracle, MySQL
Distributed Databases: NoSQL (HBase, Cassandra, MongoDB)
Distributed Filesystems: HDFS, Amazon S3
Distributed Query Engines: Hive, Presto
Distributed Computing Environment: Cloudera, MapR
Operating Systems: Windows, Mac OS, Unix, Ubuntu
Programming: Java, Python, Scala, UNIX, Pig Latin, HiveQL
Scripting: Shell Scripting
Version Control: GitHub
IDE: Scala IDE, PyCharm, Jupyter Notebook, Eclipse (PyDev)
Confidential, Phoenix, AZ
Hadoop / Spark Developer
- Worked with Hadoop Ecosystem components like HBase, Sqoop, Zookeeper, Oozie, Hive and Pig with MapR Hadoop distribution.
- Wrote Pig Scripts for sorting, joining, filtering and grouping the data.
- Developed programs in Spark based on the application for faster data processing than standard MapReduce programs.
- Developed Spark programs in Scala, created Spark SQL queries and developed Oozie workflows for Spark jobs.
- Developed the Oozie workflows with Sqoop actions to migrate the data from relational databases like Oracle, Teradata to HDFS.
- Used Hadoop FS actions to move the data from upstream location to local data locations.
- Wrote extensive Hive queries to transform the data used by downstream models.
- Developed MapReduce programs as part of predictive analytical model development.
- Developed Hive queries to do analysis of the data and to generate the end reports to be used by business users.
- Worked on scalable distributed computing systems, software architecture, data structures and algorithms using Hadoop, Apache Spark and ingested streaming data into Hadoop using Spark Framework and Scala.
- Expertise working with NoSQL databases like MongoDB.
- Used Git extensively as the code repository and VersionOne to manage the day-to-day Agile development process and to keep track of issues and blockers.
- Wrote Spark code in Python for the model integration layer.
- Implemented Spark applications in Scala and Java, using DataFrames and the Spark SQL API for faster data processing.
- Developed Spark code and Spark-SQL/Streaming for faster testing and processing of data.
- Wrote new spark jobs in Scala to analyze the data of the customers and sales history.
- Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL database for huge volume of data.
- Developed a data pipeline using Kafka, HBase, Mesos, Spark and Hive to ingest, transform and analyze customer behavioral data.
Technologies Used - MapR, LINUX, Hadoop, HBase, Hive, Impala, Oracle, Spark, Scala, Python, Pig, Sqoop, Teradata, Zookeeper, Oozie, MongoDB, MapReduce, GitHub.
Confidential, Waltham, MA
Big Data Hadoop Engineer
- Involved in designing the data pipeline using various Hadoop components like Spark, Hive, Impala, Sqoop with Cloudera distribution.
- Developed Spark (Python API) applications in PyCharm using Spark core libraries (RDD, Spark SQL) to perform ETL transformations, thereby eliminating the need for a separate ETL tool (SSIS/ODI).
- Developed Spark application for Batch processing.
- Implemented partitioning, dynamic partitioning and bucketing in Hive on internal and external tables for more efficient data access.
- Using Open Source packages, designed POC to demonstrate Integration of Kafka/Flume with Spark Streaming for real-time data Ingestion and processing.
- Designed Sqoop application for migration of sensitive PHI data residing on Netezza (RDBMS) to HDFS / Hive tables.
- Designed Sqoop Application for CDC (Change Data Capture) process and integrated it with Spark and Oozie.
- Performed data validation using Sqoop on the exported data.
- Performed transformations and analysis by writing complex HiveQL queries and exported the results to HDFS in various file formats (JSON, Avro, Parquet, ORC).
- Implemented data ingestion and transformation through automated Oozie workflows.
- Created audit reports with Cloudera Navigator that flag security threats and track all user and tool activity across the Hadoop components.
- Developed strategy for various Hadoop components used to track Data Lineage and Meta-Data extracted from Pipeline using Cloudera Navigator.
- Designed various plots showing HDFS Analytics and Other operations performed on the environment.
- Worked with the infrastructure team to test the environment after patches, upgrades and migrations.
- Developed multiple python scripts for delivering End-To-End support and common routines, while maintaining product integrity.
- Performed analytical queries using Impala (Cloudera) for swifter responses and exported result to Tableau for data visualization.
- Documented all applications worked on and presented them to senior management.
Technologies Used - Cloudera, LINUX, Hadoop, HDFS, HBase, Hive, Spark, MapReduce, Sqoop, Flume, Kafka, Python, Netezza, Oozie, PyCharm, Cloudera Impala, Tableau 10.0, GitHub.
- Imported data from different relational data sources, such as Teradata, to HDFS using Sqoop.
- Analyzed large data sets by running Hive queries and exported the results as views/ Flat-files, etc.
- Worked with the Data Science team to gather requirements for various data mining projects and conducted POC.
- Involved in creating Hive managed / external tables while maintaining raw files integrity and analyzed data using hive queries.
- Worked on Spark core Libraries like RDD, Spark SQL, Spark Streaming modules of Spark extensively by handling structured and unstructured data.
- Designed Spark applications performing ETL transformations using Python API.
- Developed simple to complex MapReduce jobs using Hive.
- Wrote Hive Queries to have a consolidated view of the mortgage and retail data.
- Orchestrated hundreds of Sqoop scripts, Pig scripts and Hive queries using Oozie workflows and sub-workflows.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Developed a Python utility to validate Hive tables ingested via Sqoop against the source RDBMS tables.
- Involved in running Hadoop jobs for processing millions of records of text data.
- Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
- Responsible for managing data from multiple sources using Flume.
- Loaded and transformed large data sets consisting of structured, semi-structured and unstructured data.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries.
- Performed statistical analysis on datasets using Python libraries such as Pandas, NumPy and SciPy; discovered patterns by evaluating parameters and recommended the top parameters to focus on during design.
- Integrated Tableau with Impala as a source to create interactive BI dashboard.
Technologies Used - Cloudera, LINUX, Hadoop, HDFS, Hive, Spark, MapReduce, Sqoop, Flume, Teradata, Python, MySQL, Oozie, Impala, Tableau 10.0.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Imported and exported data into HDFS and Hive using Sqoop.
- Experience in defining job flows using Oozie and shell scripts.
- Experienced in customizing MapReduce at various levels by implementing custom input formats, record readers, partitioners and data types in Java.
- Experience in ingesting data from web server logs and telnet sources using Flume.
- Installed and configured Cloudera Manager, Hive, Pig, Sqoop, and Oozie on CDH5 cluster.
- Experienced in managing disaster recovery cluster and responsible for data migration and backup.
- Performed an upgrade in development environment from CDH 4.x to CDH 5.x.
- Implemented encryption and masking of sensitive customer data in Flume by building a custom interceptor that masks and encrypts the data according to rules stored in MySQL.
- Experience in managing and reviewing Hadoop log files.
- Extracted files from RDBMS through Sqoop and placed in HDFS and processed.
- Experience in running Hadoop streaming jobs to process terabytes of XML data.
- Supported MapReduce programs running on the cluster.
- Involved in loading data from UNIX file system to HDFS.
- Involved in creating Hive tables, loading them with data and writing Hive queries that execute internally as MapReduce jobs.
- Experience implementing Hive-HBase integration by creating Hive external tables that use the HBase storage handler.
- Executed queries using Hive and developed MapReduce jobs to analyze data.
- Developed Pig Latin scripts to extract the data from the web server output files to load into HDFS.
- Developed Hive queries for the analysts.
- Involved in loading data from Linux and UNIX file systems to HDFS.
- Designed and implemented a MapReduce-based, large-scale parallel relation learning system.
- Developed master tables in Hive using a JSON deserializer (SerDe) or Hive's get_json_object and json_tuple functions.
- Designed the end-to-end data flow in HDFS so that it could be run through Oozie workflows.
Technologies Used - Cloudera, Eclipse, Hadoop, Hive, HBase, MapReduce, Flume, HDFS, Pig, Sqoop, Oozie, Cassandra, Java (JDK 1.6), MySQL, UNIX Shell Scripting.
Jr. Hadoop Developer
- Involved in creating Hive tables, loading them with data and writing Hive queries to process the data.
- Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS to Hive.
- Implemented Partitioning, Bucketing in Hive for better organization of the data.
- Worked with the team to stream live data from DB2 into HBase tables using Spark Streaming and Apache Kafka.
- Developed a Spark Streaming program in Scala to import data from Kafka topics into HBase tables.
- Developed solutions to pre-process large sets of structured and semi-structured data in different file formats (text files, Avro data files, Sequence files, XML and JSON files, ORC and Parquet).
- Involved in the Design Phase for getting live event data from the database to the front-end application using Spark Ecosystem.
- Imported data from Hive tables and ran SQL queries over the imported data and existing RDDs using Spark SQL.
- Responsible for loading and transforming large sets of structured, semi structured and unstructured data.
- Collected the log data from web servers and integrated into HDFS using Flume.
- Responsible for managing data coming from different sources.
- Extracted files from CouchDB, placed them into HDFS using Sqoop and pre-processed the data for analysis.
- Developed subqueries in Hive.
- Partitioned and bucketed the imported data using HiveQL.
- Partitioned data dynamically using the dynamic-partition insert feature.
- Moved the partitioned data into different tables as per business requirements.
Technologies Used - Eclipse, Hadoop, HDFS, MapReduce, Pig, Hive, Spark, Kafka, Flume, HBase, CouchDB, Apache Maven.