
Hadoop Developer Resume

Lawrenceville, NJ


  • Over 5 years of total IT experience, including over 3 years in Big Data/Hadoop and 2 years in Java and ETL projects, with extensive knowledge of the pharma and finance domains
  • Experience with the Hortonworks Data Platform (HDP) distribution
  • A former Java programmer with newly acquired Big Data skills, an insatiable intellectual curiosity, and the ability to mine hidden gems within large sets of structured, semi-structured and unstructured data
  • Worked on a recommendation platform based on content-based and collaborative filtering models
  • Worked with the Data Science team to gather requirements for various data mining projects
  • Developed multiple MapReduce jobs in Java and Pig for data cleaning and preprocessing
  • Hands-on experience with Hadoop ecosystem components such as MapReduce, HDFS, Sqoop, Pig, Hive and Oozie
  • Working experience in ingesting data onto clusters using Sqoop (incremental imports)
  • Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries
  • Hands-on experience setting up workflows with the Apache Oozie workflow engine to manage and schedule Hadoop jobs
  • Good knowledge of Scala APIs
  • Working knowledge of R and Python
  • Experience using HCatalog to integrate Hive, Pig and HBase
  • Loaded and transformed large sets of structured, semi-structured and unstructured data
  • Experience in administration of Hadoop ecosystems
  • Used the Microsoft Query feature with the Hive ODBC driver to access Hive data, and used the Excel Power View feature and Tableau to analyze it
  • Experience developing stored procedures and triggers using SQL and PL/SQL in relational databases such as MS SQL Server 2005/2008
  • Proficiency in SDLC methodologies and development processes such as requirement gathering, analysis and definition, proof of concept, design and implementation
  • A proactive planner with a flair for adopting emerging trends and addressing industry requirements to achieve organizational objectives
  • An effective communicator with exceptional analytical, technical, negotiation and management skills, and the ability to relate to people at any level of business and management


Object Oriented Languages: Java, Python

Statistical languages: R, Python, Matlab, SAS

Query languages: SQL, PL/SQL

Hadoop ecosystem: MapReduce, Hive, Pig, HDFS, Sqoop, Flume, Oozie, HBase

Technologies: Distributed systems, Machine learning, Data mining

Distributions: Hortonworks (HDP 1.2), Cloudera distribution

Markup and data formats: HTML, XML, JSON

Servers: WebSphere, WebLogic, Tomcat

Databases: Oracle 12c/11g/10g, MySQL, HBase, NoSQL

Revision control systems: CVS, GitHub, SVN

Data modeling tools: RStudio, SPSS

ETL tools: Informatica, Datastage

File systems and platforms: HDFS, Linux, Windows

Java and J2EE technologies: Servlets, JSP, JDBC

Query Performance Tuning: Oracle hints, query execution analysis, indexes, partitions

Data visualization tool: Tableau

Statistical analysis: A/B testing, hypothesis testing, ANOVA, t-tests, F-tests, central limit theorem

Distribution analysis: Histograms, Scatter plots, Scatter matrices, Heat maps

Scripting languages: Shell scripting, Perl

IDEs: Eclipse, Netbeans, Wing, Spyder

Agile platform: Rally

MS Office tools: Excel, PowerPoint, Word


Confidential, Lawrenceville, NJ

Hadoop developer


  • Imported data from RDBMS systems to HDFS cluster using Sqoop
  • Created HIVE staging tables to store imported data
  • Developed HQL scripts to preprocess staging data
  • Developed custom UDFs in Java and used them in Hive queries
  • Developed Informatica mappings, workflows and applications to transform these data sets
  • Developed Pig scripts to process some clinical studies
  • Developed shell scripts to call these HQL scripts and Informatica workflows
  • Optimized Hive queries using hints, map-side joins, predicate pushdown, ORC tables and cost-based optimization
  • Performed data quality checks using QuerySurge
  • Created data models using Erwin data modeler
  • Developed Java utilities to parse and transform data, generate name-value pairs, and combine results from spreadsheets using Apache libraries
  • Used Informatica Analyst to check the health of the data
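The name-value-pair utility above could be sketched roughly as follows. This is a minimal, hypothetical illustration in plain Java (class and record format are assumptions, not the actual project code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a name-value-pair utility: parse delimited
// "name=value" records and merge them into one map, with later values
// winning on duplicate keys. Record format and names are assumptions.
public class NameValueParser {

    // Parse a single record like "study=S1;site=NJ" into a map.
    public static Map<String, String> parse(String record) {
        Map<String, String> pairs = new LinkedHashMap<>();
        for (String field : record.split(";")) {
            int eq = field.indexOf('=');
            if (eq > 0) {
                pairs.put(field.substring(0, eq).trim(), field.substring(eq + 1).trim());
            }
        }
        return pairs;
    }

    // Combine several parsed records, as when merging spreadsheet rows.
    public static Map<String, String> combine(String... records) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (String r : records) {
            merged.putAll(parse(r));
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> m = combine("study=S1;site=NJ", "site=MA;subject=101");
        System.out.println(m); // the later record's "site" overrides the earlier one
    }
}
```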

Environment: HDFS, MapReduce, YARN, Hive, Sqoop, Pig, Java, shell scripts, Informatica BDE, Erwin, QuerySurge, Spotfire, Ambari, Toad, Oracle 12c

Confidential, Boston, MA

Hadoop developer


  • Loaded data from RDBMS server to HDFS cluster
  • Developed ETL scripts to load data into warehouse
  • Created Hive tables to store processed results in tabular format
  • Developed Sqoop scripts to transfer data between Hive and the Oracle database
  • Developed counters to debug complex MapReduce programs
  • Developed complex reduce-side joins
  • Worked on optimization of MapReduce jobs using combiners
  • Used the ORC format to improve the performance of Hive queries
  • Created managed and external tables in Hive
  • Performed complex HQL queries on Hive tables
  • Optimized Hive tables with techniques such as partitioning and bucketing to improve HQL query performance
  • Implemented dynamic partitions
  • Created custom user-defined functions in Hive
  • Scheduled jobs in the production environment using the Oozie scheduler
  • Debugged jobs using counters and Hadoop logs
  • Part of the team that developed Pig scripts
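The reduce-side join mentioned above can be sketched outside Hadoop: the mapper tags each record with its source table, and the reducer receives all tagged records for one key and emits the cross product. A minimal plain-Java sketch of that reducer logic (tags, field names and key are illustrative assumptions, not the actual job):

```java
import java.util.*;

// Plain-Java sketch of reduce-side join logic, written without Hadoop
// classes so it runs standalone. One call simulates one reducer invocation:
// all values for a single join key, each tagged "L" or "R" by source table.
public class ReduceSideJoinSketch {

    public static List<String> joinForKey(String key, List<String[]> taggedValues) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String[] tv : taggedValues) {
            if ("L".equals(tv[0])) left.add(tv[1]); else right.add(tv[1]);
        }
        // Inner join: cross product of the two buffered sides.
        List<String> out = new ArrayList<>();
        for (String l : left)
            for (String r : right)
                out.add(key + "," + l + "," + r);
        return out;
    }

    public static void main(String[] args) {
        List<String[]> values = Arrays.asList(
            new String[]{"L", "Alice"},      // record from the left-side table
            new String[]{"R", "order-1"},    // records from the right-side table
            new String[]{"R", "order-2"});
        System.out.println(joinForKey("42", values));
    }
}
```

Buffering one side in memory is what makes reduce-side joins expensive; the combiner and map-side-join optimizations listed above avoid that cost when one table is small.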

Environment: Hadoop, Hive, Sqoop, Pig, Java, shell scripts, SQL Developer, SQL*Plus, SQL Server


Data Science Intern


  • Plotted histograms to examine the distributions of variables
  • Used scatter plots, heat maps and correlation coefficients to identify and remove correlated features
  • Identified correlations and distributions using Tableau
  • Used principal component analysis to keep only the top few eigenvectors in the model
  • Performed scaling and transformation of variables to improve the convergence of gradient descent approaches
  • Implemented logistic regression, decision tree and Naive Bayes models to predict whether a loan would default
  • Compared the performance of the models using ROC curves
  • Finally, built random forests to further improve model performance while avoiding overfitting
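The scaling step above (standardizing variables so gradient descent converges faster) can be sketched as a z-score transform. The actual work was done in R/Python, so this plain-Java version is illustrative only:

```java
// Sketch of z-score standardization: subtract each column's mean and divide
// by its standard deviation so all features sit on a comparable scale,
// which speeds up gradient-descent convergence. Illustrative only.
public class FeatureScaling {

    public static double[] standardize(double[] col) {
        double mean = 0;
        for (double v : col) mean += v;
        mean /= col.length;
        double var = 0;
        for (double v : col) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / col.length);  // population std deviation
        double[] out = new double[col.length];
        for (int i = 0; i < col.length; i++) {
            out[i] = sd == 0 ? 0 : (col[i] - mean) / sd;  // guard constant columns
        }
        return out;
    }

    public static void main(String[] args) {
        double[] z = standardize(new double[]{10, 20, 30});
        System.out.println(java.util.Arrays.toString(z));
    }
}
```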

Environment: R, RStudio, Python


Software Engineer Intern


  • An interactive Android application that analyzes the user's aptitude by introducing levels based on question complexity and the percentage of correct answers for a given set of questions
  • The client back end is implemented in Java
  • The server back end is implemented in PHP
  • The front end is designed using XML
  • MySQL is used for creating and maintaining the database

Environment: Java 1.6, PHP, HTML, CSS, Javascript
