- Over 5 years of total IT experience, including over 3 years in Big Data Hadoop and 2 years in Java and ETL projects, with extensive knowledge of the pharma and finance domains.
- Experience in HDP (Hortonworks Data Platform) distributed model.
- A former Java programmer with newly acquired skills, an insatiable intellectual curiosity, and the ability to mine hidden gems located within large sets of structured, semi-structured and unstructured data.
- Worked on a recommendation platform based on content-based and collaborative filtering models.
- Worked with the Data Science team to gather requirements for various data mining projects.
- Developed multiple MapReduce jobs in Java and Pig for data cleaning and preprocessing.
- Hands-on experience with Hadoop ecosystem components such as MapReduce, HDFS, Sqoop, Pig, Hive and Oozie.
- Working experience ingesting data onto clusters using Sqoop, including incremental imports.
- Expert in working with the Hive data warehouse tool: creating tables, distributing data through partitioning and bucketing, and writing and optimizing HiveQL queries.
- Hands-on experience setting up workflows with the Apache Oozie workflow engine to manage and schedule Hadoop jobs.
- Good knowledge of Scala APIs
- Working knowledge of R and Python
- Experience using HCatalog with Hive, Pig and HBase, and integrating them.
- Loaded and transformed large sets of structured, semi-structured and unstructured data.
- Experience in administration of Hadoop ecosystems.
- Used Microsoft Query with the Hive ODBC driver to access Hive data, and analyzed the data with Excel Power View and Tableau.
- Experience developing stored procedures and triggers using SQL and PL/SQL in relational databases such as MS SQL Server 2005/2008.
- Proficiency in SDLC methodologies and development processes such as requirement gathering, analysis and definition, proof of concept, designing and implementation.
- A proactive planner with a flair for adopting emerging trends and addressing industry requirements to achieve organizational objectives.
- An effective communicator with exceptional analytical, technical, negotiation and management skills with the ability to relate to people at any level of business and management
Object Oriented Languages: Java, Python
Statistical languages: R, Python, MATLAB, SAS
Query languages: SQL, PL/SQL
Hadoop ecosystem: MapReduce, Hive, Pig, HDFS, Sqoop, Flume, Oozie, HBase
Technologies: Distributed systems, Machine learning, Data mining
Distributions: Hortonworks (HDP 1.2), Cloudera distribution
Markup languages: HTML, XML, JSON
Servers: WebSphere, WebLogic, Tomcat
Databases: Oracle 12c/11g/10g, MySQL, HBase, NoSQL
Version control systems: CVS, GitHub, SVN
Data modeling tools: RStudio, SPSS
ETL tools: Informatica, DataStage
File Systems: HDFS, Linux, Windows
Java and J2EE technologies: Servlets, JSP, JDBC
Query Performance Tuning: Oracle hints, query execution analysis, indexes, partitions
Data visualization tool: Tableau
Statistical analysis: A/B testing, hypothesis testing, ANOVA, t-tests, F-tests, central limit theorem
Distribution analysis: Histograms, Scatter plots, Scatter matrices, Heat maps
Scripting languages: Shell scripting, Perl
IDEs: Eclipse, NetBeans, Wing, Spyder
Agile platform: Rally
MS Office tools: Excel, PowerPoint, Word
Confidential, Lawrenceville, NJ
- Imported data from RDBMS systems to HDFS cluster using Sqoop
- Created Hive staging tables to store imported data
- Developed HQL scripts to preprocess staging data
- Developed custom UDFs in Java and used them in Hive queries
- Developed Informatica mappings, workflows, applications to transform these data sets
- Developed Pig scripts to process clinical study data
- Developed shell scripts to call these HQL scripts and Informatica workflows
- Optimized Hive queries using hints, map-side joins, predicate pushdown, ORC tables and cost-based optimization
- Performed data quality checks using QuerySurge
- Created data models using Erwin data modeler
- Developed Java utilities to parse, transform, generate name-value pairs and combine results from spreadsheets using Apache libraries
- Used Informatica Analyst to check the health of data
Environment: HDFS, MapReduce, YARN, Hive, Sqoop, Pig, Java, shell scripts, Informatica BDE, Erwin, QuerySurge, Spotfire, Ambari, Toad, Oracle 12c
Confidential, Boston, MA
- Loaded data from RDBMS server to HDFS cluster
- Developed ETL scripts to load data into warehouse
- Created Hive tables to store processed results in tabular format
- Developed Sqoop scripts to transfer data between Hive and the Oracle database
- Implemented counters to debug complex MapReduce programs
- Developed complex reduce-side joins
- Optimized MapReduce jobs using combiners
- Used the ORC format to improve the performance of Hive queries
- Created managed tables and external tables in Hive
- Ran complex HQL queries on Hive tables
- Optimized Hive tables using partitioning and bucketing to improve HQL query performance
- Implemented dynamic partitions
- Created custom user defined functions in Hive
- Scheduled jobs in production environment using Oozie scheduler
- Debugged jobs using counters and Hadoop logs
- Was part of the team that developed Pig scripts
Environment: Hadoop, Hive, Sqoop, Pig, Java, shell scripts, SQL Developer Plus, SQL Server
Data Science Intern
- Plotted histograms to examine the distributions of variables
- Used scatter plots, heat maps and correlation coefficients to identify and remove correlated features
- Identified correlations and distributions using Tableau
- Used principal component analysis to include only the top few eigenvectors in the model
- Performed scaling and transformation of variables to improve the performance of gradient descent approaches
- Implemented logistic regression, decision tree and Naive Bayes models to predict whether a loan would default
- Compared the performance of the models using ROC curves
- Finally, built random forests to further improve model performance while avoiding overfitting
Environment: R, RStudio, Python
Software Engineer Intern
- An interactive Android application that analyzes a user's aptitude by introducing levels based on the complexity and percentage of correct answers for a given set of questions
- Client back-end is implemented using Java
- Server back-end is implemented using PHP
- Front-end is designed using XML
- MySQL is used for creating and maintaining the database