- Close to 6 years of IT industry experience encompassing a wide range of skills in Big Data and Java/J2EE technologies.
- 3+ years of experience working with Big Data technologies on systems comprising massive amounts of data running in highly distributed mode on the Cloudera and Hortonworks Hadoop distributions.
- Strong knowledge of the Hadoop ecosystem, including HDFS, Hive, Oozie, HBase, Pig, Sqoop, ZooKeeper, Flume, Kafka, MR2, YARN, and Spark.
- Excellent knowledge of the Hadoop 1.0 and 2.0 architectures, including HDFS, Job Tracker, Task Tracker, NameNode, DataNode, Resource Manager, Node Manager, and the MapReduce programming paradigm.
- Good understanding of data replication, HDFS Federation, High Availability, and rack awareness concepts.
- Hands-on experience creating ETL pipelines with Flume, Kafka, Spark Streaming, Hive, and Spark SQL.
- Extensive knowledge of developing Spark Streaming jobs using RDDs (Resilient Distributed Datasets), working in PySpark and spark-shell as appropriate.
- Experience developing Java MapReduce jobs for data cleaning and data manipulation as required by the business.
- Good understanding of different file formats such as JSON, Parquet, Avro, SequenceFile, and XML.
- Executed complex HiveQL queries to extract the required data from Hive tables and wrote Hive UDFs as needed.
- Strong ability to prepare and present data in a visually appealing, easy-to-understand manner using Tableau, Excel, etc.
- Good understanding of supervised and unsupervised machine learning techniques such as k-NN, Random Forest, Naïve Bayes, Support Vector Machines (SVM), and Hidden Markov Models (HMM).
- Extensive experience in Object-Oriented Analysis and Design and Java/J2EE technologies such as Hibernate and Spring MVC.
- Expertise in Core Java, data structures, algorithms, and Object-Oriented Design (OOD), including OOP concepts, the Collections Framework, exception handling, the I/O system, and multithreading.
- Extensive experience working with SOAP and RESTful web services.
- Extensive experience working with Oracle, MS SQL Server, DB2, and MySQL relational databases.
- Experienced working across the SDLC in both Agile and Waterfall methodologies.
- Experience working in the healthcare, banking, and e-commerce industries.
- Ability to meet deadlines without compromising the quality of deliverables.
- Excellent communication, interpersonal, and problem-solving skills; a team player.
- Ability to quickly adapt to new environments and technologies.
Big Data Technologies: Hadoop 2.x & 1.x, Hive 0.14.0, Pig 0.14.0, Oozie 4.1.0, ZooKeeper 3.4.6, Impala 2.1.0, Sqoop 1.4.6, MapReduce 2.x, Tez 0.6.0, Spark 1.4.0, Flume 1.5.2, HBase 0.98.0, Solr 4.0.0, Kafka 0.8.0, YARN, Avro, Parquet
Software & Tools: Eclipse, PuTTY, Cygwin, Hue, JIRA, IntelliJ IDEA, NetBeans, Jenkins, Confluence
Distributions: Cloudera, Hortonworks
Monitoring Tools: Cloudera Manager, Ambari
Java Technologies: Core Java, JSP, Servlets, Spring, Hibernate, Ant, Maven
Testing Methodologies: JUnit, MRUnit
Programming Languages: Java, SQL, Pig Latin, HiveQL, Shell Scripting, Python, Scala
Databases: NoSQL (HBase), Oracle 12c/11g, MySQL, DB2, MS SQL Server
Operating Systems: Windows, Linux (RHEL, CentOS, Ubuntu)
ETL Tools: Tableau, Pentaho, Talend
Confidential, Hoboken, New Jersey
Big Data Engineer
- Configured a Kafka/Flume ingestion pipeline to transmit logs from the web servers to the Hadoop cluster.
- Used interceptors with regular expressions in the Flume configuration to filter unwanted chunks out of the logs and land the rest in HDFS, as sketched below.
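A minimal sketch of such a Flume agent configuration; the agent, channel, and path names (and the health-check regex) are hypothetical, not taken from the original project:

    # flume-weblog.conf -- hypothetical names, for illustration only
    agent1.sources = weblog-src
    agent1.channels = mem-ch
    agent1.sinks = hdfs-sink

    agent1.sources.weblog-src.type = exec
    agent1.sources.weblog-src.command = tail -F /var/log/httpd/access_log
    agent1.sources.weblog-src.channels = mem-ch

    # regex_filter interceptor drops matching events before they reach HDFS
    agent1.sources.weblog-src.interceptors = filt
    agent1.sources.weblog-src.interceptors.filt.type = regex_filter
    agent1.sources.weblog-src.interceptors.filt.regex = .*healthcheck.*
    agent1.sources.weblog-src.interceptors.filt.excludeEvents = true

    agent1.channels.mem-ch.type = memory

    agent1.sinks.hdfs-sink.type = hdfs
    agent1.sinks.hdfs-sink.channel = mem-ch
    agent1.sinks.hdfs-sink.hdfs.path = /data/raw/weblogs/%Y-%m-%d
    agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true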
- Used Avro SerDes for serialization and de-serialization of the log files at the different Flume agents.
- Created Pig Latin scripts to deduplicate log files (duplicates can appear after a Flume agent crash) and to extract the required features from the corresponding raw data; see the sketch below.
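A hedged Pig Latin sketch of that deduplication and feature extraction; the paths and field layout are hypothetical:

    -- dedupe.pig -- hypothetical paths/fields, for illustration only
    raw_logs = LOAD '/data/raw/weblogs' USING PigStorage('\t')
               AS (ts:chararray, ip:chararray, url:chararray, status:int);
    -- DISTINCT removes exact duplicate records replayed after an agent crash
    unique_logs = DISTINCT raw_logs;
    -- keep only the features needed downstream
    features = FOREACH unique_logs GENERATE ts, ip, status;
    STORE features INTO '/data/clean/weblogs' USING PigStorage('\t');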
- Implemented processing algorithms such as sessionization in PySpark by grouping each user's browsing events over a period of time (sketched below).
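A minimal PySpark sketch of sessionization under assumed inputs; the input layout and the 30-minute inactivity timeout are hypothetical, not the original job's parameters:

    # sessionize.py -- a minimal sketch, not the original job
    from pyspark import SparkContext

    SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that ends a session (assumed)

    def sessionize(events):
        """Split one user's time-ordered (timestamp, url) events into sessions."""
        events = sorted(events)
        sessions, current = [], []
        last_ts = None
        for ts, url in events:
            if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
            current.append((ts, url))
            last_ts = ts
        if current:
            sessions.append(current)
        return sessions

    sc = SparkContext(appName="sessionization")
    # assumed input line format: user_id \t epoch_seconds \t url
    lines = sc.textFile("/data/clean/weblogs")
    parsed = lines.map(lambda l: l.split("\t")) \
                  .map(lambda f: (f[0], (int(f[1]), f[2])))
    # group all events per user, then split each user's events into sessions
    sessions = parsed.groupByKey().mapValues(lambda ev: sessionize(list(ev)))
    sessions.saveAsTextFile("/data/sessions/weblogs")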
- Processed the data in batch using Spark, stored the results in the Parquet file format, and used compression codecs such as Snappy for high-performance querying.
- Partitioned both the raw and the processed data by day using a three-level partitioning scheme.
- Created external tables in Hive over the processed data produced by Spark, as in the sketch below.
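A hedged HiveQL sketch of such an external, day-partitioned Parquet table; the table and column names are hypothetical:

    -- hypothetical table/columns, for illustration only
    CREATE EXTERNAL TABLE sessions (
      user_id       STRING,
      session_start BIGINT,
      page_views    INT
    )
    PARTITIONED BY (year INT, month INT, day INT)  -- three-level, by day
    STORED AS PARQUET
    LOCATION '/data/processed/sessions';

    -- register a partition written by the Spark batch job
    ALTER TABLE sessions ADD PARTITION (year=2016, month=5, day=14);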
- Ingested secondary data from systems such as CRM, CPS, and ODS using Sqoop and correlated it with the log files, providing the platform for data analysis (a sample import command follows).
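A sketch of one such Sqoop import; the JDBC URL, credentials, and table name are hypothetical:

    # hypothetical connection and table, for illustration only
    sqoop import \
      --connect jdbc:oracle:thin:@//crm-db:1521/CRMPDB \
      --username etl_user -P \
      --table CUSTOMERS \
      --target-dir /data/secondary/crm/customers \
      --num-mappers 4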
- Performed basic aggregations (count, average, sum, distinct, max, min) on the existing Hive tables using Impala to determine metrics such as average hit rates, miss rates, and bounce rates; one such query is sketched below.
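A hedged SQL sketch of such an aggregation, written against the hypothetical sessions table above; the bounce-rate definition (single-page-view sessions) is an assumption:

    -- hypothetical metric query, for illustration only
    SELECT year, month, day,
           COUNT(*)        AS sessions,
           AVG(page_views) AS avg_page_views,
           SUM(CASE WHEN page_views = 1 THEN 1 ELSE 0 END) / COUNT(*)
                           AS bounce_rate
    FROM sessions
    GROUP BY year, month, day;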
- Persisted the processed data in column-oriented stores such as HBase, providing a platform for analytics with BI tools, analytical tools such as R, and machine learning libraries such as Mahout and Spark MLlib.
- Ran and orchestrated the entire flow daily using Oozie jobs (a coordinator sketch follows).
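A minimal sketch of a daily Oozie coordinator for such a flow; the app name, dates, and workflow path are hypothetical:

    <!-- coordinator.xml -- hypothetical names/paths, for illustration only -->
    <coordinator-app name="weblog-daily" frequency="${coord:days(1)}"
                     start="2016-01-01T00:00Z" end="2017-01-01T00:00Z"
                     timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
      <action>
        <workflow>
          <!-- the workflow chains the landed data through Pig, Spark, and Hive -->
          <app-path>hdfs:///apps/oozie/weblog-workflow</app-path>
        </workflow>
      </action>
    </coordinator-app>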
- Terminated all sessions at the end of each job and reassigned them for the next job.
- Tackled problems as they arose and completed the tasks committed for each sprint.
Environment: Flume 1.5.2, Sqoop 1.4.6, HDFS 2.6.0, Hadoop 2.6.0, Hive 0.14.0, HBase 0.98.0, Impala 2.1.0, Pig 0.14.0, Spark 1.4.0, Oozie 4.1.0
Confidential, Syracuse, NY
- Used Sqoop to import tables from OLTP, OLAP, and CRM databases directly into HDFS to offload the enterprise data warehouse (EDW).
- Transformed the imported tables from a highly normalized form into dimensional tables based on a star schema, putting the data into a more queryable shape.
- Re-imported the smaller, rarely updated tables in full on each run, overwriting the corresponding existing tables in HDFS.
- Incrementally imported the frequently updated OLTP tables into dedicated history tables in HDFS rather than overwriting the corresponding existing tables; a sample command is sketched below.
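A hedged sketch of such an incremental Sqoop import; the connection string, table, and check column are hypothetical:

    # hypothetical connection/table/column, for illustration only
    sqoop import \
      --connect jdbc:mysql://oltp-db/sales \
      --username etl_user -P \
      --table ORDERS \
      --target-dir /data/history/orders \
      --incremental lastmodified \
      --check-column last_update_ts \
      --last-value "2016-05-14 00:00:00" \
      --append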
- Merged the newly updated/added history-table records with the corresponding base table in HDFS and populated the result into a location over which a new external Hive table was created (see the sketch below).
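One way to express that merge in HiveQL, keeping the latest record per key; the table and column names are hypothetical:

    -- hypothetical tables/columns, for illustration only
    INSERT OVERWRITE DIRECTORY '/data/merged/orders'
    SELECT order_id, amount, last_update_ts
    FROM (
      SELECT order_id, amount, last_update_ts,
             ROW_NUMBER() OVER (PARTITION BY order_id
                                ORDER BY last_update_ts DESC) AS rn
      FROM (
        SELECT * FROM orders_base
        UNION ALL
        SELECT * FROM orders_history_delta
      ) unioned
    ) ranked
    WHERE rn = 1;  -- the newest version of each order wins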
- Performed aggregations such as average, count, and sum, populated the results into new data sets in Hadoop, and exported those to the EDW, providing low-latency, frequent querying capability with BI tools.
- Closely monitored the trade-off between the storage overhead of denormalization and the performance gained by reducing joins, and chose the best fit for the data.
- Used the Avro format for the incrementally imported data sets (history tables) and Parquet for the wider fact tables.
- Stored the large, rapidly changing data sets in HBase to optimize updates.
- Partitioned the incrementally imported data sets so that only the contents updated/added during the last run need to be queried and populated into the corresponding merged tables.
- Implemented partitioning and bucketing techniques to confine I/O operations to just the required subset of data, as sketched below.
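A minimal HiveQL sketch of a partitioned, bucketed table; the table, columns, and bucket count are hypothetical:

    -- hypothetical fact table, for illustration only
    CREATE TABLE orders_fact (
      order_id    BIGINT,
      customer_id BIGINT,
      amount      DECIMAL(12,2)
    )
    PARTITIONED BY (order_date STRING)          -- partition pruning skips whole days
    CLUSTERED BY (customer_id) INTO 32 BUCKETS  -- bucketing narrows scans and joins
    STORED AS PARQUET;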
- Used Oozie to orchestrate the above ETL process every day.
Environment: Sqoop 1.4.6, Flume 1.5.2, CRM, ODS, HBase 0.98.0, Hive 0.14.0, Hadoop 2.6.0 (Hortonworks distribution), Avro, Parquet, Oozie 4.1.0
- Worked on a 10-node Hadoop cluster ingesting 0.5 TB of data per day.
- Installed and configured Flume, Zookeeper and Kerberos on the Hadoop cluster.
- Created the Flume ETL pipeline to move web logs from the firewall server to HDFS.
- Developed MapReduce programs in Java for data analysis and data cleaning.
- Developed Pig Latin scripts to extract features such as location, IP address, and event status code, as sketched below.
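A hedged Pig Latin sketch of that extraction; the log layout and the regular expressions are hypothetical:

    -- hypothetical log layout, for illustration only
    logs = LOAD '/data/raw/fwlogs' USING TextLoader() AS (line:chararray);
    features = FOREACH logs GENERATE
        REGEX_EXTRACT(line, '^(\\S+)', 1)          AS ip,
        REGEX_EXTRACT(line, 'loc=(\\S+)', 1)       AS location,
        REGEX_EXTRACT(line, 'status=(\\d{3})', 1)  AS status;
    STORE features INTO '/data/features/fwlogs';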
- Created Hive tables, loaded data into them, and wrote Hive UDFs.
- Connected Excel to Hive using ODBC and loaded the table contents for reporting.
- Generated frequent reports on the collected error logs, providing a platform for analyzing any distributed denial-of-service (DDoS) activity.
- Created, scheduled, and frequently ran Oozie jobs to automate the workflow.
Environment: Hadoop 1.x, MapReduce 2.0, Hive 0.12.0, HDFS, Pig 0.12.0, CDH 4.x, Oozie 4.0.0, Cloudera Manager, Excel 2010
- Developed multithreaded programs using Core Java to measure system performance.
- Implemented Spring MVC in the application; handled the XML configuration for obtaining bean references in the Spring framework through Dependency Injection (DI)/Inversion of Control (IoC), as sketched below.
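A minimal sketch of such a Spring XML bean definition; the bean and class names are hypothetical:

    <!-- applicationContext.xml -- hypothetical names, for illustration only -->
    <beans xmlns="http://www.springframework.org/schema/beans"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.springframework.org/schema/beans
               http://www.springframework.org/schema/beans/spring-beans-3.0.xsd">

      <!-- the container injects the DAO into the service (DI/IoC) -->
      <bean id="reportDao" class="com.example.dao.ReportDaoImpl"/>
      <bean id="reportService" class="com.example.service.ReportServiceImpl">
        <property name="reportDao" ref="reportDao"/>
      </bean>
    </beans>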
- Used the Hibernate object/relational mapping framework as the persistence layer for interacting with Oracle.
- Implemented RESTful web services for consuming non-sensitive information.
- Created secure web services using SOAP security extensions and certificates for consuming payment information.
- Wrote stored procedures in Oracle 10g using PL/SQL for data entry and retrieval in the Reports module.
- Used Git for version control, committing changes to the local and remote repositories.
- Used Maven to package the Java application and deployed it on the WebLogic application server.
- Used Jenkins as the continuous-integration tool to pull code from version control, package it, and deploy it to the application server, automating the build-and-deploy cycle.
Environment: Java 1.6, JSP, Spring 3.0, Hibernate 3.0, MyEclipse, JavaScript, JSTL, Unix, shell scripting, AJAX, XML, SQL, PL/SQL, Oracle 10g, WebLogic 10.3.2, web services (SOAP, RESTful), Git 1.7, Maven 3.0.2, Jenkins 1.455
- Developed user-interface templates using Spring MVC and JSP.
- Developed form validations and implemented controllers such as SimpleFormController.
- Implemented design patterns such as DAO, Singleton, Business Delegate, and Strategy.
- Used the Spring 3.0 framework to implement the MVC design pattern.
- Used JMS queues for cross-communication among the different components of the application.
- Designed, developed and deployed the J2EE components on Tomcat.
- Used Hibernate for object-relational (OR) mapping on the Oracle database.
- Involved in transaction management and AOP using Spring.
- Pulled the source code from the Subversion repository and packaged the Java application using Ant scripts.
- Deployed the enterprise Java applications on Apache Tomcat Server.
Environment: Java/J2EE, JSP, Spring 3.0 framework, Oracle 9i, Hibernate 3.0, SVN 1.6, Ant 1.8.2, Apache Tomcat 7.0