- Over 8 years of professional experience in the analysis, design, development, integration, deployment and maintenance of quality software applications using Java/J2EE and Big Data Hadoop technologies.
- Over 4 years of working experience in data analysis and data mining using the Big Data stack.
- Proficiency in Java, Hadoop Map Reduce, Pig, Hive, Oozie, Sqoop, Flume, HBase, Scala, Spark, Kafka, Storm, Impala and NoSQL Databases.
- Extensive exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and the Hadoop infrastructure.
- Excellent knowledge of Hadoop architecture and ecosystem components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, YARN and the MapReduce programming paradigm.
- Good exposure to the column-oriented NoSQL databases HBase and Cassandra.
- Extensive experience working with semi-structured and unstructured data by implementing complex MapReduce programs using design patterns.
- Extensive experience writing custom Map Reduce programs for data processing and UDFs for both Hive and Pig in Java.
- Strong experience in analyzing large data sets by writing Pig scripts and Hive queries.
- Extensive experience working with structured data using HiveQL, join operations and custom UDFs, and in optimizing Hive queries.
- Experience in importing and exporting data between HDFS and relational databases using Sqoop.
- Experienced in job workflow scheduling and monitoring tools like Oozie.
- Experience with Apache Flume for collecting, aggregating and moving large volumes of data from various sources such as web servers and telnet sources.
- Hands-on experience with major Big Data components: Apache Kafka, Apache Spark, Zookeeper and Avro.
- Experienced in implementing unified data platforms using Kafka producers and consumers, and in implementing pre-processing using Storm topologies.
- Experienced in migrating MapReduce programs to Spark RDD transformations and actions to improve performance.
- Experience with ETL (Extract, Transform and Load) for Big Data using Talend Open Studio and Informatica.
- Strong experience in architecting real time streaming applications and batch style large scale distributed computing applications using tools like Spark Streaming, Spark SQL, Kafka, Flume, Map reduce, Hive etc.
- Experience using various Hadoop distributions (Cloudera, Hortonworks, MapR etc.) to fully implement and leverage new Hadoop features.
- Worked on custom Pig Loaders and Storage classes to work with a variety of data formats such as JSON and compressed CSV.
- Good knowledge of Amazon AWS concepts such as EMR and EC2 web services, which provide fast and efficient processing of Big Data.
- Experienced in working with different scripting technologies like Python, Unix shell scripts.
- Experience with source control systems such as SVN, CVS and Git.
- Strong experience working in UNIX/Linux environments and writing shell scripts.
- Skilled at building and deploying multi-module applications using Maven, Ant and servers like Jenkins.
- Adequate knowledge and working experience in Agile & Waterfall methodologies.
- Excellent problem-solving and analytical skills.
Hadoop/Big Data Technologies: HDFS, Map Reduce, Sqoop, Flume, Pig, Hive, Oozie, Impala, Apache Nifi, Zookeeper, Cloudera Manager, Ambari.
NoSQL Database: MongoDB, Cassandra
Real Time/Stream processing: Apache Storm, Apache Spark
Distributed message broker: Apache Kafka
Monitoring and Reporting: Tableau, Custom shell scripts.
Hadoop Distribution: Hortonworks, Cloudera, MapR.
Build Tools: Maven, SQL Developer
Programming & Scripting: Java, C, SQL, Shell Scripting, Python
Databases: Oracle, MySQL, MS SQL Server
Tools: Eclipse, MQ explorer, RFH util, SSRS, Aqua Data Studio, XML Spy, ETL (Talend)
Operating Systems: Linux, Unix, Mac OS X, Windows 10, Windows 8, Windows 7, Windows Server 2008/2003
- Worked on cluster installation, commissioning and decommissioning of Data Nodes, Name Node recovery and capacity planning in the cloud environment (Microsoft Azure).
- Managed a 29-node Hadoop cluster running the Hortonworks HDP 2.6 distribution on Microsoft Azure, using Ambari.
- Used Cloudbreak, part of the Hortonworks Data Platform, to provision, configure and elastically grow HDP clusters on cloud infrastructure (Microsoft Azure).
- Monitored cluster for performance, networking and data integrity issues. Responsible for troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.
- Formulated procedures for installation of Hadoop patches, updates and version upgrades.
- Installed and configured Tableau Desktop to connect to the Hortonworks Hive Framework (Database) which contains the click stream data from Mixpanel and Google Analytics.
- Used Apache Nifi for ingestion of data from Mixpanel API on to HDFS in raw JSON format.
- Developed optimal strategies for distributing the click stream data over the cluster by importing the data into HDFS through the Mixpanel and Google Analytics APIs.
- Developed custom shell scripts to connect to the Mixpanel and Google Analytics APIs and scheduled them with crontab.
- Designed and implemented Hive queries and functions for evaluation, filtering, loading and storing of data.
- Developed Hive tables on top of the JSON data consumed from the Mixpanel API and stored them in ORC format for optimized querying in Tableau.
- Used custom shell scripts to convert the Google Analytics data format (dic) to JSON and then loaded it onto HDFS for further analytics.
- Worked on the Hive database to provide both historical and live clickstream data from Mixpanel and Google Analytics to Tableau for reporting.
Environment: Hortonworks Data Platform (HDP), Hortonworks Data Flow (HDF), Hadoop, HDFS, Spark, Hive, MapReduce, Apache Nifi, Tableau Desktop, Linux, Microsoft Azure, Cloudbreak.
Confidential, Jacksonville, FL
Big Data Developer
- Collected and aggregated large amounts of data from different sources such as COSMA (CSX Onboard System Management Agent), BOMR (Back Office Message Router), ITCM (Interoperable train control messaging), Onboard mobile and network devices from the PTC (Positive Train Control) network using Apache Nifi and stored the data into HDFS for analysis.
- Used Apache Nifi for ingestion of data from IBM MQ (Message Queue).
- Implemented Nifi flow topologies to perform cleansing operations before moving data into HDFS.
- Developed Java MapReduce programs to transform ITCM log data into a structured format.
- Developed optimal strategies for distributing the ITCM log data over the cluster; importing and exporting the stored log data into HDFS and Hive using Apache Nifi.
- Developed custom code to read messages from IBM MQ and place them onto the Nifi queues.
- Worked with the Apache Nifi flow to convert raw XML data into JSON and Avro.
- Implemented Hive Generic UDFs to incorporate business logic into Hive queries.
- Configured Spark Streaming to receive real-time data from IBM MQ and store the stream data in HDFS.
- Analyzed the bandwidth data from the locomotives using HiveQL to extract the bandwidth consumed by each locomotive per day across carriers (AT&T, Verizon or Wi-Fi).
- Designed and implemented Hive queries and functions for evaluation, filtering, loading and storing of data.
- Installed and configured Tableau Desktop to connect, through the Hortonworks ODBC connector, to the Hortonworks Hive Framework (Database) containing the bandwidth data from the locomotives for further analytics.
- Collected and provided locomotive communication usage data by locomotive, channel, protocol and by application.
- Analyzed the Locomotive Communication Usage from COSMA to monitor in/out-bound traffic bandwidth by communication channel.
- Worked on the back-end Hive database to provide both historical and live bandwidth data from the locomotives to Tableau for reporting.
Environment: Hortonworks Data Platform (HDP), Hortonworks Data Flow (HDF), Hadoop, HDFS, Spark, Hive, MapReduce, Apache Nifi, Tableau Desktop, Linux.
Confidential, Houston, TX
Big Data Systems Engineer
- Installed and configured a three-node cluster with Hortonworks Data Platform (HDP 2.3) on the HP infrastructure and Management.
- Worked with HP Intelligent provisioning and the smart storage array for setting up the disks for the installation.
- Used a Big Data Benchmark tool called BigBench to benchmark the three-node cluster.
- Configured the tool BigBench and had it running on one of the nodes in the cluster.
- Ran the benchmark on datasets of 5 GB, 10 GB, 50 GB, 100 GB and 1 TB.
- Worked with the structured, semi-structured and unstructured data that BigBench handles automatically, running workloads that use Spark's machine learning libraries.
- Configured a PAT (Performance Analysis Tool) to output the benchmark results as automated charts in MS Excel.
- Used Ambari Server for monitoring the cluster while the benchmark is running.
- Worked with different teams to install operating system, Hadoop updates, patches, version upgrades of Hortonworks as required.
- Used a Performance Analysis Tool (PAT) to collect performance metrics from the Hadoop nodes, analyze resource utilization and draw automated charts in MS Excel.
- Worked with various performance monitoring tools like top, dstat, atop and also Ambari metrics.
- Collected the results of the tests on the different datasets (5 GB, 10 GB, 50 GB, 100 GB and 1 TB) from the server and loaded them into the PAT (Performance Analysis Tool) for further analysis of resource utilization.
- Worked with HPE Insight CMU (Cluster Management Utility) for managing the cluster and with HPE Vertica for SQL on Hadoop.
- Worked on configuring the performance tuning parameters used during the benchmark.
- Used Tableau Desktop to create visual dashboards of CPU utilization, disk I/O, memory, network I/O and query times from the automated MS Excel charts produced by the PAT (Performance Analysis Tool).
- Loaded the benchmark results, in the form of automated charts, into Tableau Desktop for further data analytics.
- Installed and configured Tableau Desktop on one of the three nodes to connect to the Hortonworks Hive Framework (Database) through the Hortonworks ODBC connector for further analytics of the cluster.
Environment: Hortonworks Data Platform (HDP), Hadoop, HDFS, Spark, Hive, MapReduce, BigBench, Tableau Desktop, Linux.
Confidential, Monroeville, PA
Sr. Big Data/Hadoop Developer
- Collected and aggregated large amounts of web log data from different sources such as web servers, mobile and network devices using Apache Kafka and stored the data in HDFS for analysis.
- Implemented Storm topologies to perform cleansing operations before moving data into Cassandra.
- Developed Java MapReduce programs to transform log data into a structured format.
- Developed optimal strategies for distributing the web log data over the cluster; importing and exporting the stored web log data into HDFS and Hive using Sqoop.
- Implemented Hive Generic UDFs to incorporate business logic into Hive queries.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
- Worked with Spark to create structured data from the pool of unstructured data received.
- Converted Hive queries to Spark SQL, using Parquet as the storage format.
- Implemented Spark RDD transformations and actions to migrate MapReduce algorithms.
- Analyzed the web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration and the most visited page on the website.
- Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Map-Reduce, Pig, Hive, and Sqoop) as well as system specific jobs (such as Java programs and shell scripts).
- Designed and implemented Hive queries and functions for evaluation, filtering, loading and storing of data.
- Worked with ETL (Talend) and data integration jobs designed for IT and BI analysts to schedule.
- Created Hive tables and worked on them using HiveQL.
- Monitored workload, job performance and capacity planning using Cloudera Manager.
- Built applications using Maven and integrated them with CI servers like Jenkins to run build jobs.
- Worked collaboratively with all levels of business stakeholders to architect, implement and test Big Data based analytical solution from disparate sources.
- Involved in complete SDLC of project including requirements gathering, design documents, development, testing and production environments.
- Involved in Agile methodologies, daily scrum meetings, sprint planning.
Environment: Hadoop, HDFS, Map Reduce, Hive, Sqoop, Spark, Scala, Kafka, Oozie, Storm, Cassandra, Maven, Shell Scripting, CDH.
Confidential, Springfield, IL
Big Data/Hadoop Developer
- Created Hive tables, and loaded and analyzed data using Hive queries.
- Developed simple to complex MapReduce jobs using Hive and Pig.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Migrated complex MapReduce programs to in-memory Spark processing using transformations and actions.
- Mentored the analyst and test teams in writing Hive queries.
- Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Worked with Linux systems and RDBMS database on a regular basis in order to ingest data using Sqoop.
- Used ETL (Talend) for extraction, transformation and loading of data from multiple sources.
- Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near real time analysis.
- Used Cassandra Query Language (CQL) to implement CRUD operations on Cassandra.
- Developed and maintained complex outbound notification applications that run on custom architectures, using diverse technologies including Core Java, J2EE, SOAP, XML, JMS, JBoss and Web Services.
- Ran Hadoop streaming jobs to process terabytes of XML-format data.
- Loaded and transformed large sets of structured, semi-structured and unstructured data.
- Generated the datasets and loaded them into the Hadoop ecosystem.
- Assisted in exporting analyzed data to relational databases using Sqoop.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries, Pig scripts and Sqoop jobs.
Environment: Hortonworks, Hadoop, HDFS, Spark, Oozie, Pig, Hive, MapReduce, Sqoop, Cassandra, Linux.
Confidential, Des Moines, IA
- Developed a parser and loader MapReduce application to retrieve data from HDFS and store it in HBase and Hive.
- Imported data from MySQL into HDFS using Sqoop.
- Imported unstructured data into HDFS using Flume.
- Used Oozie to orchestrate the MapReduce jobs that extract the data in a timely manner.
- Wrote MapReduce programs in Java to analyze log data for large-scale data sets.
- Used the HBase Java API in a Java application.
- Automated all jobs, from extracting data from sources such as MySQL to pushing the result sets to the Hadoop Distributed File System.
- Customized the parser and loader application for data migration to HBase.
- Developed Pig Latin scripts to extract the data from the output files to load into HDFS.
- Developed custom UDFs and implemented Pig scripts.
- Implemented MapReduce jobs using the Java API, Pig Latin and HiveQL.
- Participated in the setup and deployment of the Hadoop cluster.
- Hands on design and development of an application using Hive (UDF).
- Responsible for writing Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Supported data analysts in running Pig and Hive queries.
- Worked with HiveQL and Pig Latin.
- Imported and exported data between MySQL/Oracle and Hive using Sqoop.
- Imported and exported data between MySQL/Oracle and HDFS.
- Configured HA cluster for both Manual failover and Automatic failover.
- Designed and built many applications to deal with vast amounts of data flowing through multiple Hadoop clusters, using Pig Latin and Java-based map-reduce.
- Specified the cluster size, resource pool allocation and Hadoop distribution by writing the specifications in JSON format.
- Responsible for defining the data flow within the Hadoop ecosystem and directing the team in implementing it.
- Exported the result set from Hive to MySQL using shell scripts.
- Developed Hive queries for the analysts.
Environment: Apache Hadoop, Hive, Hue Tool, Zookeeper, Map Reduce, Sqoop, Pig, HCatalog, Unix, Java, JSP, Eclipse, Maven, SQL, HTML, XML, Oracle, SQL Server, MySQL
Confidential
Sr. Java Developer
- Involved in development of business domain concepts into Use Cases, Sequence Diagrams, Class Diagrams, Component Diagrams and Implementation Diagrams.
- Implemented various J2EE Design Patterns such as Model-View-Controller, Data Access Object, Business Delegate and Transfer Object.
- Responsible for analysis and design of the application based on MVC Architecture, using open source Struts Framework.
- Involved in configuring Struts, Tiles and developing the configuration files.
- Developed Struts Action classes and Validation classes using Struts controller component and Struts validation framework.
- Used Spring Framework and integrated it with Struts.
- Configured web.xml and struts-config.xml according to the Struts framework.
- Designed a lightweight model for the product using Inversion of Control principle and implemented it successfully using Spring IOC Container.
- Used transaction interceptor provided by Spring for declarative Transaction Management.
- The dependencies between the classes were managed by Spring using the Dependency Injection to promote loose coupling between them.
- Provided connections using JDBC to the database and developed SQL queries to manipulate the data.
- Developed ANT script for auto generation and deployment of the web service.
- Wrote stored procedures and used Java APIs to call them.
- Developed various test cases such as unit tests, mock tests, and integration tests using JUnit.
- Wrote stored procedures, functions and packages.
- Used log4j to perform logging in the applications.
- Involved in Full Life Cycle Development in Distributed Environment Using Java and J2EE framework.
- Responsible for developing and modifying the existing service layer based on the business requirements.
- Involved in designing & developing web-services using SOAP and WSDL.
- Involved in database design.
- Created tables and stored procedures in SQL for data manipulation and retrieval; modified the database using SQL, PL/SQL, stored procedures, triggers and views in Oracle 9i.
- Created User Interface using JSF.
- Involved in integration testing the Business Logic layer and Data Access layer.
- Integrated JSF with JSP and used JSF Custom Tag Libraries to display the value of variables defined in configuration files.
- Involved in JUnit testing of the application using JUnit framework.
- Wrote stored procedures, functions and views to retrieve the data.
- Used Maven builds to wrap around Ant build scripts.
- Used CVS for version control of code and project documents.
- Mentored and worked with team members to ensure that standards and guidelines were followed and tasks were delivered on time.