- 8+ years of professional experience involving project development, implementation, deployment and maintenance using Java/J2EE and Big Data related technologies.
- Hadoop Developer with 5 years of working experience in designing and implementing complete end-to-end Hadoop-based data analytical solutions using HDFS, MapReduce, Spark, Yarn, Kafka, Pig, Hive, Sqoop, Storm, Flume, Oozie, Impala, HBase etc.
- Experience in creating data ingestion pipelines, data transformations, data management, data governance and real time streaming at an enterprise level.
- Experience in importing data in different formats from various RDBMS databases into HDFS and HBase, and exporting it back, using Sqoop.
- Exposure to Cloudera development environment and management using Cloudera Manager.
- Experience in analyzing data using HiveQL, Pig Latin, HBase, MongoDB and custom MapReduce programs in Java.
- Experience in extending Hive and Pig core functionality by writing custom UDFs using Java.
- Developed analytical components using Spark and Spark Streaming; background with traditional databases such as Oracle, SQL Server and MySQL.
- Hands-on experience in storing and processing unstructured data using NoSQL databases like HBase and MongoDB.
- Hands-on experience in setting up workflows using the Apache Oozie workflow engine for managing and scheduling Hadoop jobs.
- Experience in creating complex SQL queries, SQL tuning, and writing PL/SQL blocks such as stored procedures, functions, cursors, indexes, triggers and packages.
- Good knowledge of database connectivity (JDBC) for databases like Oracle, DB2, SQL Server, MySQL, NoSQL, MS Access.
- Extensive experience in creating real-time data streaming solutions using Apache Spark/Spark Streaming and Kafka.
- Worked on a prototype Apache Spark Streaming project and converted our existing Java Storm topology.
- Proficient in visualizing data using Tableau, QlikView, MicroStrategy and MS Excel.
- Experience in developing ETL scripts for data acquisition and transformation using Informatica and Talend.
- Used Maven extensively for building MapReduce jar files and deployed them to Amazon Web Services (AWS) on EC2 virtual servers; experience with build scripts for continuous integration systems like Jenkins.
- Experienced in Java Application Development, Client/Server Applications, Internet/Intranet based applications using Core Java, J2EE patterns, Spring, Hibernate, Struts, JMS, Web Services (SOAP/REST), Oracle, SQL Server and other relational databases.
- Experience writing Shell scripts in Linux OS and integrating them with other solutions.
- Experienced in using agile methodologies including extreme programming, SCRUM and Test Driven Development (TDD).
Big Data Technologies: Hive, MapReduce, Kafka, Spark, Cassandra, NiFi, Pig, HCatalog, Phoenix, Falcon, Sqoop, Flume, Zookeeper, Mahout, Oozie, Avro, HBase, Storm, HDP 2.4, 2.6, CDH 5.x
DevOps Tools: Chef, Puppet
Monitoring Tools: Cloudera Manager, Ambari
Languages: Java, Hive, Pig, SQL, Python
Application Servers: Apache Tomcat, WebLogic Server, WebSphere
Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure and Google Cloud
Databases: Oracle 11g, MySQL, MS SQL Server, IBM DB2.
Dataflow tools: NiFi, Airflow.
NoSQL Databases: HBase, Cassandra, MongoDB
Operating Systems: Linux, UNIX, Mac OS X 10.9.5, Windows NT/98/2000/XP/Vista, Windows 7, Windows 8.
Networks: HTTP, HTTPS, FTP, UDP, TCP/IP, SNMP, SMTP.
Confidential - Atlanta, Georgia
- Worked on Hadoop cluster scaling from 4 nodes in the development environment to 8 nodes in the pre-production stage and up to 24 nodes in production.
- Involved in the complete implementation lifecycle; specialized in writing custom MapReduce, Pig and Hive programs.
- Used NiFi to transfer data from source to destination and was responsible for handling batch as well as real-time Spark jobs through NiFi.
- Responsible for installation and configuration of Hadoop ecosystem components using the CDH 5.2 distribution. Built a logger service using Kafka to log data in real time from NiFi queues.
- Developed microservices using Python scripts with the Spark DataFrame API for the semantic layer.
- Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and unstructured data.
- Built the complete data ingestion pipeline using NiFi, which POSTs flow files through the InvokeHTTP processor to our microservices hosted inside Docker containers.
- Processed input from multiple data sources in the same reducer using GenericWritable and MultipleInputs. Performed data profiling and transformation on the raw data using Pig and Python.
- Visualized HDFS data for the customer using a BI tool with the help of the Hive ODBC driver.
- Customized the BI tool for the management team, which performs query analytics using HiveQL.
- Used Sqoop to import data from MySQL into HDFS on a regular basis.
- Created partitions and buckets based on state for further processing using bucket-based Hive joins.
- Created Hive generic UDFs to process business logic that varies based on policy.
- Moved relational database data into Hive dynamic-partition tables using Sqoop and staging tables. Monitored the cluster using Cloudera Manager.
- Transformed input XML data to JSON as required by downstream applications.
- Built data governance processes, procedures, and controls for the data platform using NiFi.
- Created real-time data streaming solutions and batch-style, large-scale distributed computing applications using Apache Spark, Spark Streaming, Kafka and Flume.
- Designed and developed architecture for a data services ecosystem spanning relational, NoSQL and Big Data technologies. Analyzed the requirements to develop the framework.
- Loaded and transformed large sets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts.
- Developed Python Spark Streaming scripts to load raw files and corresponding.
- Implemented PySpark logic to transform and process data in various formats such as XLS, JSON, and TXT. Wrote Python scripts to fetch S3 files using the Boto3 module.
- Built scripts to load PySpark-processed files into the Redshift database and used diverse PySpark logic.
- Developed scripts to monitor and capture the state of each file passing through.
- Developed MapReduce programs to cleanse data in HDFS obtained from heterogeneous data sources. Processed metadata files into AWS S3 and an Elasticsearch cluster.
- Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs and used Oozie Operational Services for batch processing and dynamic workflow scheduling.
- Work included migrating existing applications and developing new applications using AWS cloud services. Developed Python scripts to get the recent S3 keys from Elasticsearch.
- Uploaded clickstream data from Kafka to HDFS, HBase, and Hive by integrating with Storm.
- Extracted data from SQL Server to create automated visualization reports and dashboards on Tableau.
- Involved in collecting metrics for Hadoop clusters using Ganglia and Ambari.
Environment: HDP 2.6, Jenkins, Git, NiFi 1.8.0, Spark, MapReduce, Talend, Hive, Pig, Zookeeper, Kafka, HBase, VMware ESX Server, Flume, Sqoop, Oozie, Kerberos, Sentry, AWS, CentOS.
Confidential - Austin, TX
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
- Monitored and reviewed Hadoop log files and wrote queries to analyze them.
- Conducted POCs and mocks with the client to understand the business requirements; attended defect triage meetings with the UAT and QA teams to ensure defects were resolved in a timely manner. Used the AWS CLI for data transfers to and from Amazon S3 buckets.
- Used Amazon EMR to process big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Worked with Kafka on a proof of concept for log processing on a distributed system. Stored data in tabular formats using Hive tables and Hive SerDes.
- Studied the existing enterprise data warehouse setup and provided design and architecture suggestions for converting it to Hadoop using MRv2, Hive, Sqoop and Pig Latin.
- Loaded data from different data sources (such as Teradata and DB2) into HDFS using Sqoop and Flume, and then into partitioned Hive tables.
- Developed HQL queries, mappings, tables and external tables in Hive for analysis across different banners and worked on partitioning, optimization, compilation and execution.
- Wrote complex queries to load data into HBase and was responsible for executing Hive queries using the Hive command line and Hue. Used Spark for series of dependent jobs and for iterative algorithms. Developed a data pipeline using Kafka and Spark Streaming to store data in HDFS.
- Integrated Apache Kafka with Elasticsearch using the Kafka Elasticsearch connector to stream all messages from different partitions and topics into Elasticsearch for search and analysis.
- Used Pig to parse the data and store it in Avro format.
- Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis. Developed Kafka producers and consumers for message handling.
- Created and maintained shell and Python scripts for automating various processes; optimized MapReduce code and Pig scripts and carried out performance tuning and analysis.
- Executed Hadoop jobs on AWS EMR using programs and data stored in S3 buckets.
- Involved in creating UNIX shell scripts for database connectivity and for executing queries in parallel.
- Involved in importing real-time data to Hadoop using Kafka and implemented Oozie jobs for daily imports. Developed Apache Pig scripts and Hive scripts to process the HDFS data.
- Designed and implemented incremental imports into Hive tables.
- Involved in unit testing and delivered unit test plans and results documents using JUnit and MRUnit. Developed custom aggregate UDFs in Hive to parse log files.
- Involved in collecting, aggregating and moving data from servers to HDFS using Apache Flume.
- Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
- Exported the analyzed data to the relational databases using Sqoop for visualization.
- Analyzed large and critical datasets using HDFS, HBase, MapReduce, Hive, Hive UDFs, Pig, Sqoop, and Zookeeper. Worked on debugging and performance tuning of Hive and Pig jobs.
- Identified the required data to be pulled into HDFS and created Sqoop scripts, scheduled to run periodically, to migrate data to the Hadoop environment.
- Involved in file processing using Pig Latin.
- Created MapReduce jobs involving combiners and partitioners to deliver better results and worked on application performance optimization for an HDFS cluster.
Environment: HDP, Ambari, HDFS, MapReduce, Yarn, Hive, NiFi, Flume, Pig, Zookeeper, Tez, Oozie, MySQL, Puppet, and RHEL.
Confidential - NY
- Developed Spark scripts in Java as per requirements.
- Worked with the Spark Core, Spark Streaming and Spark SQL modules of Spark.
- Developed multiple POCs using Spark, deployed them on the Yarn cluster, and compared the performance of Spark with Hive and SQL/Teradata.
- Developed Kafka producers and consumers, Cassandra clients and Spark components, along with components on HDFS and Hive.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Developed the code for importing and exporting data into HDFS and Hive using Sqoop.
- Automated Sqoop incremental imports by using Sqoop jobs and automated the jobs using Oozie.
- Responsible for writing Hive Queries for analyzing data in Hive warehouse using HQL.
- Involved in defining job flows using Oozie for scheduling and managing Apache Hadoop jobs.
- Developed Python and shell scripts to schedule the processes running on a regular basis.
- Developed several advanced MapReduce programs in Java as part of functional requirements for Big Data.
- Developed Hive user-defined functions in Java, compiled them into jars, added them to HDFS and executed them with Hive queries.
- Experienced in managing and reviewing Hadoop log files.
- Tested and reported defects within an Agile methodology.
- Installed Hadoop ecosystem components (Hive, Pig, Sqoop, HBase, Oozie) on top of the Hadoop cluster.
- Involved in importing data from SQL databases into HDFS and Hive for analytical purposes.
- Implemented the workflows using Oozie framework to automate tasks.
Environment: CDH5, Hue, Eclipse, CentOS Linux, HDFS, MapReduce, Kafka, Python, Scala, Java, Hive, Sqoop, Spark, Spark-SQL, Spark-Streaming, HBase, Oracle 10g, Oozie, Red Hat Linux.
- Worked on developing backend services for various dealer services.
- Developed the service layer using Core Java, Spring, and REST and SOAP web services.
- Designed and developed Jersey-based RESTful web services for customer health history analysis services.
- Developed the service layer with REST and SOAP web services and tested the services using the SoapUI tool with test data for requests.
- Developed SOAP web services with a contract-first approach by designing the XSD and WSDL and integrating with the build to generate the necessary artifacts.
- Implemented Data Access layer using DAO pattern with JPA to use Hibernate as ORM tool.
- Used Spring aspects to implement cross-cutting functionality such as logging.
- Designed and implemented transaction management using Spring AOP.
- Designed and implemented application using JSP, Spring MVC, Spring IOC, Spring Annotations, Spring AOP, Spring Transactions, Hibernate, SQL, Maven, Oracle.
- Used the Hibernate framework for back-end development and Spring dependency injection for middle-layer development.
- Developed JMS messaging queues for the asynchronous communication.
- Installed and configured JBoss on a Linux server for Dev and QA environments.
- Deployed the application on JBoss in Dev and QA environments.
- Used SVN for version control. Resolved merge conflicts while working on different branches.
- Developed a logging component using Apache Log4J to log messages and errors.
- Debugged the QA issues and tested on QA and UAT servers.
- Used PuTTY to connect to Linux servers and analyze application logs to resolve application errors and issues.
- Used Maven as the build framework and Jenkins for continuous integration.
Environment: Core Java, Spring (Core, ORM), JSP, AJAX, HTML5, CSS3, Spring MVC, Eclipse IDE, JBoss, Oracle, Maven, Windows 7, SVN, Linux.