- Almost 8 years of experience as a Big Data/Hadoop developer with Java skills in the analysis, design, development, testing and deployment of various software applications.
- Experience in importing and exporting data between HDFS and relational database systems using Sqoop.
- Good knowledge of using Hibernate to map Java classes to database tables and of Hibernate Query Language (HQL).
- Hands-on experience in configuring and working with Flume to load data from multiple sources directly into HDFS.
- Experience in developing custom UDFs for Pig and Apache Hive to incorporate Java methods and functionality into Pig Latin and HiveQL.
- Good experience in developing MapReduce jobs in Java/J2EE for data cleansing, transformation, pre-processing and analysis.
- Good knowledge of Amazon Web Services (AWS) concepts like EMR and EC2 web services, which provide fast and efficient processing of Teradata Big Data analytics.
- Experience in collecting log data and JSON data into HDFS using Flume and processing the data using Hive/Pig.
- Experience working with cloud platforms, setting up environments and applications on AWS, and automating code and infrastructure (DevOps) using Chef and Jenkins.
- Extensive experience in developing Spark Streaming jobs using RDDs (Resilient Distributed Datasets), with Spark SQL used as required.
- Experience in developing Java MapReduce jobs for data cleaning and data manipulation as required for the business.
- Strong knowledge of the Hadoop ecosystem, including HDFS, Hive, Oozie, HBase, Pig, Sqoop and ZooKeeper.
- Extensive experience with advanced J2EE frameworks such as Spring, Struts, JSF and Hibernate.
- Installation, configuration and administration experience with Big Data platforms, including Cloudera Manager (Cloudera) and MCS (MapR).
- Extensive experience in working with Oracle, MS SQL Server, DB2 and MySQL.
- Experience working with Hortonworks and Cloudera environments.
- Good knowledge of implementing data processing techniques using Apache HBase to handle data and format it as required.
- Excellent experience in installing and running Oozie workflows and automating parallel job executions.
- Experience with Spark, Spark SQL, Spark Streaming, Spark GraphX and Spark MLlib.
- Extensive development experience in different IDEs like Eclipse, NetBeans, IntelliJ and STS.
- Strong experience in core SQL and RESTful web services (RWS).
- Strong knowledge of NoSQL column-oriented databases like HBase and their integration with Hadoop clusters.
- Good experience in Tableau for data visualization and analysis of large datasets and for drawing conclusions from them.
- Experience in using Python and R for statistical analysis.
- Good knowledge of coding using SQL, SQL*Plus, T-SQL, PL/SQL and stored procedures/functions.
- Worked with Bootstrap, AngularJS, Node.js, Knockout, Ember and Java Persistence API (JPA).
- Experienced in developing applications using Java/J2EE technologies like Servlets, JSP, EJB, JDBC, JNDI, JMS, SOAP, REST and Grails.
- Well versed in working with relational database management systems such as Oracle 12c, MS SQL Server and MySQL.
- Experience with all stages of the SDLC and the Agile development model, from requirements gathering to deployment and production support.
- Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.
Hadoop/Big Data Technologies: Hadoop 2.7/2.5, HDFS, MapReduce, HBase 1.2.4, Pig, Hive 2.0, Hue, Sqoop, Spark 2.0/2.0.2, Impala, Oozie, YARN, Flume 1.7, Kafka, ZooKeeper
Hadoop Distributions: Cloudera 5.9, Hortonworks, MapR
Programming Language: Java, Scala, Python 3.5, SQL, PL/SQL, Shell Scripting, Storm, JSP, Servlets
Frameworks: Spring 4.3, Hibernate, Struts, JSF, EJB, JMS
Confidential, Boston, MA
Tools: MS Visio, JIRA, Confluence, PuTTY, Eclipse Neon, Maven, WinSCP, Notepad++, Spotfire, Tableau, Attivio, draw.io
- As a Big Data Developer, worked on the Hadoop ecosystem, including Hive, HBase, Oozie, Pig, ZooKeeper, Spark Streaming and MCS (MapR Control System), with the MapR distribution.
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Primarily involved in the data migration process using Azure, integrating with a GitHub repository and Jenkins.
- Built code for real-time data ingestion using Java, MapR Streams (Kafka) and Storm.
- Involved in various phases of development; analysed and developed the system following Agile Scrum methodology.
- Worked on Apache Solr, which was used as the indexing and search engine.
- Involved in the development of the Hadoop system and in improving multi-node Hadoop cluster performance.
- Worked on analysing the Hadoop stack and different Big Data tools, including Pig, Hive, the HBase database and Sqoop.
- Developed a data pipeline using Flume, Sqoop and Pig to extract data from weblogs and store it in HDFS.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Worked with different data sources like Avro data files, XML files, JSON files, SQL Server and Oracle to load data into Hive tables.
- Used J2EE design patterns like the Factory and Singleton patterns.
- Used Spark to create structured data from large amounts of unstructured data from various sources.
- Used Amazon EMR to process Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Performed transformations, cleaning and filtering on imported data using Hive, MapReduce and Impala, and loaded the final data into HDFS.
- Developed Python scripts to find SQL injection vulnerabilities in SQL queries.
- Designed and developed POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Responsible for coding MapReduce programs and Hive queries, and for testing and debugging the MapReduce programs.
- Extracted a real-time feed using Spark Streaming, converted it to RDDs, processed the data into DataFrames and loaded the data into Cassandra.
- Involved in data acquisition, data pre-processing and data exploration for a telecommunications project in Scala.
- Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
- Specified the cluster size, resource pool allocation and Hadoop distribution by writing the specifications in JSON format.
- Imported weblogs and unstructured data using Apache Flume and stored the data in a Flume channel.
- Exported event weblogs to HDFS by creating an HDFS sink that deposits the weblogs directly in HDFS.
- Used RESTful web services with MVC for parsing and processing XML data.
- Utilized XML and XSL Transformation for dynamic web-content and database connectivity.
- Involved in loading data from the UNIX file system to HDFS, designing schemas, writing CQL queries and loading data using Cassandra.
- Built an automated build and deployment framework using tools such as Jenkins and Maven.
Confidential, Brea, California
Tools: AWS SDK, PuTTY, Eclipse Mars, Maven, ConceptDraw, SQL Developer, SQL Server, Spotfire, FileZilla, Notepad++
- Involved in analysis, design, system architecture design, process interface design and design documentation.
- Responsible for developing prototypes of the selected solutions and implementing complex big data projects with a focus on collecting, parsing, managing, analysing and visualizing large sets of data using multiple platforms.
- Applied technologies to solve big data problems and to develop innovative big data solutions.
- Developed Spark applications using Scala and Java, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model, which gets data from Kafka in near real time and persists it to Cassandra.
- Responsible for analysing and cleansing raw data by running Hive queries and Pig scripts on the data.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Developed simple to complex MapReduce jobs using Hive and Pig.
- Imported data from various sources into the Cassandra cluster using Sqoop. Created data models for Cassandra from the existing Oracle data model.
- Used the Spark-Cassandra connector to load data to and from Cassandra.
- Worked in Spark and Scala for data analytics. Handled the ETL framework in Spark for writing data from HDFS to Hive.
- Used a framework written in Scala for ETL.
- Developed multiple Spark Streaming and Spark Core jobs with Kafka as the data pipeline system.
- Worked with and learned a great deal from AWS cloud services like EC2, S3 and EBS.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for processing and storing small data sets.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Extensively used ZooKeeper as the job scheduler for Spark jobs.
- Worked on Talend with Hadoop. Worked on migrating Informatica jobs to Talend.
- Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
- Developed Kafka producer and consumer components for real-time data processing.
- Worked on physical transformations of the data model, which involved creating tables, indexes, joins, views and partitions.
- Involved in Cassandra data modelling to create keyspaces and tables in a multi-data-centre DSE Cassandra database.
Confidential, Overland Park, Kansas
- Performed data transformations like filtering, sorting and aggregation using Pig.
- Created Sqoop queries to import data from SQL, Oracle and Teradata into HDFS.
- Created Hive tables to push the data to MongoDB.
- Wrote complex aggregate queries in MongoDB for report generation.
- Developed scripts to run scheduled batch cycles using Oozie and present data for reports.
- Worked on a POC for building a movie recommendation engine based on Fandango ticket sales data using Scala and Spark Machine Learning library.
- Developed a big data ingestion framework to process multi-TB data, including data quality checks and transformations, stored the data in efficient storage formats like Parquet and loaded it into Amazon S3 using the Spark Scala API.
- Implemented automation, traceability and transparency for every step of the process to build trust in data and streamline data science efforts using Python, Java, Hadoop Streaming, Apache Spark, Spark SQL, Scala, Hive and Pig.
- Designed a highly efficient data model for optimizing large-scale queries utilizing Hive complex data types and the Parquet file format.
- Performed data validation and transformation using Python and Hadoop streaming.
- Developed highly efficient Pig Java UDFs utilizing advanced concepts like the Algebraic and Accumulator interfaces to populate ADP Benchmarks cube metrics.
- Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and then into partitioned Hive tables.
- Developed bash scripts to fetch the TLOG file from the FTP server and process it for loading into Hive tables.
- Automated workflows using shell scripts and Control-M jobs to pull data from various databases into the Hadoop data lake.
- Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.
- Used INSERT OVERWRITE to refresh the Hive data with HBase data daily to get fresh data every day and used Sqoop to load data from DB2 into the HBase environment...
- Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala; good experience in using the Spark shell and Spark Streaming.
- Designed, developed and maintained Big Data streaming and batch applications using Storm.
- Created Hive, Phoenix, HBase tables and HBase integrated Hive tables as per the design using ORC file format and Snappy compression.
- Developed Oozie workflows for daily incremental loads, which get data from Teradata and import it into Hive tables.
- Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with historical data.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Developed Pig scripts to transform the data into a structured format and automated them through Oozie coordinators.
- Used Splunk to capture, index and correlate real-time data in a searchable repository from which reports and alerts can be generated.
Confidential, Plano, Texas
Big Data Developer
- Worked on Spark SQL to handle structured data in Hive.
- Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.
- Involved in migrating tables from RDBMS into Hive tables using Sqoop and later generated visualizations using Tableau.
- Worked on complex MapReduce programs to analyse data that exists on the cluster.
- Analysed large data sets by running Hive queries and Pig scripts.
- Wrote Hive UDFs to sort struct fields and return complex data types.
- Worked in AWS environment for development and deployment of custom Hadoop applications.
- Involved in creating shell scripts to simplify the execution of all other scripts (Pig, Hive, Sqoop, Impala and MapReduce) and move data into and out of HDFS.
- Created files and tuned SQL queries in Hive using Hue.
- Involved in collecting and aggregating large amounts of log data using Storm and staging data in HDFS for further analysis.
- Created Hive external tables using the Accumulo connector.
- Managed real-time data processing and real-time data ingestion in MongoDB and Hive using Storm.
- Created custom Solr query segments to optimize search matching.
- Developed Spark scripts using Python shell commands.
- Stored the processed results in the data warehouse and maintained the data using Hive.
- Worked with the Spark ecosystem using Spark SQL and Scala queries on different formats like text files and CSV files.
- Created Oozie workflow and Coordinator jobs to kick off the jobs on time for data availability.
- Worked with NoSQL databases like MongoDB, creating MongoDB tables to load large sets of semi-structured data.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
- Worked with and learned a great deal from Amazon Web Services (AWS) cloud services like EC2, S3 and EMR.
Big Data Developer
- Worked on Hadoop cluster, which ranged from 30 nodes in development stage, 40 nodes in pre-production and 140 nodes in production.
- Responsible for managing data coming from different sources and importing structured and unstructured data.
- Implemented complex MapReduce programs in Java to perform map-side joins using the distributed cache.
- Developed MapReduce programs and Hive queries to analyse shipping patterns and the customer satisfaction index over the history of data.
- Experience in writing Pig user-defined functions and Hive UDFs.
- Pig scripts utilized SequenceFile and HCatalog for better performance.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs such as Java MapReduce, Hive, Pig and Sqoop.
- Created MapReduce programs to handle semi-structured and unstructured data like XML, JSON and Avro data files, and sequence files for log files.
- Used Sqoop to import data from RDBMS to HDFS to achieve the reliability of data.
- Implemented a POC for using Apache Impala for data processing on top of Hive.
- Responsible for managing and reviewing Hadoop log files. Designed and developed data management system using MySQL.
- Used Pig to perform transformations, event joins, bot traffic filtering and some pre-aggregations before storing the data in HDFS.
- Involved in developing Pig Scripts for change data capture and delta record processing between newly arrived data and already existing data in HDFS.
- Supported MapReduce programs running on the cluster.
- Installed and configured proof-of-concept (POC) environments for MapReduce, Hive, Oozie, Flume, HBase and other major components of the Hadoop distributed system.
- Used Flume to transport large amounts of streaming data into HBase.
- Developed MapReduce programs in Java for data analysis and data cleaning.