- 8+ years of professional IT experience in the analysis, design, development, testing, and implementation of Big Data technologies (the Hadoop and Spark ecosystems), Data Warehousing, and AWS, with a background in object-oriented programming.
- 4+ years of comprehensive Big Data experience with Hadoop and its ecosystem components: HDFS, Spark (with Scala and Python), ZooKeeper, YARN, MapReduce, Pig, Sqoop, HBase, Hive, Flume, Oozie, Kafka, Spark Streaming, and Tez.
- Worked on NoSQL databases like MongoDB, HBase, Cassandra.
- Experience in Data Modeling and working with Cassandra Query Language (CQL).
- Experience using Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
- Hands-on experience querying and analyzing Cassandra data through CQL for quick searching, sorting, and grouping.
- Experience implementing Spark solutions to enable real-time reports from Cassandra data.
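As a sketch of the CQL access pattern described above (table, column, and key names are hypothetical), a clustering column lets Cassandra return rows pre-sorted within a partition:

```sql
-- Hypothetical sensor-readings table: rows within a device partition
-- are clustered by timestamp, so CQL reads come back pre-sorted.
CREATE TABLE readings (
    device_id  text,
    event_time timestamp,
    value      double,
    PRIMARY KEY ((device_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- Latest readings for one device, newest first (no extra sort needed).
SELECT event_time, value
FROM   readings
WHERE  device_id = 'sensor-42'
LIMIT  10;
```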
- Hands-on expertise in row-key and schema design for NoSQL databases such as MongoDB.
- Experience extracting data from MongoDB through Sqoop, placing it in HDFS, and processing it.
- Experience with indexing, replication, and sharding in MongoDB; used indexes to support sorted queries.
- Experienced with performing CRUD operations using HBase Java Client API.
- Expertise in implementing ad-hoc queries using HiveQL; good knowledge of creating Hive tables and loading and analyzing data using Hive queries.
- Experienced in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing, and internal/external tables.
- Experienced in using Pig scripts to do transformations, event joins, filters, and some pre-aggregations before storing the data onto HDFS.
- Involved in debugging Pig and Hive scripts and used various optimization techniques in MapReduce jobs. Wrote custom UDFs and UDAFs for Hive and Pig core functionality.
- Worked with relative ease across different methodologies, including Agile, Waterfall, Scrum, and Test-Driven Development (TDD).
- Experience with all stages of the SDLC and Agile Development model right from the requirement gathering to Deployment and production support.
- Hands on experience with AWS components like EC2, S3, Data Pipeline, RDS, RedShift and EMR.
- Imported data from different sources such as AWS S3 and the local file system into Spark RDDs, and worked on Amazon Web Services cloud components (EMR, S3, EC2, Lambda).
- Experience with developing and maintaining Applications written for Amazon Simple Storage, AWS Elastic Beanstalk, and AWS Cloud Formation.
- Hands on experience in working with Flume to load the log data from multiple web sources directly into HDFS.
- Experience in importing and exporting data using Sqoop from Relational Database Systems to HDFS.
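A typical Sqoop invocation for such an import might look like the following; the JDBC URL, credentials, and table names are placeholders, not values from any actual project:

```shell
# Hypothetical import of an RDBMS table into HDFS; every connection
# detail below is an illustrative placeholder.
sqoop import \
  --connect jdbc:mysql://db-host:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4 \
  --as-parquetfile
```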
- Uploaded and processed terabytes of data from various structured and unstructured sources into HDFS (AWS cloud) using Sqoop.
- Good exposure to various file formats (Parquet, Avro, and JSON) and compression codecs (Snappy and Gzip).
- Hands-on experience creating and designing data-ingest pipelines using technologies such as Apache Storm and Kafka.
- Good working experience on Spark (spark streaming, spark SQL) with Scala and Kafka. Worked on reading multiple data formats on HDFS using Scala.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Hands on experience with Spark Core, Spark SQL and Data Frames/Data Sets/RDD API.
- Replaced existing map-reduce jobs and Hive scripts with Spark Data-Frame transformation and actions. Good knowledge on Spark architecture and real-time streaming using Spark with Kafka.
- Experienced working with Spark Streaming, SparkSQL and Kafka for real-time data processing.
- Created dataflow between SQL Server and Hadoop clusters using Apache Nifi.
- Experienced in running query using Impala and used BI tools to run ad-hoc queries directly on Hadoop.
- Experience using Kafka brokers with Spark Streaming to process live streaming data as RDDs.
- Experience developing and scheduling ETL workflows in Hadoop using Oozie, and deploying and managing Hadoop clusters using Cloudera and Hortonworks.
- Used Oozie and Zookeeper operational services for coordinating cluster and scheduling workflows.
- Experience with version control tools like Git, CVS, and SVN.
- Excellent experience in designing and developing Enterprise Applications for J2EE platform using Servlets, JSP, Struts, Spring, Hibernate and Web services.
Hadoop Ecosystem: HDFS, YARN, Scala, Map Reduce, Hive, Pig, Zookeeper, Sqoop, Oozie, Flume, Kafka, Impala, Nifi, MongoDB, Cassandra, HBase.
Databases: Oracle, MS-SQL Server, MySQL, PostgreSQL, NoSQL (HBase, Cassandra, MongoDB), Teradata.
IDE and Tools: Eclipse, NetBeans, Informatica, IBM DataStage, Talend, Maven, Jenkins.
Hadoop Platforms: Hortonworks, Cloudera, Azure, Amazon Web services (AWS).
Operating Systems: Windows XP/2000/NT, Linux, UNIX.
Amazon Web Services: EMR, EC2, S3, RDS, Cloud Search, Data Pipeline, Lambda.
Version Control: GitHub, SVN, CVS.
Packages: MS Office Suite, MS Visio, MS Project Professional.
Java Technologies: Servlets, JavaBeans, JSP, JDBC, and Spring MVC.
Languages: Python, Java, and Scala.
Confidential, Warren, NJ
Sr. Hadoop Developer
- Working on Big Data infrastructure for batch processing as well as real-time processing. Responsible for building scalable distributed data solutions using Hadoop.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Involved in designing Kafka for multi data center cluster and monitoring it.
- Responsible for importing real-time data from various sources into Kafka clusters.
- Responsible for design and development of Spark SQL Scripts based on Functional Specifications.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, and Scala.
- Developed Spark applications using Scala and Java; implemented an Apache Spark data-processing project to handle data from various RDBMS and streaming sources.
- Used Spark SQL on data frames to access hive tables into spark for faster processing of data.
- Working knowledge of Spark RDD, Data Frame API, Data set API, Data Source API, Spark SQL and Spark Streaming.
- Used Different Spark Modules like Spark core, Spark SQL, Spark Streaming, Spark Data sets and Data frames.
- Responsible for developing multiple Kafka Producers and Consumers from scratch as per the software requirement specifications.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved the data in Parquet format in HDFS.
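A minimal sketch of that ingest path, using the newer Structured Streaming API rather than DStreams; it assumes pyspark with the spark-sql-kafka connector on the classpath, and the broker address, topic, and HDFS paths are illustrative placeholders:

```python
# Sketch only: Kafka feed -> Spark -> Parquet on HDFS.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("feed-ingest").getOrCreate()

# Subscribe to the Kafka topic as a streaming DataFrame.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load())

# Kafka delivers the payload as bytes; cast the value column to string.
raw = events.selectExpr("CAST(value AS STRING) AS raw")

# Append each micro-batch to HDFS in Parquet format.
query = (raw.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .start())
query.awaitTermination()
```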
- Worked on the large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive.
- Involved in creating data-lake by extracting customer's data from various data sources to HDFS which include data from Excel, databases, and log data from servers.
- Created various Hive managed and staging tables and joined them as per the requirement. Implemented static partitioning, dynamic partitioning, and bucketing in Hive using internal and external tables.
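A minimal HiveQL illustration of dynamic partitioning and bucketing; the table and column names are invented for the example:

```sql
-- Invented example table showing dynamic partitioning and bucketing.
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE sales_part (
    order_id BIGINT,
    amount   DOUBLE
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (order_id) INTO 8 BUCKETS
STORED AS PARQUET;

-- Dynamic-partition load from a staging table: Hive routes each row
-- to the partition named by the trailing sale_date column.
INSERT OVERWRITE TABLE sales_part PARTITION (sale_date)
SELECT order_id, amount, sale_date
FROM   sales_staging;
```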
- Moved Relational Database data using Sqoop into Hive Dynamic partition tables using staging tables.
- Involved in collecting, aggregating and moving data from servers to HDFS using Apache Flume.
- Used Amazon EMR to process big data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Imported data from AWS S3 into Spark RDD, Performed transformations and actions on RDD's.
- Successfully migrated the data from AWS S3 source to the HDFS sink using Kafka.
- Queried and analyzed data from Cassandra for quick searching, sorting and grouping through CQL.
- Involved in importing data from various sources into the Cassandra cluster using Sqoop.
- Worked on creating data models for Cassandra from Existing Oracle data model.
- Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra as per the business requirement.
- Read log files using Elasticsearch and Logstash, alerted users on issues, and saved the alert details to Cassandra for analysis.
- Used Impala wherever possible to achieve faster results than Hive during data analysis.
- Experienced in running query using Impala and used BI tools to run ad-hoc queries directly on Hadoop.
- Worked extensively on Apache NiFi, building flows to replace the existing Oozie jobs for incremental and full loads, to ingest semi-structured data and REST API data into Hadoop, and to automate all NiFi flows to run incrementally.
- Created Nifi flows to trigger spark jobs and used put email processors to get notifications if there are any failures.
- Implemented Apache Nifi flow topologies to perform cleansing operations before moving data into HDFS.
- Used Git as the version control tool to manage the work process.
- Worked on apache Solr for indexing and load balanced querying to search for specific data in larger datasets.
- Implemented the workflows using Apache Oozie framework to automate tasks. Used Zookeeper to co-ordinate cluster services.
- Experience using version control tools like GitHub to share code among team members.
- Involved in daily SCRUM meetings to discuss the development/progress and was active in making scrum meetings more productive.
Environment: Hadoop, Map Reduce, HDFS, Hive, Cassandra, Sqoop, Oozie, SQL, Kafka, Spark, Scala, Java, AWS, GitHub, Solr, Impala.
Confidential, Bellevue, WA
- Involved in implementation of Hadoop Cluster and Hive for Development and Test Environment.
- Worked on analyzing the Hadoop cluster and different big data analytic tools, including Pig, Hive, and MongoDB. Extracted files from MongoDB through Sqoop and placed them in HDFS for processing.
- Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
- Loaded data into Spark RDDs and performed in-memory computation to generate faster output responses.
- Created the Spark SQL context to load data from Hive tables into RDDs for performing complex queries and analytics on the data present in the data lake.
- Developed a NiFi workflow to pick up data from an SFTP server and send it to a Kafka broker. Loaded DStream data into Spark RDDs and performed in-memory computation to generate output responses.
- Used Spark Streaming to collect data from Kafka in near real time, perform the necessary transformations and aggregation to build the common learner data model, and store the data in a NoSQL store (MongoDB).
- Wrote queries in MongoDB to generate reports to display in the dash board.
- Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication and schema design.
- Used MongoDB to store big data and applied aggregation operations such as $match, $sort, and $group in MongoDB.
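The semantics of such a $match/$sort/$group pipeline can be sketched in plain Python; the documents and field names below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical order documents standing in for a MongoDB collection.
orders = [
    {"status": "A", "cust_id": "c1", "amount": 50},
    {"status": "A", "cust_id": "c2", "amount": 75},
    {"status": "B", "cust_id": "c1", "amount": 20},
    {"status": "A", "cust_id": "c1", "amount": 25},
]

def aggregate(docs):
    # $match: keep only documents with status "A".
    matched = [d for d in docs if d["status"] == "A"]
    # $sort: order by amount, descending.
    matched.sort(key=lambda d: d["amount"], reverse=True)
    # $group: sum amount per cust_id.
    totals = defaultdict(int)
    for d in matched:
        totals[d["cust_id"]] += d["amount"]
    return dict(totals)

print(aggregate(orders))  # {'c2': 75, 'c1': 75}
```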
- Hands-on experience writing custom UDFs and custom input and output formats; created Hive tables, loaded values, and generated ad-hoc reports from the table data.
- Showcased strong understanding on Hadoop architecture including HDFS, MapReduce, Hive, Pig, Sqoop and Oozie.
- Worked extensively on Hive to create, alter and drop tables and involved in writing hive queries.
- Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and AWS cloud.
- Successfully migrated the data from AWS S3 source to the HDFS sink using Flume.
- Used Sqoop to import data from RDBMS to HDFS cluster using custom scripts.
- Involved in moving log files generated from various sources to HDFS for further processing through Flume.
- Used Pig as an ETL tool to do transformations, event joins, filtering, and some pre-aggregations before storing the data onto HDFS.
- Wrote Hive and Pig scripts for joining the raw data with the lookup data and for some aggregative operations as per the business requirement.
- Good knowledge of using NiFi to automate data movement between different Hadoop systems. Collected data using Spark Streaming from an AWS S3 bucket in near real time and performed the necessary transformations and aggregations.
- Strong experience in working with EMR and setting up environments on Amazon AWS EC2 instances.
- Processed unstructured files such as XML and JSON using a custom-built Java API and pushed them into MongoDB.
- Wrote MapReduce programs to convert text files into Avro and load them into Hive tables.
- Extensively used ZooKeeper in coordinating the scheduling of Spark jobs.
- Involved in analysis of source systems, business requirements and identification of business rule and responsible for developing, support and maintenance for the ETL process using Informatica.
- Created / updated ETL design documents for all the Informatica components changed.
- Migrated HiveQL queries on structured data into Spark QL to improve performance.
- Involved in running all the Hive scripts through Hive, Impala, Hive on Spark, and some through Spark SQL.
- Set up SolrCloud for distributed indexing and search.
- Worked on Solr configuration and customizations based on requirements.
Environment: Hadoop, Map Reduce, Yarn, Hive, Pig, Flume, Sqoop, AWS, Core Java, Spark, Scala, MongoDB, Hortonworks, Elastic Search 5.x, Eclipse.
Confidential, Wilmington, NC
Big Data Developer
- Imported and exported data into HDFS, Pig, Hive, and HBase using Sqoop.
- Experience installing Hadoop clusters using the Cloudera distribution.
- Responsible for analyzing large data sets and deriving customer usage patterns by developing new MapReduce programs.
- Wrote MapReduce code to parse data from various sources and store the parsed data in HBase and Hive.
- Created HBase tables to store different formats of data as a backend for user portals.
- Successfully migrated a legacy application to a big data application using Hive/Pig/HBase at the production level.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data, including Avro, sequence files, and XML files.
- Involved in gathering the requirements, designing, development and testing.
- Implemented helper classes that access HBase directly from Java using the HBase Java client API to perform CRUD operations.
- Handled time-series data in HBase, storing it and performing time-based analytics to improve query retrieval time.
- Integrated MapReduce with HBase to bulk-load data into HBase using MapReduce programs.
- Developed simple and complex MapReduce programs in Java for Data Analysis.
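The map/shuffle/reduce flow such programs follow can be illustrated with a toy in-memory word count; this is a sketch of the programming model in Python, not the production Java code:

```python
from itertools import groupby
from operator import itemgetter

# Toy in-memory MapReduce over log lines: map emits (word, 1) pairs,
# shuffle sorts and groups by key, reduce sums the counts per key.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    return {key: sum(values) for key, values in grouped}

lines = ["error warn error", "warn info"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'error': 2, 'info': 1, 'warn': 2}
```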
- Hands-on experience using Flume for real-time data processing.
- Loaded data from various data sources into HDFS using Flume; implemented Flume interceptors to filter the input data down to only the records needed for analytics.
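A minimal Flume agent configuration illustrating an interceptor-based filter of this kind; the agent name, paths, and regex are placeholders:

```properties
# Illustrative Flume agent: tail an application log into HDFS,
# keeping only lines that match an interceptor regex.
agent.sources = src1
agent.channels = ch1
agent.sinks = hdfs1

agent.sources.src1.type = exec
agent.sources.src1.command = tail -F /var/log/app/app.log
agent.sources.src1.interceptors = filt
agent.sources.src1.interceptors.filt.type = regex_filter
agent.sources.src1.interceptors.filt.regex = ERROR
agent.sources.src1.channels = ch1

agent.channels.ch1.type = memory

agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.hdfs.path = hdfs:///logs/app/%Y-%m-%d
agent.sinks.hdfs1.hdfs.fileType = DataStream
agent.sinks.hdfs1.channel = ch1
```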
- Developed Pig UDFs to pre-process the data for analysis and developed Hive scripts implementing dynamic partitions.
- Created Hive tables to store the processed results in a tabular format.
- Involved in loading and transforming large sets of Structured, Semi-Structured and Unstructured data and analyzed them by running Hive queries and PIG scripts using bags and tuples.
- Experienced in managing and reviewing Hadoop log files; used Pig as an ETL tool to do transformations, event joins, and some pre-aggregations before storing the data onto HDFS.
- Developed Pig scripts for data analysis and extended its functionality by developing custom UDF's.
- Worked with cloud services like AZURE and involved in ETL, Data Integration and Migration.
- Wrote Lambda functions in Python which invoke scripts to perform various transformations and analytics on large data sets in EMR clusters.
- Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
- Designed a technical solution for real-time analytics using HBase.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
Environment: Java, Hadoop, HDFS, Hive, HBase, Pig, SQOOP, Oozie, MySQL, MapReduce, Linux, Eclipse, Cloudera.
Confidential, Panama City, FL
Jr. Hadoop Developer
- Involved in loading data from UNIX file system to HDFS.
- Worked extensively on Hive and written Hive UDFs.
- Importing and exporting data into HDFS and Hive using Sqoop.
- Handled importing of data from various data sources, performed transformations using Hive, MapReduce, and loaded data into HDFS.
- Analyzed the data by performing Hive queries and running Pig scripts to know user behavior.
- Exported the patterns analyzed back into Teradata using Sqoop.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Used Sqoop to load data from DB2 into the HBase environment.
- Overwrote the Hive data with HBase data daily via insert-overwrite to provide fresh data every day.
- Scheduled all the Bash scripts using the Resource Manager scheduler.
- Analyzed the web log data using the HiveQL to extract number of unique visitors per day, page views, and visit duration.
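Such a weblog analysis can be expressed in HiveQL roughly as follows; the table and column names are hypothetical:

```sql
-- Hypothetical weblog table; column names are illustrative.
SELECT to_date(event_time)        AS visit_day,
       COUNT(DISTINCT visitor_id) AS unique_visitors,
       COUNT(*)                   AS page_views
FROM   web_logs
GROUP  BY to_date(event_time);
```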
- Found solutions to bottlenecks in high-latency Hive queries by analyzing log messages.
- Developed Oozie Workflows for daily incremental loads, which gets data from Teradata and then imported into hive tables.
- Extensively used MapReduce Design Patterns to solve complex MapReduce programs.
Environment: Hadoop, HDFS, Map Reduce, Sqoop, Hive, Pig, HBase, DB2, Oozie, MySQL and Eclipse.
- Involved in gathering and analyzing system requirements.
- Participated in major phases of the software development cycle, including requirement gathering, analysis and design, development, and unit testing, using Agile/Scrum methodologies.
- Used multithreading and exception handling in the development of applications.
- Developed the application based on the MVC-II architecture using the Apache Struts framework.
- Migrated some modules from VB 6.0 to Java.
- Designed and developed user interface screens using HTML, jQuery, and JSP.
- Created and maintained the configuration of the Application Framework.
- Used Eclipse as the Java IDE for creating Action classes and XML files.
- Implemented the application with the Spring Framework for dependency injection and to provide abstraction between the presentation layer and the persistence layer.
- Developed multiple batch jobs using Spring Batch to import files of different formats such as XML and CSV.
- Involved in development of application using Rule Engine (Drools).
- Used Rule Engines in applications to replace and manage some of the business logic.
- Wrote business rules using Drools and business logic for processing customs declarations.
- Monitored log files and troubleshot environment variables on Linux boxes.
- Involved in maintenance of the application.
- Involved in all the phases of SDLC including Requirements Collection, Design & Analysis of the Customer Specifications, Development and Customization of the Application.
- Communicated with Project manager, client, stakeholder and scrum master for better understanding of project requirements and task delivery by using Agile Methodology.
- Involved in implementing all components of the application including database tables, server-side Java Programming and Client-side web programming.
- Designed and developed Web Services to provide services to the various clients using SOAP and WSDL.
- Involved in preparing technical Specifications based on functional requirements.
- Involved in development of new command objects and enhancement of existing command objects using Servlets and Core Java.
- Identified and implemented the user actions (Struts Action classes) and forms (Struts Form classes) as part of the Struts framework.
- Responsible for coding SQL Statements and Stored procedures for back end communication using JDBC.
- Involved in documentation, review, analysis and fixed post production issues.
Environment: Java, J2EE, JDBC, Struts, JSP, jQuery, SOAP, Servlets, SQL, HTML, CSS, Java Script, DB2.