- 8+ years of software development experience, including 4+ years as a Hadoop developer working with Big Data/Hadoop/Spark technologies.
- Experience in developing applications that perform large scale distributed data processing using big data ecosystem tools like HDFS, YARN, Sqoop, Flume, Kafka, MapReduce, Pig, Hive, Spark, Spark SQL, Spark Streaming, HBase, Cassandra, MongoDB, Mahout, Oozie, and AWS.
- Good functional experience with various Hadoop distributions such as Hortonworks, Cloudera, and Amazon EMR.
- Good understanding of data ingestion tools such as Kafka, Sqoop, and Flume.
- Experienced in performing in-memory, real-time data processing using Apache Spark.
- Good experience in developing multiple Kafka Producers and Consumers as per business requirements.
- Extensively worked on Spark components like Spark SQL, MLlib, GraphX, and Spark Streaming.
- Configured Spark Streaming to receive real-time data from Kafka, persist the streamed data to HDFS, and process it using Spark and Scala.
- Developed quality code adhering to Scala coding standards and best practices.
- Experience migrating MapReduce programs to Spark RDD transformations and actions to improve performance.
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Extensive working experience with data warehousing technologies such as Hive.
- Good experience with partitioning and bucketing concepts; designed and managed partitions and created external tables in Hive to optimize performance.
- Strong experience in data analysis using HiveQL, Pig Latin, HBase, and custom MapReduce programs in Java.
- Expertise in writing Hive and Pig queries for data analysis to meet the business requirement.
- Extensively worked on Hive and Sqoop for sourcing and transformations.
- Extensive work experience in creating UDFs, UDAFs in Pig and Hive.
- Good experience in using Impala for data analysis.
- Experience on NoSQL databases such as HBase, Cassandra, MongoDB, and DynamoDB.
- Implemented CRUD operations using CQL on top of Cassandra.
- Experience creating data models for clients' transactional logs; analyzed data in Cassandra tables for quick searching, sorting, and grouping using the Cassandra Query Language (CQL).
- Expert knowledge on MongoDB data modeling, tuning, disaster recovery and backup.
- Experience monitoring document growth and estimating storage size for large MongoDB clusters based on data life-cycle management.
- Hands-on experience with ad-hoc queries, indexing, replication, load balancing, and aggregation in MongoDB.
- Expertise in relational databases like MySQL, SQL Server, DB2, and Oracle.
- Solid understanding of Solr for building search over unstructured data in HDFS.
- Experience in cloud platforms like AWS, Azure.
- Hands-on exposure to the AWS Command Line Interface and AWS Data Pipeline.
- Extensively worked on AWS services such as EC2, S3, EMR, CloudFormation, CloudWatch, and Lambda.
- Expertise in writing MapReduce programs in Java for data extraction, transformation, and aggregation from file formats such as XML, JSON, CSV, Avro, and Parquet.
- Good knowledge of Hadoop security requirements and experience integrating with Kerberos authentication and authorization infrastructure.
- Experience with the ELK stack and Solr for building search over unstructured data in HDFS.
- Implemented ETL operations on Big Data platforms.
- Experience configuring ZooKeeper to coordinate cluster servers and maintain data consistency.
- Involved in identifying job dependencies to design workflows for Oozie and YARN resource management.
- Experience working with Core Java, J2EE, JDBC, ODBC, JSP, Java Eclipse, EJB and Servlets.
- Strong experience on Data Warehousing ETL concepts using Informatica, and Talend.
- Experience in using bug tracking and ticketing systems such as JIRA, and Remedy.
- Hands-on experience with build and testing tools such as Maven, Ant, and JUnit.
- Highly involved in all facets of SDLC using Waterfall and Agile Scrum methodologies.
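The Hive partitioning and pruning concepts above can be illustrated with a stdlib-only Python sketch (hypothetical clickstream records and column names), showing why a filter on the partition column avoids a full scan:

```python
from collections import defaultdict

def write_partitioned(records, partition_key):
    """Group records into per-value 'partition directories', mimicking how
    Hive lays out an external table partitioned on one column."""
    partitions = defaultdict(list)
    for rec in records:
        partitions[rec[partition_key]].append(rec)
    return dict(partitions)

def query_with_pruning(partitions, partition_value):
    """A predicate on the partition column reads only one partition
    (partition pruning) instead of scanning every record."""
    return partitions.get(partition_value, [])

# Hypothetical records partitioned by event date.
records = [
    {"dt": "2016-01-01", "user": "a"},
    {"dt": "2016-01-01", "user": "b"},
    {"dt": "2016-01-02", "user": "c"},
]
parts = write_partitioned(records, "dt")
hits = query_with_pruning(parts, "2016-01-01")
```

This is a single-process model of the layout only; in Hive the partitions are HDFS directories and pruning happens in the query planner.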
Big Data/Hadoop: HDFS, MapReduce, Pig, Hive, Spark, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, YARN, Hue.
Hadoop Distributions: Cloudera (CDH4, CDH5), Hortonworks, EMR
Programming Languages: C, Java, Python, Scala.
Database/NoSQL: HBase, Cassandra, MongoDB, MySQL, Oracle, DB2, PL/SQL, Microsoft SQL Server
Cloud Services: AWS, Azure
Frameworks: Spring, Hibernate, Struts
Java Technologies: Servlets, JavaBeans, JSP, JDBC, EJB
Application Servers: Apache Tomcat, WebSphere, WebLogic, JBoss
ETL Tools: Informatica, Talend
Confidential, Austin, TX
Sr. Hadoop/Spark Developer
- Worked on Kafka and Spark integration for real time data processing.
- Responsible for design & deployment of Spark SQL scripts and Scala shell commands based on functional specifications.
- Used Kafka for log aggregation to collect physical log files from servers and put them in the HDFS for further processing.
- Configured, designed, implemented, and monitored Kafka clusters and connectors.
- Developed Kafka producer and consumer components for real time data processing.
- Implemented Spark applications using Scala and the Spark SQL API for faster data processing.
- Used Spark for interactive queries, stream processing, and integration with Cassandra for high data volumes.
- Worked with Spark to create structured data from the pool of unstructured data received.
- Wrote Spark scripts to accept events from the Kafka producer and emit them into Cassandra.
- Performed unit testing using JUnit.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Worked on AWS cloud services like EC2, S3, EBS, RDS and VPC.
- Wrote Java code to format XML documents and upload them to the Solr server for indexing.
- Analyzed data by performing Hive queries on the existing database; designed and implemented static and dynamic partitioning and bucketing in Hive.
- Created generic Hive UDFs to process business logic that varies by policy.
- Loaded and transformed large sets of structured and semi-structured data using Hive.
- Involved in creating Hive tables, loading data, and writing Hive queries that run internally as MapReduce jobs.
- Extended Hive and Pig core functionality by writing custom UDFs.
- Migrated MapReduce jobs into Spark RDD transformations on AWS.
- Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS.
- Implemented intermediate functionalities like events or records count from the Kafka topics by writing Spark programs in Java and Scala.
- Worked on Cassandra, creating tables to load large sets of semi-structured data coming from various sources.
- Involved in Cassandra data modeling to create keyspaces and tables in a multi-data-center DSE Cassandra database.
- Ingested the data from Relational databases such as MySQL, Oracle, and DB2 to HDFS using Sqoop.
- Involved in identifying job dependencies to design workflow for Oozie and YARN resource management.
- Worked on Talend with Hadoop and improved the performance of the Talend jobs.
- Designed and developed various SSIS (ETL) packages to extract and transform data, and was involved in scheduling SSIS packages.
- Set up Solr for searching and routing the log data.
- Used ZooKeeper extensively for coordinating and scheduling Spark jobs.
- Added security to the cluster by integrating Kerberos.
- Understanding of Kerberos authentication in Oozie workflows for Hive and Cassandra.
- Utilized container technology (Docker) along with Mesos and Aurora to manage the whole cluster of hosts.
- Created Tableau visualization for the internal management.
- Experience with various compression techniques such as LZO, GZip, and Snappy.
- Used Jira for ticket tracking and workflow management.
- Involved in sprint planning, code review and daily standup meetings to discuss the progress of the application.
- Effectively followed Agile Scrum methodology to design, develop, deploy and support solutions that leverage the client big data platform.
Environment: Apache Spark, Scala, Hive, Cloudera, Apache Kafka, Sqoop, Cassandra, MySQL, Oracle, DB2, Spark Streaming, Java 8, Python, Agile, Talend, AWS (EC2, S3, EBS, RDS, VPC), ETL, Tableau, Kerberos, Jira, Mesos, Solr.
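The compression work noted above can be sketched with Python's standard-library codecs; gzip and bz2 stand in here for codecs like LZO and Snappy, which require third-party bindings. The sample log line is hypothetical:

```python
import bz2
import gzip

# Highly repetitive log data, the typical case where Hadoop-side
# compression pays off.
log_block = b"127.0.0.1 - GET /index.html 200\n" * 500

gz = gzip.compress(log_block)
bz = bz2.compress(log_block)

# Both codecs must round-trip losslessly; the achieved ratio
# differs by codec and by how repetitive the input is.
assert gzip.decompress(gz) == log_block
assert bz2.decompress(bz) == log_block
ratio_gz = len(gz) / len(log_block)
```

In a real cluster the codec is chosen per file format and job (e.g. splittability matters for LZO vs. GZip); this only demonstrates the size/round-trip trade-off.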
Confidential, Cincinnati, OH
- Handled importing of data from various data sources into HDFS, and performed transformations using Hive.
- Continuously monitored and managed the Hadoop cluster through HDP (Hortonworks Data Platform).
- Used Flume to stream through the log data from various sources.
- Configured Flume to extract the data from the web server output files to load into HDFS.
- Involved in loading data from UNIX/LINUX file system and FTP to HDFS.
- Analyzed and planned the migration of the applications and monitoring of the Azure Compute Infrastructure using SCOM and OMS.
- Developed and implemented Hive queries and functions for evaluation, filtering, and sorting of data.
- Analyzed the data by performing Hive queries and running Pig scripts.
- Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
- Handled different types of joins in Hive, such as inner join, left outer join, right outer join, and full outer join.
- Involved in developing Impala scripts for ad-hoc queries.
- Defined Accumulo tables and loaded data into tables for near real-time reports.
- Created Hive external tables using the Accumulo connector.
- Developed simple to complex MapReduce jobs using Hive, Pig, and Python.
- Optimized the Hive queries using Partitioning and Bucketing techniques for controlling the data distribution.
- Supported existing MapReduce programs running on the cluster.
- Worked with Linux systems and RDBMS database on a regular basis to ingest data using Sqoop.
- Spun up various AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
- Developed Pig scripts and UDFs as per the business logic.
- Worked on the NoSQL database MongoDB for storing images and URIs.
- Managed and reviewed Hadoop and MongoDB log files.
- Performed data analysis on MongoDB using Hive external tables; exported the analyzed data using Sqoop to generate reports for the BI team.
- Set up Elasticsearch, Logstash, and Kibana for searching and routing the log data.
- Designed and implemented Spark jobs to support distributed data processing.
- Worked on NiFi to automate the data movement between different Hadoop systems.
- Designed and implemented custom NiFi processors that reacted to and processed data in the pipeline.
- Designed cluster coordination services using ZooKeeper.
- Used Amazon DynamoDB to gather and track event-based metrics.
- Designed ETL processes using Informatica to load data from flat files and Excel files.
- Involved in ETL process for design, development, testing and migration to production environments.
- Involved in writing the ETL test scripts and guided the testing team in executing the test scripts.
- Worked on MongoDB for distributed storage and processing.
- Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
- Experience processing large volumes of data and executing processes in parallel using Talend.
- Followed Agile Scrum methodology for the entire project.
- Used Remedy for tracking the work flow and raising the requests.
Environment: HDFS, Flume, Sqoop, Hive, Pig, Oozie, Python, Shell Scripting, SQL, MongoDB, DynamoDB, Linux, Unix, NiFi, AWS (EC2, VPC), Talend, ETL, Elasticsearch, Logstash, Zookeeper, Hortonworks, Agile Scrum, Remedy.
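The Hive join types handled above follow standard SQL semantics; a minimal pure-Python hash-join model (hypothetical tables and column names) shows how the inner and outer variants differ:

```python
def hash_join(left, right, key, how="inner"):
    """Tiny model of Hive join semantics over lists of dicts.
    how: "inner", "left", "right", or "full"."""
    right_by_key = {}
    for r in right:
        right_by_key.setdefault(r[key], []).append(r)
    out, matched_right = [], set()
    for l in left:
        matches = right_by_key.get(l[key], [])
        for r in matches:
            out.append({**l, **r})          # matched rows appear in every join type
            matched_right.add(id(r))
        if not matches and how in ("left", "full"):
            out.append(dict(l))             # unmatched left row survives left/full
    if how in ("right", "full"):
        for rows in right_by_key.values():
            for r in rows:
                if id(r) not in matched_right:
                    out.append(dict(r))     # unmatched right row survives right/full
    return out

# Hypothetical sample tables.
users  = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
orders = [{"uid": 1, "total": 30}, {"uid": 3, "total": 15}]
```

Hive's map-side and reduce-side joins distribute this same logic across the cluster; only the match/survival rules per join type are modeled here.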
Confidential, Cincinnati, OH
- Developed solutions to process data into HDFS, process within Hadoop and emit the summary results from Hadoop to downstream systems.
- Installed and configured Hadoop MapReduce, developed multiple MapReduce jobs for cleansing and preprocessing.
- Wrote Hadoop MapReduce jobs to run on Amazon EMR clusters and created workflows for running the jobs.
- Worked on Sqoop extensively to ingest data from various source systems into HDFS.
- Imported data from different relational data sources like Oracle, MySQL to HDFS using Sqoop.
- Analyzed substantial data sets using Hive queries and Pig scripts.
- Wrote Pig scripts for sorting, joining, and grouping data.
- Integrated multiple sources of data (SQL Server, DB2, MySQL) into Hadoop cluster and analyzed data by Hive-HBase integration.
- Involved in writing optimized Pig Script along with developing and testing Pig Latin Scripts.
- Worked on custom Pig Loaders and Storage classes to work with a variety of data formats such as JSON, Compressed CSV etc.
- Played a major role in working with the team to leverage Sqoop for extracting data from Oracle.
- Solved the small-file problem using SequenceFile processing in MapReduce.
- Implemented counters on HBase data to count total records on different tables.
- Created HBase tables to store variable data formats coming from different portfolios; performed real-time analytics on HBase using the Java API and REST API.
- Developed various custom filters and handled pre-defined filters on HBase data using the API.
- Experienced with different scripting languages such as Python and shell scripts.
- Oozie and ZooKeeper were used to automate job flow and coordinate the cluster, respectively.
- Worked on different file formats like Text files, Parquet, Sequence Files, Avro, Record columnar files (RC).
- Understood complex data structures of different types (structured, semi-structured) and de-normalized them for storage in Hadoop.
- Experienced with working on Avro Data files using Avro Serialization system.
- Kerberos security was implemented to safeguard the cluster.
Environment: HDFS, Pig, MapReduce, Sqoop, Oozie, Zookeeper, HBase, Java Eclipse, Python, MySQL, Oracle, SQL Server, DB2, Shell Scripting, Kerberos, EMR.
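The MapReduce jobs described above can be sketched in the Hadoop Streaming style, where the mapper and reducer are plain scripts exchanging key/value pairs and the framework sorts between them. This single-process word-count model (hypothetical input) captures only the dataflow:

```python
import itertools

def mapper(lines):
    """Map phase: emit a (word, 1) pair per token, as a streaming
    mapper would write tab-separated lines to stdout."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum counts per key. Hadoop delivers the pairs
    grouped and sorted by key; sorted() simulates that shuffle."""
    for word, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

counts = dict(reducer(mapper(["big data", "big cluster"])))
```

On a real cluster the shuffle/sort between the two phases is distributed; the mapper and reducer bodies stay this simple.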
Confidential, Denver, CO
- Designed MapReduce programs to transform log data into structured form and extract details such as user location, age group, and time spent.
- Experienced in using Flume to collect, aggregate, and store the web log data from different sources like web servers and network devices and stored into HDFS.
- Analyzed the Hadoop cluster and various big data analytic tools, including MapReduce, Pig, and Hive.
- Actively monitored systems architecture design and implementation and configuration of Hadoop deployment, backup, and disaster recovery systems and procedures.
- Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
- Designed and developed Data Ingestion component.
- Assisted in upgrading, configuring, and maintaining Hadoop ecosystem components such as Pig, Hive, and HBase.
- Monitored multiple Hadoop clusters environments using Ganglia.
- Analyzed web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration, and the most purchased product on the website.
- Moved bulk data into HBase using MapReduce integration.
- Implemented test scripts to support test driven development and continuous integration.
- Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
- Responsible for managing data coming from different sources.
Environment: Hadoop-HDFS, Pig, Sqoop, HBase, Hive, Flume, MapReduce, Oozie, and MySQL.
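The web-log metrics above (unique visitors per day, page views) can be modeled in plain Python standing in for the HiveQL aggregation; the log records and field names are hypothetical:

```python
from collections import defaultdict

def daily_metrics(log_records):
    """Per-day unique visitors and page views, the same aggregates a
    GROUP BY date with COUNT(DISTINCT user_id) / COUNT(*) would give."""
    visitors = defaultdict(set)
    page_views = defaultdict(int)
    for rec in log_records:
        visitors[rec["date"]].add(rec["user_id"])
        page_views[rec["date"]] += 1
    return {
        day: {"unique_visitors": len(users), "page_views": page_views[day]}
        for day, users in visitors.items()
    }

# Hypothetical parsed web-log records.
logs = [
    {"date": "2014-05-01", "user_id": "u1"},
    {"date": "2014-05-01", "user_id": "u1"},  # repeat visit, same visitor
    {"date": "2014-05-01", "user_id": "u2"},
]
metrics = daily_metrics(logs)
```

The set-per-day mirrors `COUNT(DISTINCT ...)`; at HDFS scale Hive distributes this aggregation instead of holding the sets in one process.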
- Extensive involvement in analyzing requirements and detailed system study.
- Involved in the analysis, design, and development phase of Software Development Lifecycle.
- Involved in the design and creation of Class diagrams, Sequence diagrams and Activity Diagrams using UML models.
- Involved in developing JSP pages using Struts custom tags, jQuery and Tiles Framework.
- Implemented web services using Core Java and SOAP.
- Involved in Development of Enterprise application using J2EE, Spring, JSP.
- Responsible for designing and developing persistence classes using the Spring Boot data template.
- Developed Spring Boot framework for future application development.
- Worked on SOA architecture using SOAP protocol.
- Responsible for writing and executing Oracle procedures and SQL queries.
- Used Hibernate, DAO (Data Access Object), and JDBC for data retrieval from database.
- Developed JUnit test cases for unit testing as well as system and user test scenarios.
- Involved in Unit Testing, User Acceptance Testing and Bug Fixing.
- Actively used the defect tracking tool Jira to create and track the defects during QA phase of the project.
- Followed Agile methodology to analyze, define, and document the application in support of functional and business requirements.
- Participated in the analysis, design, and development phase of Software Development Lifecycle.
- Designed and developed server side J2EE components and used Struts framework for the web application to adopt MVC architecture.
- Created SQL queries, Sequences, Views for the backend database in Oracle database.
- Extensively used Java multi-threading to implement batch jobs with JDK 1.5 features.
- Implemented clustering of Oracle and WebLogic servers to achieve high availability and load balancing.
- Configured the project on WebLogic 10.3 application server.
- Used the Log4j package for debug, info, and error tracing.
- Deployed applications on JBoss 4.0 server.
- Used Oracle 10g as the backend database and written PL/SQL scripts.
- Involved in creation of SQL scripts to create, update and delete data from the tables.
- Involved in collecting client requirements and preparing the design documents.
- Created new and maintained existing web pages built in JSP, Servlet.
Environment: Java, J2EE, Servlets, Struts 1.1, JSP, WebLogic, Oracle, CVS, PL/SQL, Eclipse, Linux.