- 8+ years of practical software development experience, including 4+ years as a Hadoop developer working with Big Data/Hadoop/Spark technologies.
- Experience in developing applications that perform large scale distributed data processing using big data ecosystem tools like HDFS, YARN, Sqoop, Flume, Kafka, MapReduce, Pig, Hive, Spark, Spark SQL, Spark Streaming, HBase, Cassandra, MongoDB, Mahout, Oozie, and AWS.
- Good functional experience with various Hadoop distributions such as Hortonworks, Cloudera, and Amazon EMR.
- Good understanding of data ingestion tools such as Kafka, Sqoop, and Flume.
- Experienced in performing in-memory, real-time data processing using Apache Spark.
- Good experience in developing multiple Kafka Producers and Consumers as per business requirements.
- Extensively worked on Spark components like SparkSQL, MLlib, GraphX, and Spark Streaming.
- Configured Spark Streaming to receive real time data from Kafka and store the stream data to HDFS and process it using Spark and Scala.
- Developed quality code adhering to Scala coding standards and best practices.
- Experience in migrating MapReduce programs to Spark RDD transformations and actions to improve performance.
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Extensive working experience with data warehousing technologies such as Hive.
- Good experience with partitioning and bucketing concepts; designed and managed partitions and created external tables in Hive to optimize performance.
- Great experience in data analysis using HiveQL, Pig Latin, HBase, and custom MapReduce programs in Java.
- Expertise in writing Hive and Pig queries for data analysis to meet the business requirement.
- Extensively worked on Hive and Sqoop for sourcing and transformations.
- Extensive work experience in creating UDFs, UDAFs in Pig and Hive.
- Good experience in using Impala for data analysis.
- Experience on NoSQL databases such as HBase, Cassandra, MongoDB, and DynamoDB.
- Implemented CRUD operations using CQL against Cassandra tables.
- Experience in creating data models for clients' transactional logs and analyzing data in Cassandra tables for quick searching, sorting, and grouping using the Cassandra Query Language (CQL).
- Expert knowledge on MongoDB data modeling, tuning, disaster recovery and backup.
- Experience in monitoring document growth and estimating storage size for large MongoDB clusters based on data life-cycle management.
- Hands-on experience with ad-hoc queries, indexing, replication, load balancing, and aggregation in MongoDB.
- Expertise in relational databases like MySQL, SQL Server, DB2, and Oracle.
- Good understanding of Solr for building search over unstructured data in HDFS.
- Experience in cloud platforms like AWS, Azure.
- Hands-on exposure to the AWS command-line interface and AWS Data Pipeline.
- Extensively worked on AWS services such as EC2, S3, EMR, CloudFormation, CloudWatch, and Lambda.
- Expertise in writing MapReduce programs in Java for data extraction, transformation, and aggregation across file formats such as XML, JSON, CSV, Avro, and Parquet.
- Good knowledge of Hadoop security requirements and of integrating with Kerberos authentication and authorization infrastructure.
- Experience with the ELK stack and Solr for building search engines over unstructured data in HDFS.
- Implemented ETL operations on Big Data platforms.
- Experience in configuring ZooKeeper to coordinate servers in clusters and maintain data consistency.
- Involved in identifying job dependencies to design workflows for Oozie and YARN resource management.
- Experience working with Core Java, J2EE, JDBC, ODBC, JSP, Java Eclipse, EJB and Servlets.
- Experience in using bug tracking and ticketing systems such as JIRA, and Remedy.
- Hands-on experience with build and test tools such as Maven, Ant, and JUnit.
- Highly involved in all facets of the SDLC using Waterfall and Agile/Scrum methodologies.
- Strong experience with data warehousing ETL concepts using Informatica and Talend.
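The MapReduce-to-Spark RDD migration noted above amounts to re-expressing the two MapReduce phases as chained transformations. A minimal, cluster-free sketch in pure Python (the input lines are hypothetical sample data):

```python
# Pure-Python sketch of the word-count pattern behind a MapReduce-to-Spark
# migration: an explicit map phase emitting (word, 1) pairs, followed by a
# reduceByKey-style aggregation. No Hadoop or Spark cluster required.
from collections import defaultdict
from itertools import chain

lines = ["spark streaming from kafka", "kafka to spark"]

# Map phase: emit one (word, 1) pair per token, as a MapReduce mapper would.
mapped = list(chain.from_iterable(
    ((word, 1) for word in line.split()) for line in lines))

# Reduce phase: group by key and sum, the analogue of Spark's reduceByKey.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))
```

In a real Spark port the same two steps become `rdd.flatMap(...).map(lambda w: (w, 1)).reduceByKey(add)`, executed lazily across partitions rather than eagerly in one process.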
Big Data/Hadoop: HDFS, MapReduce, Pig, Hive, Spark, Kafka, Flume, Sqoop, Impala, Oozie, ZooKeeper, YARN, Hue.
Hadoop Distributions: Cloudera (CDH4, CDH5), Hortonworks, EMR
Programming Languages: C, Java, Python, Scala.
Database/NoSQL: HBase, Cassandra, MongoDB, MySQL, Oracle, DB2, PL/SQL, Microsoft SQL Server
Cloud Services: AWS, Azure
Frameworks: Spring, Hibernate, Struts
Java Technologies: Servlets, JavaBeans, JSP, JDBC, EJB
Application Servers: Apache Tomcat, Web Sphere, WebLogic, JBoss
ETL Tools: Informatica, Talend
Confidential, Austin, TX
Sr. Hadoop/Spark Developer
- Worked on Kafka and Spark integration for real time data processing.
- Responsible for design and deployment of Spark SQL scripts and Scala shell commands based on functional specifications.
- Used Kafka for log aggregation to collect physical log files from servers and put them in the HDFS for further processing.
- Configured, designed, implemented, and monitored Kafka clusters and connectors.
- Developed Kafka producer and consumer components for real time data processing.
- Implemented Spark applications using Scala and the Spark SQL API for faster data processing.
- Used Spark for interactive queries and stream processing, integrating with Cassandra to handle huge volumes of data.
- Worked with Spark to create structured data from the pool of unstructured data received.
- Wrote Spark scripts to accept events from Kafka producers and emit them into Cassandra.
- Performed unit testing using JUnit.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Worked on AWS cloud services like EC2, S3, EBS, RDS and VPC.
- Wrote Java code to format XML documents and upload them to the Solr server for indexing.
- Analyzed the data by performing Hive queries on the existing database. Designed and implemented partitioning (static and dynamic) and buckets in Hive.
- Created Hive generic UDFs to process business logic that varies based on policy.
- Loaded and transformed large sets of structured and semi-structured data using Hive.
- Involved in creating Hive tables, loading the data, and writing Hive queries that run internally as MapReduce jobs.
- Extended Hive and Pig core functionality by writing custom UDFs.
- Migrated MapReduce jobs to Spark RDD transformations on AWS EMR.
- Collected data in near real time from an AWS S3 bucket using Spark Streaming and performed the necessary transformations and aggregations to build the data model, persisting the data in HDFS.
- Implemented intermediate functionality such as event and record counts from Kafka topics by writing Spark programs in Java and Scala.
- Worked on Cassandra in creating Cassandra tables to load large set of semi structured data coming from various sources.
- Involved in Cassandra data modeling to create keyspaces and tables in a multi-data-center DSE Cassandra database.
- Ingested the data from Relational databases such as MySQL, Oracle, and DB2 to HDFS using Sqoop.
- Involved in identifying job dependencies to design workflow for Oozie and YARN resource management.
- Worked on Talend with Hadoop and improved the performance of the Talend jobs.
- Designed and developed various SSIS (ETL) packages to extract and transform data, and was involved in scheduling SSIS packages.
- Set up Solr for searching and routing the log data.
- Extensively used ZooKeeper as a job scheduler for Spark jobs.
- Added security to the cluster by integrating Kerberos.
- Understanding of Kerberos authentication in Oozie workflows for Hive and Cassandra.
- Used container technology such as Docker along with Mesos and Aurora to manage whole clusters of hosts.
- Created Tableau visualization for the internal management.
- Experience with various compression techniques such as LZO, gzip, and Snappy.
- Used Jira for ticket tracking and workflow management.
- Involved in sprint planning, code review and daily standup meetings to discuss the progress of the application.
- Effectively followed Agile Scrum methodology to design, develop, deploy, and support solutions that leverage the client's big data platform.
Environment: Apache Spark, Scala, Hive, Cloudera, Apache Kafka, Sqoop, Cassandra, MySQL, Oracle, DB2, Spark Streaming, Java, Python, Agile, Talend, AWS-(EC2, S3, EBS, RDS, VPC), ETL, Tableau, Kerberos, Jira, Mesos, Solr.
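The per-topic event and record counting performed in the Spark programs above can be illustrated with a cluster-free Python sketch; the topic names and records below are hypothetical stand-ins for messages consumed from Kafka:

```python
# Cluster-free sketch of per-topic record counting; the records below are
# hypothetical stand-ins for messages consumed from Kafka topics.
from collections import Counter

consumed = [
    {"topic": "orders", "value": "order-1001"},
    {"topic": "clicks", "value": "click-17"},
    {"topic": "orders", "value": "order-1002"},
]

# Count records per topic - the intermediate metric the Spark jobs emitted.
records_per_topic = Counter(msg["topic"] for msg in consumed)
print(records_per_topic)
```

In the actual jobs, the same grouping-and-counting is distributed across executors (e.g. a `reduceByKey` over `(topic, 1)` pairs) rather than computed in a single process.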
Confidential, Cincinnati, OH
- Handled importing of data from various data sources into HDFS, and performed transformations using Hive.
- Continuously monitored and managed the Hadoop cluster through HDP (Hortonworks Data Platform).
- Used Flume to stream through the log data from various sources.
- Configured Flume to extract the data from the web server output files to load into HDFS.
- Involved in loading data from UNIX/LINUX file system and FTP to HDFS.
- Developed and implemented Hive queries and functions for evaluation, filtering, and sorting of data.
- Analyzed the data by performing Hive queries and running Pig scripts.
- Wrote Hive jobs to parse the logs and structure them in tabular format to facilitate effective querying of the log data.
- Handled different types of joins in Hive, such as inner join, left outer join, right outer join, and full outer join.
- Involved in developing Impala scripts for ad-hoc queries.
- Defined Accumulo tables and loaded data into tables for near real-time reports.
- Created Hive external tables using the Accumulo connector.
- Developed simple to complex MapReduce jobs using Hive, Pig, and Python.
- Optimized the Hive queries using Partitioning and Bucketing techniques for controlling the data distribution.
- Supported the existing MapReduce programs running on the cluster.
- Worked with Linux systems and RDBMS database on a regular basis to ingest data using Sqoop.
- Spun up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
- Developed Pig scripts and UDFs as per the business logic.
- Worked on the NoSQL database MongoDB for storing images and URIs.
- Managed and reviewed Hadoop and MongoDB log files.
- Performed data analysis on MongoDB using Hive external tables and exported the analyzed data using Sqoop to generate reports for the BI team.
- Set up Elasticsearch and Logstash for searching and routing the log data.
- Designed and implemented Spark jobs to support distributed data processing.
- Worked on NiFi to automate the data movement between different Hadoop systems.
- Designed and implemented custom NiFi processors that reacted to and processed data for the pipeline.
- Designed cluster coordination services using ZooKeeper.
- Used Amazon DynamoDB to gather and track event-based metrics.
- Involved in ETL process for design, development, testing and migration to production environments.
- Involved in writing the ETL test scripts and guided the testing team in executing the test scripts.
- Worked on MongoDB for distributed storage and processing.
- Used Hive to analyze partitioned and bucketed data and compute various metrics for reporting.
- Experience in processing large volumes of data and in parallel execution of processes using Talend.
- Followed Agile Scrum methodology for the entire project.
- Used Remedy for tracking the work flow and raising the requests.
Environment: HDFS, Flume, Sqoop, Hive, Pig, Oozie, Python, Shell Scripting, SQL, MongoDB, DynamoDB, Linux, Unix, NiFi, AWS (EC-2, VPC), Talend, ETL, Elastic Search, Logstash, Zookeeper, Hortonworks, Agile Scrum, Remedy.
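The log-structuring step the Hive jobs above performed can be sketched in a few lines: a raw web-server line is split into named fields, giving the tabular shape that makes the data queryable. The log format and sample line are hypothetical:

```python
# Sketch of parsing a raw web-server log line (hypothetical common-log-style
# format) into named fields - the tabular structure the Hive jobs produced
# so the log data could be queried effectively.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>\d{3})'
)

line = '10.0.0.1 - - [12/Mar/2017:10:15:32 +0000] "GET /index.html HTTP/1.1" 200'
record = LOG_PATTERN.match(line).groupdict()
print(record)
```

In the Hive version, the same extraction is typically done once at load time (or via a SerDe), after which each named field becomes a queryable column.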
Confidential, Cincinnati, OH
- Developed solutions to process data into HDFS, process within Hadoop and emit the summary results from Hadoop to downstream systems.
- Installed and configured Hadoop MapReduce, developed multiple MapReduce jobs for cleansing and preprocessing.
- Wrote Hadoop MapReduce jobs to run on Amazon EMR clusters and created workflows for running them.
- Worked on Sqoop extensively to ingest data from various source systems into HDFS.
- Imported data from different relational data sources like Oracle, MySQL to HDFS using Sqoop.
- Analyzed substantial data sets using Hive queries and Pig scripts.
- Wrote Pig scripts for sorting, joining, and grouping data.
- Integrated multiple sources of data (SQL Server, DB2, MySQL) into Hadoop cluster and analyzed data by Hive-HBase integration.
- Involved in writing optimized Pig Script along with developing and testing Pig Latin Scripts.
- Worked on custom Pig loaders and storage classes to handle a variety of data formats such as JSON and compressed CSV.
- Played a major role in working with the team to leverage Sqoop for extracting data from Oracle.
- Solved the small-file problem using SequenceFile processing in MapReduce.
- Implemented counters on HBase data to count total records on different tables.
- Created HBase tables to store variable data formats coming from different portfolios. Performed real-time analytics on HBase using the Java API and REST API.
- Developed different kind of custom filters and handled pre-defined filters on HBase data using API.
- Experienced with different scripting languages such as Python and shell scripts.
- Oozie and ZooKeeper were used to automate job flow and cluster coordination, respectively.
- Worked on different file formats such as text files, Parquet, SequenceFiles, Avro, and record columnar (RC) files.
- Understood complex data structures of different types (structured, semi-structured) and de-normalized them for storage in Hadoop.
- Experienced with working on Avro Data files using Avro Serialization system.
- Kerberos security was implemented to safeguard the cluster.
Environment: HDFS, Pig, MapReduce, Sqoop, Oozie, Zookeeper, HBase, Java Eclipse, Python, MySQL, Oracle, SQL Server, DB2, Shell Scripting, Kerberos, EMR.
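The SequenceFile fix for the small-file problem mentioned above can be illustrated without a cluster: many small inputs are packed into one container of (filename, contents) records, which a single mapper can then stream through. The file names and contents below are hypothetical:

```python
# Illustrative sketch of the small-file fix: many small inputs are packed into
# one container of (key, value) records keyed by filename - the layout a Hadoop
# SequenceFile presents to a single mapper. Paths and contents are hypothetical.
small_files = {
    "logs/part-0001.txt": b"alpha",
    "logs/part-0002.txt": b"beta",
}

# Pack: one (key, value) record per small file, keyed by its original path.
packed = sorted(small_files.items())

# One consumer now streams every record without per-file open() overhead,
# and the NameNode tracks one file's metadata instead of thousands.
total_bytes = sum(len(contents) for _, contents in packed)
print(len(packed), total_bytes)
```

The design win is the same in both cases: one large, splittable container replaces many tiny files, so task startup cost and NameNode metadata pressure stop scaling with the number of small inputs.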