- 9 years of professional IT experience, including 5 years in Big Data Hadoop development and data analytics, along with design and development of Java-based enterprise applications.
- Very strong knowledge of Hadoop ecosystem components such as HDFS, MapReduce, Spark, Hive, Pig, Sqoop, Impala, Flume, Kafka, Oozie, and HBase, with Scala as the primary Spark development language.
- Strong knowledge of distributed systems architecture and parallel processing frameworks.
- In-depth understanding of the Spark execution model and the internals of the MapReduce framework.
- Expertise in developing production-ready Spark applications using the Spark Core, DataFrame, Spark SQL, Spark ML, and Spark Streaming APIs (see the sketch following this list).
- Experience with multiple Hadoop distributions, including Cloudera (CDH 3, 4, and 5) and Hortonworks (HDP).
- Worked extensively on fine-tuning resources for long-running Spark applications to achieve better parallelism and to allocate executor memory for additional caching.
- Strong experience with both batch and real-time processing using Spark.
- Proficient in Apache Spark and Scala programming for analyzing large datasets, and in Storm for processing real-time data.
- Experience developing Pig Latin scripts and writing Hive Query Language (HiveQL) queries.
- Strong knowledge of performance-tuning Hive queries and troubleshooting issues with joins and memory exceptions in Hive.
- Very good understanding of partitioning and bucketing concepts in Hive; designed both internal (managed) and external Hive tables to optimize performance (a DDL sketch follows this list).
- Strong experience with big-data file formats, including the row-based Avro format and the columnar RCFile, ORC, and Parquet formats.
- Hands-on experience installing, configuring, and deploying Hadoop distributions in cluster environments on Amazon Web Services.
- Experience optimizing MapReduce jobs using combiners and custom partitioners.
- Experience with NoSQL databases such as the column-oriented HBase and Apache Cassandra and the document-oriented MongoDB, and with their integration with Hadoop clusters.
- Expertise in back-end/server-side Java technologies such as web services, the Java Persistence API (JPA), the Java Message Service (JMS), and Java Database Connectivity (JDBC).
- Experienced with scripting languages such as Python and shell scripts.
- Experienced in collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
- Extensive experience with ETL processes, including data sourcing, mapping, transformation, conversion, and loading.
- In-depth understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce programming paradigm.
- Worked with Sqoop to move (import/export) data between relational databases and Hadoop.
- Knowledge of UNIX shell scripting for automating deployments and other routine tasks.
- Experienced in agile methodologies, including Extreme Programming (XP), Scrum, and Test-Driven Development (TDD).
- Used custom SerDes (RegexSerDe, JSON SerDe, CSVSerde, etc.) in Hive to handle multiple data formats.
- Extensive experience developing enterprise solutions using Java, J2EE, Servlets, JSP, JDBC, Struts, Spring, Hibernate, JavaBeans, JSF, and MVC.
- Experience building and deploying web applications on multiple application servers and middleware platforms, including WebLogic, WebSphere, Apache Tomcat, and JBoss.
- Experience using version control tools such as Bitbucket, Git, and SVN.
- Experience writing build scripts using Maven, Ant, and Gradle.
- Flexible, enthusiastic, project-oriented team player with excellent communication and leadership skills, able to develop creative solutions for challenging client requirements.
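As a concrete illustration of the Spark Core/DataFrame/Spark SQL work above, here is a minimal sketch; the dataset path, column names, and application name are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TxnSummary {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TxnSummary")
      .enableHiveSupport()
      .getOrCreate()

    // Read a hypothetical Parquet dataset of transactions.
    val txns = spark.read.parquet("hdfs:///data/transactions")

    // DataFrame API: aggregate amounts per account.
    val byAccount = txns
      .filter(col("amount") > 0)
      .groupBy("account_id")
      .agg(sum("amount").as("total_amount"))

    // Equivalent Spark SQL over a temporary view.
    txns.createOrReplaceTempView("txns")
    val bySql = spark.sql(
      "SELECT account_id, SUM(amount) AS total_amount FROM txns WHERE amount > 0 GROUP BY account_id")
    bySql.show(5)

    byAccount.write.mode("overwrite").parquet("hdfs:///out/txn_summary")
    spark.stop()
  }
}
```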
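And a sketch of the internal/external table design with partitioning and bucketing: an illustrative Hive DDL, submitted here through Spark SQL for consistency with the Scala examples (it could equally be run from the Hive CLI); all table, column, and path names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SalesDDL")
  .enableHiveSupport() // executes Hive DDL against the metastore
  .getOrCreate()

// External table: Hive owns only the metadata; data stays at the given path.
// Partitioning by sale_date prunes scans; bucketing by order_id supports
// sampling and bucketed joins.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS sales (
    order_id BIGINT,
    amount   DOUBLE
  )
  PARTITIONED BY (sale_date STRING)
  CLUSTERED BY (order_id) INTO 32 BUCKETS
  STORED AS ORC
  LOCATION 'hdfs:///warehouse/external/sales'
""")
```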
Big Data Ecosystems: HDFS, MapReduce, YARN, Hive, Storm, Sqoop, Pig, Spark, HBase, Impala, Flume, ZooKeeper, Oozie
NoSQL Databases: HBase, Cassandra, MongoDB
AWS Technologies: Data Pipeline, Redshift, EMR
Languages: Java, Scala, Python, SQL, Pig Latin, HiveQL, Shell Scripting
Databases: Microsoft SQL Server, MySQL, Oracle, DB2
Web/Application Servers: WebLogic, WebSphere, JBoss, Tomcat
IDEs & Utilities: Eclipse, JCreator, NetBeans
Operating Systems: UNIX, Linux, Windows, macOS
Data Visualization Tools: Tableau, Power BI, Apache Zeppelin
Development Methodologies: Agile (Scrum), V-Model, Waterfall
Sr. Hadoop/Spark Developer
- Examined transaction data, identified outliers and inconsistencies, and cleansed the data to ensure data quality and integrity.
- Developed data pipeline using Sqoop, Spark and Hive to ingest, transform and analyze operational data.
- Used Spark SQL with Scala to create DataFrames and perform transformations on them.
- Developed custom multi-threaded Java-based ingestion jobs, as well as Sqoop jobs, for ingesting data from FTP servers and data warehouses.
- Implemented Spark jobs in Scala, using the DataFrame and Spark SQL APIs for faster data processing.
- Streamed data in real time using Spark and Kafka.
- Troubleshot Spark applications to make them more fault-tolerant.
- Fine-tuned Spark applications to improve overall pipeline processing time.
- Wrote Kafka producers to stream data from external REST APIs into Kafka topics (a producer sketch follows this list).
- Wrote Spark Streaming applications to consume data from Kafka topics and write the processed streams to HBase (see the streaming sketch after this list).
- Experienced in handling large datasets with Spark's in-memory capabilities, using broadcast variables, effective and efficient joins, and other transformations (a broadcast-join sketch follows this list).
- Experience with Kafka, sustaining reads and writes of thousands of megabytes per second on streaming data.
- Used the Spark API on Cloudera Hadoop YARN to perform analytics on data stored in Hive.
- Worked extensively with Sqoop for importing data from Oracle.
- Experience working with EMR clusters in the AWS cloud and with S3.
- Involved in creating Hive tables, loading and analyzing data using Hive scripts.
- Created Hive tables with dynamic partitions and buckets for sampling, and worked on them using HiveQL.
- Built applications using Maven and integrated them with continuous integration servers such as Jenkins.
- Created documentation for data flows and ETL processes using Informatica mappings to support the project once it was completed in production.
- Performed data migration from legacy RDBMS databases to HDFS using Sqoop.
- Performed tuning and increased operational efficiency on a continuous basis.
- Worked with Spark SQL, reading and writing data from JSON, text, and Parquet files and working with schema RDDs.
- Worked on proofs of concept with Apache Spark, using Scala, to adopt Spark in the project.
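A minimal sketch of a Kafka producer that forwards payloads pulled from a REST API, as described above; the broker address, endpoint URL, and topic name are placeholders:

```scala
import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object RestToKafka {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    try {
      // Poll a hypothetical REST endpoint and forward the JSON payload.
      val payload = Source.fromURL("https://api.example.com/transactions").mkString
      producer.send(new ProducerRecord[String, String]("transactions", payload))
      producer.flush()
    } finally {
      producer.close()
    }
  }
}
```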
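A sketch of the Kafka-to-HBase Spark Streaming consumer, assuming the spark-streaming-kafka-0-10 integration and HBase 1.x client APIs; topic, table, and column-family names are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object KafkaToHBase {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaToHBase"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092", // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "txn-consumers",
      "auto.offset.reset" -> "latest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("transactions"), kafkaParams))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One HBase connection per partition avoids per-record overhead.
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("txn"))
        records.foreach { rec =>
          // Fall back to partition-offset when the record has no key.
          val rowKey = Option(rec.key()).getOrElse(s"${rec.partition()}-${rec.offset()}")
          val put = new Put(Bytes.toBytes(rowKey))
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(rec.value()))
          table.put(put)
        }
        table.close(); conn.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```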
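And a small sketch of the broadcast-variable/broadcast-join technique mentioned above, with made-up data: broadcasting the small dimension table keeps the large fact table from being shuffled.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("BroadcastJoin").getOrCreate()
import spark.implicits._

// Large fact data and a small dimension table (illustrative values).
val txns = Seq((1, 100.0), (2, 250.0)).toDF("account_id", "amount")
val accounts = Seq((1, "retail"), (2, "wholesale")).toDF("account_id", "segment")

// Broadcasting the small side avoids shuffling the large side.
val enriched = txns.join(broadcast(accounts), Seq("account_id"))
enriched.show()
```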
Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Kafka, Hive, Sqoop, AWS, HBase, Teradata, Informatica PowerCenter, Tableau, Oozie, Oracle, Linux
Confidential, Ann Arbor, MI
Sr. Hadoop/Scala Developer
- Used Cloudera distribution extensively.
- Converted existing MapReduce jobs into Spark transformations and actions using the Spark DataFrame and Spark SQL APIs.
- Developed Spark programs for batch processing.
- Wrote new Spark jobs in Python to analyze customer and sales history data.
- Worked on Spark SQL and Spark Streaming.
- Worked with different data formats such as JSON and XML, and implemented machine learning algorithms in Python.
- Worked on reading multiple data formats from HDFS using Scala (see the multi-format sketch after this list).
- Created end-to-end Spark-Solr applications in Scala to perform data cleansing, validation, transformation, and summarization activities per the requirements.
- Developed Spark scripts using Scala shell (spark-shell) commands as required.
- Used Kafka to get data from many streaming sources into HDFS.
- Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Good experience with Hive partitioning, bucketing, and collection types, and with performing different types of joins on Hive tables.
- Used Slick to query and persist data in an idiomatic Scala style built on the Scala collections framework (a Slick sketch follows this list).
- Created Hive external tables to perform ETL on data generated on a daily basis.
- Wrote HBase bulk-load jobs that convert processed data to HFiles and load them into HBase tables.
- Performed validation on the data ingested to filter and cleanse the data in Hive.
- Created Sqoop jobs to handle incremental loads from RDBMS sources into HDFS and applied Spark transformations on the result.
- Implemented Spark SQL to access Hive tables from Spark for faster data processing.
- Loaded data from Spark into Hive tables using the Parquet columnar format.
- Developed Oozie workflows to automate and productionize the data pipelines.
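A sketch of reading several formats from HDFS in Scala, as mentioned above; paths are placeholders, and XML support assumes the external spark-xml package (com.databricks:spark-xml):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MultiFormatRead").getOrCreate()

// Paths and row tags are illustrative placeholders.
val jsonDf    = spark.read.json("hdfs:///landing/events_json")
val parquetDf = spark.read.parquet("hdfs:///landing/events_parquet")
val textRdd   = spark.sparkContext.textFile("hdfs:///landing/events_text")

// XML goes through the external spark-xml data source.
val xmlDf = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "event")
  .load("hdfs:///landing/events_xml")

println(s"json=${jsonDf.count()} parquet=${parquetDf.count()} text=${textRdd.count()}")
```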
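A minimal Slick sketch in the collection-like style referenced above, assuming Slick 3 with a hypothetical MySQL "customers" table; connection details are placeholders:

```scala
import scala.concurrent.Await
import scala.concurrent.duration._
import slick.jdbc.MySQLProfile.api._

// A hypothetical "customers" table mapped as a Slick Table.
class Customers(tag: Tag) extends Table[(Int, String)](tag, "customers") {
  def id   = column[Int]("id", O.PrimaryKey)
  def name = column[String]("name")
  def *    = (id, name)
}

object SlickExample extends App {
  val customers = TableQuery[Customers]
  val db = Database.forURL(
    "jdbc:mysql://localhost:3306/sales", // placeholder connection
    user = "app", password = "secret", driver = "com.mysql.jdbc.Driver")

  // Queries compose like Scala collections: filter, map, etc.
  val recentNames = customers.filter(_.id > 100).map(_.name)
  val names = Await.result(db.run(recentNames.result), 30.seconds)
  names.foreach(println)

  db.close()
}
```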
Environment: Hadoop, Hive, Flume, Shell Scripting, Java, Eclipse, HBase, Kafka, Spark, Spark Streaming, Python, Oozie, HQL/SQL, Teradata.
Confidential, San Mateo, CA
- Performed aggregations and analysis on large sets of log data; collected the log data using custom-built input adapters and Sqoop.
- Developed MapReduce programs for data extraction, transformation and aggregation.
- Monitored and troubleshot MapReduce jobs running on the cluster.
- Implemented solutions for ingesting data from various sources and processing it using Hadoop services such as Sqoop, Hive, Pig, HBase, and MapReduce.
- Worked on creating combiners, partitioners, and distributed caches to improve the performance of MapReduce jobs (a partitioner/combiner sketch follows this list).
- Wrote Pig scripts, which compile down to MapReduce jobs, to perform ETL procedures on data in HDFS.
- Experienced in handling Avro data files by passing their schemas into HDFS using Avro tools and MapReduce.
- Optimized MapReduce algorithms using combiners and partitioners to deliver the best results, and worked on application performance optimization for an HDFS cluster.
- Orchestrated Sqoop scripts, Pig scripts, and Hive queries using Oozie workflows and sub-workflows.
- Used Flume to collect and aggregate web log data from different sources, such as web servers, and pushed it to HDFS.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run as MapReduce jobs behind the scenes.
- Created HBase tables to load large sets of structured, semi-structured, and unstructured data coming from UNIX systems, NoSQL stores, and a variety of portfolios.
- Debugged MapReduce jobs using the MRUnit framework and optimized them (an MRUnit sketch follows this list).
- Involved in troubleshooting errors in Shell, Hive and MapReduce.
- Worked on debugging, performance tuning of Hive & Pig jobs.
- Designed and implemented distributed processing jobs using MapReduce, Hive, and Apache Pig.
- Created Hive external tables over the MapReduce output, then applied partitioning and bucketing on top of them.
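A sketch of the combiner/custom-partitioner technique described above, using the Hadoop MapReduce v2 API from Scala; the region-prefixed keys (e.g. "US-001") are a hypothetical example:

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Partitioner, Reducer}

// Route keys by a hypothetical region prefix ("US-...", "EU-...") so
// related records land on the same reducer.
class RegionPartitioner extends Partitioner[Text, IntWritable] {
  override def getPartition(key: Text, value: IntWritable, numPartitions: Int): Int = {
    val region = key.toString.takeWhile(_ != '-')
    (region.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}

// A sum reducer that can also serve as a combiner to pre-aggregate map output.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}

// Driver wiring (inside job setup):
//   job.setPartitionerClass(classOf[RegionPartitioner])
//   job.setCombinerClass(classOf[SumReducer])
//   job.setReducerClass(classOf[SumReducer])
```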
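And a matching MRUnit test sketch for the reducer above, assuming MRUnit's mapreduce (new-API) drivers and JUnit:

```scala
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver
import org.junit.Test

class SumReducerTest {
  // Verifies that the SumReducer sketched above aggregates counts per key.
  @Test
  def sumsValuesForAKey(): Unit = {
    ReduceDriver.newReduceDriver(new SumReducer)
      .withInput(new Text("US-001"),
        java.util.Arrays.asList(new IntWritable(2), new IntWritable(3)))
      .withOutput(new Text("US-001"), new IntWritable(5))
      .runTest()
  }
}
```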
Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, HBase, Oozie, MySQL, SVN, PuTTY, ZooKeeper, UNIX, Shell Scripting, HiveQL, NoSQL (HBase), RDBMS, Eclipse, Oracle 11g