- Around 8 years of experience across the complete software development life cycle (SDLC), including 4 years of Hadoop development and administration and 4 years of core Java development and testing.
- Hands-on experience with Apache Spark (Python and Scala) and Hadoop ecosystem components including MapReduce (MRv1 and YARN), Sqoop, Hive, Oozie, Flume, Kafka, ZooKeeper, and NoSQL databases such as HBase.
- Excellent knowledge of Spark Core architecture.
- Hands-on expertise in writing RDD (Resilient Distributed Dataset) transformations and actions using Scala, Python, and Java.
- Created DataFrames and performed analysis using Spark SQL.
- Knowledge on Spark Streaming and Spark Machine Learning Libraries.
- Worked with file formats such as SequenceFile; serialized data in Avro and Parquet formats and managed Hadoop log files.
- Used Sqoop to import data from relational databases (RDBMS) into HDFS and Hive, storing it in formats such as Text, Avro, Parquet, SequenceFile, and ORC with compression codecs such as Snappy and Gzip.
- Performed transformations on the imported data and exported the results back to the RDBMS.
- Experience writing HQL (Hive Query Language) queries to perform data analysis.
- Created Hive External and Managed Tables.
- Implemented Partitioning and Bucketing on Hive tables for Hive Query Optimization.
- Implemented ACID transactions (insert, update, delete) on transactional Hive tables.
- Converted Parquet files to ORC format for more efficient storage and processing.
- Experienced in writing Oozie workflows and coordinator jobs to schedule sequential Hadoop jobs.
- Used Apache Flume to ingest data from different sources to sinks like HDFS.
- Implemented custom Flume interceptors to filter data and defined channel selectors to multiplex data into different sinks.
- Excellent knowledge of Kafka architecture.
- Integrated Flume with Kafka, using Flume as both producer and consumer (the Flafka pattern).
- Used Kafka for activity tracking and log aggregation.
- Good understanding of Relational Databases like MySQL.
- Ability to write complex SQL queries to analyze structured data.
- Worked with the NoSQL database HBase: designed table row keys, loaded and retrieved data for real-time processing via the HBase API, and made performance improvements based on data access patterns.
- Knowledge on other NoSQL databases like MongoDB, Cassandra.
- Experienced in using GIT, Code Cloud, SVN.
- Experienced with build tools such as Apache Maven and SBT.
- Excellent knowledge of object-oriented analysis and design; adept at analyzing user requirements and applying design patterns.
- Designed and developed Java enterprise and web applications using Java, JDBC API.
- Utilized the concepts of Multi-threaded programming in developing applications.
- Implemented unit test cases and documented all the code and applications.
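The HBase row-key design work mentioned above can be illustrated with a minimal sketch; the salting scheme, bucket count, and key layout are assumptions for illustration, not the actual production design:

```python
import hashlib

# Hedged sketch: a salted composite row key that spreads sequential writes
# across regions (avoiding hot-spotting) while keeping a user's rows together.
BUCKETS = 16  # illustrative salt/bucket count

def row_key(user_id, timestamp_ms):
    """Build '<salt>|<user>|<reversed timestamp>' so newest rows sort first."""
    salt = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % BUCKETS
    reverse_ts = 2**63 - 1 - timestamp_ms  # larger timestamp -> smaller key
    return f"{salt:02d}|{user_id}|{reverse_ts}"
```

Because HBase stores rows in lexicographic key order, the reversed timestamp makes a user's most recent events the cheapest to scan.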
Java, Scala, Linux, AWS
Hadoop: Apache Spark, MapReduce, Hive, Scala, Python, Kafka, Flume, Airflow, Oozie, Sqoop, Jenkins.
Distributions: Cloudera, Hortonworks, Amazon EMR, Google Dataproc.
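As a sketch of the RDD transformations and actions listed in the summary, in PySpark style; the HDFS path and record layout are invented for illustration:

```python
# Hedged sketch: chaining RDD transformations and ending with an action.
def parse_event(line):
    """Turn a 'user,action,bytes' log line into a (user, bytes) pair."""
    user, _, size = line.split(",")
    return (user, int(size))

def bytes_per_user(sc):
    """Requires a live SparkContext; everything before collect() is lazy."""
    return (sc.textFile("hdfs:///logs/events.csv")  # source RDD
              .map(parse_event)                     # transformation
              .filter(lambda kv: kv[1] > 0)         # transformation
              .reduceByKey(lambda a, b: a + b)      # transformation
              .collect())                           # action triggers the job
```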
Confidential, Chicago, IL
Sr Hadoop Developer
- Responsible for implementing a generic framework to handle different data-collection methodologies from the client's primary data sources, validating and transforming the data using Spark (Scala), and loading it into S3.
- Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Developed Spark code in Scala with Spark SQL/Streaming for faster testing and processing of data.
- Migrated historical data to S3 and developed a reliable mechanism for processing the incremental updates.
- Used the Airflow workflow engine to manage independent Hadoop jobs and to automate several types of Hadoop jobs, such as Java MapReduce, Hive, and Sqoop, as well as system-specific jobs.
- Set up various build triggers, including gated check-in and continuous integration. Responsible for implementing the continuous integration (CI) and continuous delivery (CD) process using Jenkins along with shell scripts to automate routine jobs.
- Coordinated and assisted in establishing and applying appropriate branching and labeling/naming conventions using Subversion (SVN) and Code Cloud source control.
- Worked with IT marketing analytics to assist with data-related technical issues and provide support.
- Created and populated bucketed tables in Hive to enable faster map-side joins, more efficient jobs, and more efficient sampling; also partitioned data to optimize Hive queries.
- Implemented curated data store logic using Spark Scala and DataFrame concepts.
- Explored Spark for improving the performance and optimizing existing Hadoop algorithms, using Spark Context, Spark SQL, and Spark on YARN.
- Implemented Spark Scala code for data validation in Hive.
- Worked on Parquet file format to perform business required jobs.
- Used Spark SQL functions to move data from stage Hive tables to fact and dimension tables.
- Implemented dynamic partitioning on Hive tables and used appropriate file formats and compression techniques to improve the performance of Hive jobs.
- Worked with Data Engineering Platform team to plan and deploy new Hadoop Environments and expand existing Hadoop clusters.
Tools and Technologies: Amazon EMR, Amazon S3, Spark (Python and Scala), Hive, Sqoop, Airflow, Design Patterns, SFTP, Code Cloud, Jira, Bash.
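The dynamic partitioning and bucketing described above might look like the following HiveQL sketch; the table names, columns, and bucket count are placeholders, not the actual project schema:

```sql
-- Hedged sketch: a partitioned, bucketed ORC table with Snappy compression.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE learner_events (
  user_id STRING,
  event   STRING
)
PARTITIONED BY (event_date STRING)          -- dynamic partition column
CLUSTERED BY (user_id) INTO 32 BUCKETS      -- enables map-side joins and sampling
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Partition values are taken from the last SELECT column at load time.
INSERT OVERWRITE TABLE learner_events PARTITION (event_date)
SELECT user_id, event, event_date
FROM stage_events;
```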
Confidential, Chicago, IL
Sr Hadoop Developer
- Worked with Apache Kafka to get data from web servers through Flume.
- Leveraged Flume to stream data from a spooling-directory source to an HDFS sink using the Avro protocol.
- Developed Scala scripts to parse clickstream data.
- Developed Pig UDFs for processing complex data, making use of Eval, Load, and Filter functions.
- Created Hive tables as internal or external per requirements, defined with appropriate static and dynamic partitions for efficiency.
- Implemented Hive queries using indexes and bucketing for time efficiency.
- Used the JSON and Avro SerDes packaged with Hive for serialization and deserialization when parsing the contents of streamed data.
- Implemented Oozie Coordinator to schedule the workflow, leveraging both data and time dependent properties.
- Worked closely with BI and Data Science teams to gather requirements on data.
- Debugged and troubleshot issues in MapReduce development using test frameworks such as MRUnit and JUnit.
- Used Git as Version Control System and extensively used Maven as build tool.
- Implemented Batch Data Import and worked on Stream processing using Spark Streaming.
- Developed this project using Spark in YARN mode; in-depth knowledge of standalone mode as well.
- Created RDDs from the log files and converted them to DataFrames.
- Developed Spark SQL queries to perform analysis on the log data.
- Used Hive Context to connect with Hive Metastore and write HQL queries.
Tools and Technologies: Cloudera Manager (CDH 5.10), MapReduce, HDFS, Sqoop, Pig, Hive, Oozie, Kafka, Flume, Java, Git, Maven, Jenkins, Eclipse.
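The spooling-directory-to-HDFS flow described above corresponds to a Flume agent configuration along these lines; the agent, channel, and path names are illustrative assumptions:

```properties
# Hedged sketch of a Flume agent: spooldir source -> memory channel -> HDFS sink.
agent1.sources  = spool-src
agent1.channels = mem-ch
agent1.sinks    = hdfs-snk

agent1.sources.spool-src.type     = spooldir
agent1.sources.spool-src.spoolDir = /var/log/incoming
agent1.sources.spool-src.channels = mem-ch

agent1.channels.mem-ch.type     = memory
agent1.channels.mem-ch.capacity = 10000

agent1.sinks.hdfs-snk.type                  = hdfs
agent1.sinks.hdfs-snk.channel               = mem-ch
agent1.sinks.hdfs-snk.hdfs.path             = /data/raw/%Y-%m-%d
agent1.sinks.hdfs-snk.hdfs.fileType         = DataStream
# spooldir events carry no timestamp header, so let the sink use local time
agent1.sinks.hdfs-snk.hdfs.useLocalTimeStamp = true
```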
Confidential, Minneapolis, Minnesota.
- Hands-on experience loading data from the UNIX file system into HDFS.
- Transferred data from the landing zone to HDFS using Sqoop.
- Experienced in loading and transforming large sets of structured and semi-structured data into HDFS through Sqoop for further processing.
- Designed appropriate partitioning/bucketing schemas to allow faster data retrieval during analysis using Hive.
- Involved in processing data in Hive tables with high-performance, low-latency HQL queries.
- Transferred the analyzed data from HDFS to relational databases using Sqoop, enabling the BI team to visualize analytics.
- Developed custom aggregate functions using Spark SQL and performed interactive querying.
- Managed and scheduled jobs on a Hadoop cluster using Airflow DAGs.
- Involved in creating Hive tables, loading data, and running Hive queries on that data.
- Extensive working knowledge of partitioned tables, UDFs, performance tuning, and compression-related properties in Hive.
- Worked with the Data Engineering Platform team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
- Deployed Informatica objects to the production repository.
- Monitored and debugged Informatica components in case of failures or performance issues.
Tools and Technologies: CDH 5.10, Spark (Python and Scala), Hive, Sqoop, Airflow, Design Patterns, SFTP, Code Cloud, Jira, Bash, UNIX.
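The Sqoop export of analyzed data back to a relational database (as in the bullets above) typically looks like the following; the connection string, table, and export directory are placeholders:

```shell
# Hedged sketch: export a Hive/HDFS result set to MySQL for the BI team.
sqoop export \
  --connect jdbc:mysql://db-host:3306/analytics \
  --username etl_user -P \
  --table learner_metrics \
  --export-dir /warehouse/learner_metrics \
  --input-fields-terminated-by '\t' \
  --num-mappers 4
```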
- Hands-on experience loading data from the UNIX file system into HDFS, including parallel loads.
- Worked on improving the performance of existing Pig and Hive Queries.
- Developed Oozie workflow engines to automate Hive and Pig jobs.
- Worked on performing Join operations.
- Exported the result set from Hive to MySQL using Sqoop after processing the data.
- Analyzed the data by performing Hive queries and running Pig scripts to study customer behavior.
- Used Hive to partition and bucket data.
- Performed various source data ingestions, cleansing, and transformation in Hadoop.
- Designed and developed many Spark programs using Scala.
- Produced unit tests for Spark transformations and helper methods.
- Created RDDs and pair RDDs for Spark programming.
- Implemented joins, grouping, and aggregations on the pair RDDs.
- Experienced in performance tuning of Spark applications: setting the right batch interval, choosing the correct level of parallelism, and tuning memory.
- Developed Pig Scripts to perform ETL procedures on the data in HDFS.
- Analyzed the partitioned and bucketed data and compute various metrics for reporting.
- Created HBase tables to store data arriving in various formats from different systems.
- Advanced knowledge in performance troubleshooting and tuning of Cassandra clusters.
- Analyzed the source data to assess its quality using Talend Data Quality.
- Created Scala/Spark jobs for data transformation and aggregation.
- Involved in creating Hive tables, loading with data and writing hive queries.
- Used Impala to read, write, and query Hadoop data in HDFS and Cassandra, and configured Kafka to read and write messages from external programs.
- Prepared technical architecture and low-level design documents.
- Tested raw data and executed performance scripts.
Tools and Technologies: Hadoop (CDH 5.10), Spark (Scala), Hive, Pig, Oozie, MySQL, Sqoop, Kafka, Cassandra, UNIX.
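The pair-RDD joins and aggregations above can be sketched as follows; the pure-Python helper mirrors what reduceByKey computes so it can run without a cluster, and the record layouts are invented for illustration:

```python
# Hedged sketch: pair-RDD aggregation and join, plus a local equivalent.
def sum_by_key(pairs):
    """Local mirror of rdd.reduceByKey(lambda a, b: a + b)."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals

def totals_with_names(orders_rdd, names_rdd):
    """Requires live pair RDDs of (user, amount) and (user, name)."""
    totals = orders_rdd.reduceByKey(lambda a, b: a + b)  # aggregation
    return totals.join(names_rdd)                        # pair-RDD inner join
```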
- Responsible for analyzing functional specifications and preparing technical design specifications.
- Involved in all Software Development Life Cycle (SDLC) phases of the project, from domain knowledge sharing and requirement analysis through system design, implementation, and deployment.
- Developed REST web services implementing the business logic for different functionalities in the features developed.
- Wrote JUnit test cases to test the functionality of the developed web services.
- Involved in writing the SQL queries to fetch data from database.
- Utilized Postman to verify the application's workflow, how the application changed with newly developed functionality, and the output of the web services.
- Extensively worked on both the Enterprise and Community editions of Mule ESB; experience with the Mule API, Runtime Manager, and RAML.
- Worked on the logging mechanism; the Web NMS SNMP API supports logging of SNMP requests.
- Responsible for debugging, fixing, and testing existing bugs in the application.
- Developed builds using continuous integration server Jenkins.
- Extensively used GIT for push and pull requests of the code.
- Actively participated in the daily scrum meetings and bi-weekly retro meetings for knowledge sharing.
- Wrote DAO classes using Spring and Hibernate to interact with the database for persistence.
- Used Eclipse for application development.
- Used JIRA as the task and defect tracking system.
- Followed Agile Methodologies to manage the life-cycle of the project.
- Provided daily updates, sprint review reports, and regular snapshots of project progress.
Ecosystem: Java, MySQL, Google Web Toolkit, Spring Framework, Hibernate, Eclipse, SVN, Maven, Bugzilla.
- Involved in building and implementing the application using MVC architecture with Java Spring framework.
- Used Hibernate as the object-relational mapping framework to simplify the transformation of business data between the application and the relational database.
- Used Junit as the testing framework. Involved in developing test plans and test cases. Performed unit testing for each module and prepared code documentation.
- Responsible for testing, analyzing and debugging the software.
- Applied design patterns and OO design concepts to improve the existing code base.
- Involved in documentation of the module and project. Involved in providing post-production support.
- Followed Agile Methodologies to manage the life-cycle of the project. Provided daily updates, sprint review reports, and regular snapshots of project progress.