- Overall 9+ years of experience in the IT industry, including around 7 years of extensive hands-on experience with the Hadoop ecosystem stack: HDFS, MapReduce, Sqoop, Hive, Pig, HBase, Oozie, Flume, Kafka, ZooKeeper, and Spark.
- Expertise in different Hadoop distributions, including Cloudera (CDH) and Hortonworks Data Platform (HDP).
- Comfortable working with all facets of the Hadoop ecosystem: real-time or batch, structured or unstructured data processing.
- Expertise with NoSQL databases like HBase, as well as other ecosystem components such as ZooKeeper, Oozie, Impala, Storm, Spark Streaming/SQL, Kafka, and Flume.
- Good knowledge of Amazon Web Services (AWS) concepts such as EMR and EC2, which provide fast and efficient processing for big data analytics.
- Experienced in handling analytics projects using Big Data technologies.
- Experience in ingesting data from external servers to Hadoop.
- Expertise in moving large volumes of log, streaming-event, and transactional data using Flume.
- Experience developing workflows that execute Sqoop, Pig, Hive and Shell scripts using Oozie.
- Designed and developed Spark and Spark SQL/Streaming code for faster testing and processing of data.
- Experience with Hive concepts such as static/dynamic partitioning, bucketing, managed and external tables, and join operations on tables.
- Proficient in building user-defined functions (UDFs) in Hive and Pig to analyze data and extend HiveQL and Pig Latin functionality.
- Expertise in working with Spark transformations and actions on RDDs, and with Spark SQL and DataFrames in Python.
- Expertise in implementing unified data ingestion platform using Kafka producers and consumers.
- Expertise in implementing near real-time event processing and analytics using Spark Streaming.
- Proficient with Flume topologies for data ingestion from streaming sources into Hadoop.
- Good experience with the Eclipse and NetBeans IDEs.
- Ability to adapt to evolving technology; strong sense of responsibility and accomplishment.
- Good experience with Agile methodology.
- Strong experience in all phases of the Software Development Life Cycle (SDLC), including planning, design, development, and testing of software applications.
- Excellent leadership, interpersonal, problem solving and time management skills.
- Excellent communication skills both written (documentation) and verbal (presentation).
- Very responsible and a good team player; can work independently with minimal supervision.
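The UDF bullets above can be illustrated with a minimal sketch: besides Java UDFs, Hive can stream rows through an external script via `TRANSFORM ... USING`, so a lightweight "UDF" can be a plain Python filter over stdin/stdout. The column layout (user_id, email) and the masking rule here are hypothetical.

```python
#!/usr/bin/env python
"""Minimal sketch of a Hive streaming "UDF" (used via TRANSFORM ... USING).

Hive pipes rows to stdin as tab-separated fields and reads the transformed
rows back from stdout. Column layout (user_id, email) is hypothetical.
"""
import sys


def mask_email(email):
    """Mask the local part of an email address, keeping the first character."""
    local, sep, domain = email.partition("@")
    if not sep:                      # not a well-formed address; pass through
        return email
    return local[:1] + "***@" + domain


def transform_line(line):
    """Transform one tab-separated Hive row: mask the email column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2:
        fields[1] = mask_email(fields[1])
    return "\t".join(fields)


if __name__ == "__main__":
    for row in sys.stdin:
        print(transform_line(row))
```

From HiveQL this would be invoked roughly as `SELECT TRANSFORM(user_id, email) USING 'python mask_udf.py' AS (user_id, email) FROM users;` (table and column names hypothetical).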
Programming Languages: Java (core), J2EE, UNIX Shell Scripting, Python
Web Languages: HTML, JavaScript, CSS
Hadoop Ecosystem: MapReduce, HBase, Hive, Pig, Sqoop, ZooKeeper, Oozie, Flume, Hue, Kafka, AWS EMR, Spark, Spark SQL
Database Languages: SQL, NoSQL
Databases: Oracle, MySQL
Virtualization & Cloud Tools: Amazon AWS, VMware, VirtualBox
Visualization Tools: Power BI, Tableau
Web/Application Servers: Apache Tomcat
Version Control Tools: Git and SVN
Operating Systems: Windows, Linux (Ubuntu, Red Hat, CentOS)
IDE Platforms: Eclipse, NetBeans, Visual Studio
Methodologies: Agile, SDLC
Confidential, Alpharetta, GA
Sr. Hadoop Developer
- Assessed current and future ingestion requirements, reviewed data sources and data formats, and recommended processes for loading data into Hadoop.
- Developed ETL applications using Hive, Spark, Impala, and Sqoop; automated them using Oozie workflows and shell scripts with error handling, and scheduled them using Autosys.
- Built Sqoop jobs to import massive amounts of data from relational databases (Teradata and Netezza) and back-populate it on the Hadoop platform.
- Worked on creating a common workflow to convert mainframe source files from EBCDIC to ASCII, landing them as delimited Avro files in HDFS.
- Worked on Avro and Parquet File Formats with snappy compression.
- Set up and configured AWS EMR clusters and used AWS IAM to grant fine-grained access to AWS resources.
- Developed scripts to perform business transformations on the data using Hive and Pig.
- Created Impala views on top of Hive tables for faster access when analyzing data through Hue/TOAD.
- Connected Impala to BI tools such as TOAD and SQL Assistant to help the modeling team run different risk models.
- Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
- Developed Spark programs using Scala and Spark SQL for business reports.
- Configured Spark Streaming to receive real-time data from Kafka and stored the stream data in HDFS using Scala.
- Developed BTEQ scripts for moving data from staging tables to final tables in Teradata as part of automation.
- Supported architecture and design reviews, code reviews, and best practices for implementing the Hadoop architecture.
Environment: Cloudera (CDH4/CDH5), HDFS, MapReduce, Hive, Pig, Sqoop, Oozie, Impala, Spark, Kafka, Teradata, Linux, Java, Eclipse, SQL Assistant, TOAD
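The EBCDIC-to-ASCII bullet above is, at its core, a codec conversion: Python ships EBCDIC codecs, and cp037 (US EBCDIC) is assumed here since the actual mainframe code page isn't stated. A minimal sketch with a hypothetical fixed-width record layout:

```python
"""Sketch: convert an EBCDIC record (code page cp037, an assumption) into a
delimited ASCII line. Real mainframe feeds may use other code pages (cp500,
cp1047); the field widths and delimiter below are hypothetical."""


def ebcdic_record_to_delimited(record, widths, delimiter="|"):
    """Decode an EBCDIC record and split it into delimiter-separated fields."""
    text = record.decode("cp037")          # EBCDIC bytes -> str
    fields, pos = [], 0
    for w in widths:
        fields.append(text[pos:pos + w].strip())
        pos += w
    return delimiter.join(fields)


# EBCDIC bytes for "HELLO" followed by a 3-character code "X12"
raw = b"\xc8\xc5\xd3\xd3\xd6\xe7\xf1\xf2"
# ebcdic_record_to_delimited(raw, (5, 3)) -> "HELLO|X12"
```

In the actual workflow, the delimited output would then be written out as Avro to HDFS rather than printed.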
Confidential, St. Louis, MO
- Created end-to-end Spark applications using Scala to perform data cleansing, validation, transformation, and summarization on user behavioral data.
- Developed a custom input adaptor utilizing the HDFS FileSystem API to ingest clickstream log files from an FTP server into HDFS.
- Developed an end-to-end data pipeline using the FTP adaptor, Spark, Hive, and Impala.
- Used Scala to write code for all Spark use cases.
- Involved in file movements between HDFS and AWS S3; worked extensively with S3 buckets in AWS and converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.
- Implemented design patterns in Scala for the application.
- Implemented Spark using Scala and utilized Spark SQL heavily for faster development and data processing.
- Explored Spark for improving performance and optimizing existing Hadoop algorithms using Spark Context, Spark SQL, DataFrames, pair RDDs, and YARN.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Java, and Scala.
- Used the Scala collections framework to store and process complex consumer information.
- Implemented a prototype to perform real-time streaming of data using Spark Streaming with Kafka.
- Imported data from different sources such as AWS S3 and the local file system into Spark RDDs.
- Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Handled importing other enterprise data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and loaded the data into HBase tables.
- Exported the analyzed data to relational databases using Sqoop, so the BI team could visualize it and generate reports.
- Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
- Analyzed the data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin) to study customer behavior.
- Worked extensively on importing metadata into Hive, migrated existing tables and applications to Hive and the AWS cloud, and made the data available in Athena and Snowflake.
- Implemented data ingestion for real-time processing using Kafka.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which receives data from Kafka in near real time and persists it to Cassandra.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for huge data volumes; worked on the MapR distribution and am familiar with HDFS.
- Created components such as Hive UDFs for functionality missing in Hive for analytics.
- Worked on various performance optimizations, such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Created, validated, and maintained scripts to load data using Sqoop manually.
- Used reporting tools such as Tableau, connected to Hive, to generate daily data reports.
- Created reports for the BI team, using Sqoop to export data into HDFS and Hive.
- Used Oozie and Oozie coordinators to deploy end-to-end data processing pipelines and schedule the workflows.
- Extensively used Stash (Bitbucket) for code control and worked on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
- Continuously monitored and managed the Hadoop cluster.
- Used the JUnit framework to perform unit testing of the application.
- Developed interactive shell scripts for scheduling various data cleansing and data loading processes.
- Performed data validation on the ingested data using Spark by building a custom model to filter out all invalid records and cleanse the data.
- Experience with data wrangling and creating workable datasets.
Environment: HDFS, Pig, Hive, Sqoop, Flume, Spark, MapReduce, Scala, Tableau, Oozie, Cassandra, YARN, UNIX Shell Scripting, Agile Methodology
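The validation bullet above boils down to a predicate applied across the dataset (in Spark, `rdd.filter(is_valid)`); the field names and rules below are hypothetical stand-ins for the actual custom model.

```python
"""Sketch of record-level validation like the custom Spark filter described
above. In Spark the predicate would be passed to rdd.filter(...); here it is
shown on a plain list. Field names and rules are hypothetical."""

REQUIRED_FIELDS = ("user_id", "event_ts", "event_type")


def is_valid(record):
    """A record is valid if all required fields are present and non-empty
    and user_id parses as a positive integer."""
    if any(not record.get(f) for f in REQUIRED_FIELDS):
        return False
    try:
        return int(record["user_id"]) > 0
    except ValueError:
        return False


def cleanse(records):
    """Keep only valid records (the filter step of the pipeline)."""
    return [r for r in records if is_valid(r)]
```

In the real pipeline the same predicate would run partition-parallel over the RDD, with the rejected records routed to a quarantine path for inspection.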
Confidential, Phoenix, AZ
- Automation of data pulls into HDFS from MySQL server and Oracle DB using Sqoop.
- Analyzing source data tables for best possible loading strategies.
- Involved in various stages of the project's SDLC, including planning, hardware and software estimation, and installation.
- Developed shell scripts to perform various ETL jobs, such as creating staging and final tables.
- Implemented a two-level staging process for data validation.
- Extracted data from staging tables and analyzed data using Impala.
- Implemented ad-hoc queries using Impala and created tables with partitioning and bucketing to load data.
- Created a Spark application to process and stream data from Kafka to MySQL.
- Implemented Hive incremental updates using a four-step strategy to load incremental data from RDBMS sources.
- Implemented and configured optimization techniques such as bucketing, partitioning, and file formats.
- Used Spark to analyze data in Hive, HBase, and HDFS.
- Involved in Hadoop cluster administration, including adding and removing cluster nodes, cluster capacity planning and management, and performance tuning.
- Monitored and debugged Hadoop jobs and applications running in production.
- Wrote Pig scripts to read data from HDFS and write it into Hive tables.
- Experienced in performance tuning of Hive scripts, Pig scripts, and MR jobs in the production environment by altering job parameters.
- Provided various hourly/weekly/monthly aggregation reports required by clients through Spark.
- Worked on data processing, mainly converting unstructured data into semi-structured data, and loaded it into Hive and HBase tables for integration.
- Loaded log data into HDFS using Flume.
- Wrote Apache Pig scripts to process the HDFS data.
- Developed Spark SQL scripts with Python for analysis and demo purposes.
Environment: MapReduce, Spark, HDFS, Pig, HBase, Oozie, Zookeeper, Sqoop, Linux, Kafka, Hadoop, Maven, NoSQL, MySQL, Hive, Java, Eclipse, Python.
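The "four-step strategy" for Hive incremental updates is commonly: (1) ingest the increment (e.g. via Sqoop), (2) reconcile it against the base table so the latest row per key wins, (3) compact the reconciled result into a table, and (4) swap it in as the new base. A sketch that generates the HiveQL for steps 2-4 (table and column names are hypothetical):

```python
"""Sketch of the four-step Hive incremental-update pattern:
1) ingest increments (Sqoop, not shown), 2) reconcile base + increment so the
latest row per key wins, 3) compact the reconciled view into a table,
4) swap it in as the new base. Names (base, incremental, id, modified_ts)
are hypothetical."""


def four_step_merge(base="base", incr="incremental",
                    key="id", ts="modified_ts"):
    """Return the HiveQL statements for steps 2-4 of the merge."""
    reconcile = (
        "CREATE VIEW reconcile AS "
        "SELECT t.* FROM (SELECT * FROM {b} UNION ALL SELECT * FROM {i}) t "
        "JOIN (SELECT {k}, MAX({ts}) AS max_ts "
        "FROM (SELECT * FROM {b} UNION ALL SELECT * FROM {i}) s "
        "GROUP BY {k}) m "
        "ON t.{k} = m.{k} AND t.{ts} = m.max_ts"
    ).format(b=base, i=incr, k=key, ts=ts)
    compact = "CREATE TABLE compacted AS SELECT * FROM reconcile"
    swap = ["DROP TABLE {b}".format(b=base),
            "ALTER TABLE compacted RENAME TO {b}".format(b=base)]
    return [reconcile, compact] + swap
```

Generating the statements keeps the pattern reusable across source tables; in production the same logic would typically live in a parameterized Oozie workflow.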
Confidential, Phoenix, AZ
- Involved in requirements analysis, design, development, and testing.
- Involved in setting up the different roles and maintained authentication to the application.
- Designed, deployed, and tested a multi-tier application using Java technologies.
- Involved in front-end development using JSP, HTML, and CSS.
- Implemented the application using servlets.
- Deployed the application on Oracle WebLogic Server.
- Implemented multithreading concepts in Java classes to avoid deadlocks.
- Used MySQL database to store data and execute SQL queries on the backend.
- Prepared and maintained the test environment.
- Tested the application before going live to production.
- Documented and communicated test results to the team lead on a daily basis.
- Involved in weekly meetings with team leads and the manager to discuss project issues and status.
Environment: J2EE (Java, JSP, JDBC, multithreading), HTML, Oracle WebLogic Server, Eclipse, MySQL, JUnit.
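The deadlock-avoidance bullet above usually comes down to one discipline: every thread acquires shared locks in the same global order, so a circular wait can never form. The original work was in Java; this is an illustrative sketch in Python with made-up account objects.

```python
"""Sketch of deadlock avoidance by consistent lock ordering: two threads that
each need both locks always acquire them in the same (id-based) order, so the
circular wait required for a deadlock cannot form. Accounts are hypothetical."""
import threading


class Account:
    _counter = 0

    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()
        Account._counter += 1
        self.order = Account._counter    # position in the global lock order


def transfer(src, dst, amount):
    """Move funds, locking both accounts in the fixed global order."""
    first, second = sorted((src, dst), key=lambda a: a.order)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


a, b = Account(100), Account(100)
# Opposite-direction transfers: without ordered locking these could deadlock.
t1 = threading.Thread(target=transfer, args=(a, b, 30))
t2 = threading.Thread(target=transfer, args=(b, a, 10))
t1.start(); t2.start()
t1.join(); t2.join()
```

The same idea in Java is nested `synchronized` blocks ordered by a stable key (e.g. `System.identityHashCode` with a tie-breaker lock).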