- Around 9 years of professional experience in the IT industry, including 6 years developing applications with Big Data tools in the Apache Hadoop/Spark ecosystems.
- Experience with Hadoop ecosystem components such as HDFS, MapReduce, Spark, Hive, Pig, Sqoop, HBase, Kafka, and Oozie for data analytics.
- Experience in writing Spark applications using Python and Scala.
- Experienced in NoSQL databases such as HBase, MongoDB, and Cassandra, including writing advanced queries and subqueries.
- Developed Scala UDFs to process data for analysis.
- Experienced in real-time analytics with Apache Spark RDDs, DataFrames, and the Streaming API.
- Responsible for writing MapReduce programs.
- Scheduled all Hadoop/Hive/Sqoop/HBase jobs using Oozie.
- Experienced in loading data into Hive partitions, creating buckets in Hive, and developing MapReduce jobs to automate the transfer of data from HBase.
- Set up clusters on Amazon EC2 and S3, including automating the setup and extension of clusters in AWS.
- Implemented Spark Streaming jobs in Scala by developing RDDs (Resilient Distributed Datasets), and used PySpark and spark-shell as needed.
- Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data.
- Experienced in integrating Hadoop with Kafka and in uploading clickstream data to HDFS.
- Expert in using Kafka as a publish-subscribe messaging system.
- Created templates and wrote Shell scripts (Bash), Ruby, Python and PowerShell for automating tasks.
- Good knowledge of and hands-on experience with monitoring tools such as Splunk and Nagios.
- Experience with JSP, Servlets, Struts, JavaBeans, Apache Tomcat, WebLogic, WebSphere, JBoss, JDBC, RMI, Ajax, Unix, WSDL, XML, AWS, Vertica, Spring, Hibernate, AngularJS, and JMS.
- Gathered and defined functional and UI requirements for software applications
- Practiced Agile Scrum methodology and contributed to TDD, CI/CD, and all aspects of the SDLC.
- Experience in developing and maintaining applications on the AWS platform.
- Experienced in integrating deployments with multiple build systems and in providing an application model that handles multiple projects.
- Hands-on experience integrating REST APIs with cloud environments to access resources.
Hadoop Technologies: Apache Hadoop, Cloudera Hadoop Distribution, HDFS, YARN, MapReduce, Hive, Pig, Sqoop, Flume, Spark, Kafka, ZooKeeper, and Oozie
Java/J2EE Technologies: Core Java, Servlets, Hibernate, Spring, Struts.
NoSQL Databases: HBase, MongoDB
Programming Languages: Java, Scala, SQL, PL/SQL, Pig Latin, HiveQL, Unix, JavaScript, Shell Scripting
Application Servers: WebLogic, WebSphere, JBoss
Cloud Computing tools: Amazon AWS.
Build Tools: Jenkins, Maven, ANT
Databases: MySQL, Oracle, DB2
Business Intelligence Tools: Splunk, Talend
Development Methodologies: Agile/Scrum, Waterfall.
Development Tools: Microsoft SQL Studio, Toad, Eclipse, NetBeans.
Operating Systems: Windows, macOS, UNIX, Linux.
Sr BigData/Hadoop Developer
- Used Spark SQL to read Parquet data and create tables in Hive using the Scala API.
- Developed Spark Streaming jobs in Scala to consume data from Kafka topics, transform the data, and insert it into HBase.
- Performed real-time streaming transformations on data using Kafka and Kafka Streams.
- Built a NiFi dataflow to consume data from Kafka, transform it, and place it in HDFS, and exposed a port to run a Spark Streaming job.
- Worked on NoSQL databases including HBase and MongoDB. Configured MySQL Database to store Hive metadata.
- Strong understanding of AWS components such as EC2 and S3
- Utilized the Apache Hadoop environment provided by the Cloudera distribution.
- Developed data pipelines using StreamSets Data Collector to store data from Kafka into HDFS, Solr, Elasticsearch, HBase, and MapR-DB.
- Deployed and administered Splunk and the Hortonworks distribution.
- Configured ZooKeeper, Cassandra, and Flume on the existing Hadoop cluster.
- Automated all the jobs for pulling data from FTP server to load data into Hive tables, using Oozie workflows.
- Worked extensively with Hive, SQL, Scala, Spark, and Shell.
- Developed complex queries using Hive and Impala.
- Used Spark Streaming to consume live data from Kafka and store the stream in HDFS.
- Developed MapReduce jobs on YARN and Hadoop clusters to produce daily and monthly reports.
- Developed a workflow using Oozie to automate the tasks of loading data into HDFS and analyzing it.
- Created Airflow Scheduling scripts in Python
- Wrote various data normalization jobs for new data ingested into Redshift
- Advanced knowledge of Amazon Redshift and MPP database concepts.
- Developed analytical jobs in Python using the PySpark API.
- Used PySpark to access Spark libraries from Python scripts for data analysis.
- Developed Python scripts to import and export data from relational sources, and handled incremental loading of customer and transaction data by date.
- Streamed events from HBase to Solr using the Lily HBase Indexer.
- Loaded data from CSV files into Spark, created DataFrames, and queried the data using Spark SQL.
- Created external tables in Hive, loaded JSON-format log files, and ran queries using HiveQL.
- Created a Unix shell script that runs the PySpark scripts used to load data into Hive tables and captures their execution time.
- Created Teradata schemas with constraints, along with macros, functions, and procedures in Teradata. Loaded the data using the FastLoad utility.
- Setup of Hadoop Cluster on AWS, which includes configuring different components of Hadoop.
- Wrote AWS Lambda functions in Python to automatically trigger AWS EMR using the latest AMIs with all functional dependencies.
- Designed HBase row keys and data models for inserting data into HBase tables, using lookup tables and staging tables.
- Captured log metrics with Elasticsearch, Logstash, and Kibana, and used Grafana for monitoring.
- Extensively used microservices and Postman to send requests to the Kubernetes DEV and Hadoop clusters.
- Experience in developing Docker images and deploying Docker containers in Swarm.
- Worked with MapR, Cloudera and Hortonworks platforms as part of a Proof of concept.
- Performed Raw Data Ingestion into Amazon RDS from AWS S3 using Apache Spark Framework on AWS EMR
- As part of a POC, set up Amazon Web Services (AWS) to evaluate whether Hadoop was a feasible solution.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig Scripts.
- Explored Spark to improve the performance of, and optimize, existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, pair RDDs, Storm, and Spark on YARN.
- Utilized Databricks on Microsoft Azure to run Spark jobs and Blob Storage services to process data.
- Worked with the admin team on upgrades in the AWS environment and performed regression testing.
Environment: HDFS, Hadoop, Kafka, MapReduce, NiFi, Pig, AWS, Elasticsearch, Spark, Impala, Hive, PySpark, Cloudera, Avro, Parquet, Grafana, Scala, Java, HBase, Cassandra, Hortonworks, ZooKeeper, Microsoft Azure, Azure Data Lake, Azure Blob Storage.
Confidential, Plainsboro, NJ
- Developed Spark applications using Scala and Java, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Used the Kafka Consumer API in Scala to consume data from Kafka topics.
- Created a Sqoop job to bring data from Oracle into HDFS and created external tables in Hive.
- Used PySpark and Hive to analyze sensor data and cluster users based on their behavior in the events.
- Created External Tables in Hive and saved in ORC file format.
- Ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra as per the business requirements.
- Built a data pipeline using Pig to store data in HDFS.
- Used HiveQL for data analysis and to import structured data into specific tables for reporting.
- Wrote Python scripts to parse XML documents and load the data into the database.
- Worked with Hive to create value-added procedures, and wrote Hive UDFs to make functions reusable across different models.
- Loaded the dataset into Hive for ETL (Extract, Transform, and Load) operations.
- Implemented Kafka model which pulls the latest records into Hive external tables.
- Involved in developing code to write canonical model JSON records from numerous input sources to Kafka Queues.
- Developed a Spark script with Apache NiFi to perform the source-to-target mapping according to the design document developed by the designers.
- Worked extensively on AWS components like Elastic MapReduce (EMR), Elastic Compute Cloud (EC2), and Simple Storage Service (S3).
- Used Amazon CloudWatch to monitor and track resources on AWS.
- Developed DataFrames for data transformation rules.
- Deployed various microservices, such as Spark, MongoDB, and Cassandra, in Kubernetes and Hadoop clusters using Docker.
- Developed Spark SQL queries to join source tables with multiple driving tables and created a targeted table in Hive.
- Optimized the code using PySpark for better performance.
- Developed Python code defining tasks, dependencies, an SLA watcher, and a time sensor for each job, for workflow management and automation with Airflow.
- Developed a Spark application to do the source to target mapping.
- Involved in running Hadoop streaming jobs to process terabytes of text data. Worked with different file formats such as Text, Sequence files, Avro, ORC and Parquet.
- Collected data using Spark Streaming and loaded it into HBase.
- Used Jupyter Notebook for Spark SQL and scheduled cron jobs using spark-submit.
- Developed a Python script to start and end jobs cleanly in a UC4 workflow.
- Worked on a NiFi data pipeline to process large data sets and configured lookups for data validation and integrity.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into RDBMS through Sqoop.
- Used Apache Kafka to develop a data pipeline that treats logs as a stream of messages, using producers and consumers.
- Developed Python scripts to clean the raw data.
- Experienced in writing Spark Applications in Scala and Python.
- Developed and analyzed SQL scripts and designed a solution implemented with PySpark.
- Leveraged cloud and GPU computing technologies on platforms such as AWS and GCP for automated machine learning and analytics pipelines.
- Experienced in Cauterize NiFi Pipeline on EC2 nodes integrated with Spark, Kafka, and Postgres.
- Fetched data and generated monthly reports, and visualized those reports using Tableau.
Environment: Hadoop, Hive, Spark, Pig, Scala, Kafka, Oozie, HBase, HDFS, Hortonworks, Linux, Sqoop, Oracle, PySpark, Shell Scripting, Agile methodology, UC4, Airflow, JIRA, NiFi, Tableau, Jupyter Notebook, AWS Tools (S3, EMR, EC2, CloudWatch)
- Wrote Unix/Linux shell scripts for scheduling jobs, as well as Pig scripts and HiveQL queries.
- Involved in database design and in developing SQL queries and stored procedures on MySQL.
- Set up Hadoop MapReduce and HDFS for Big Data processing, and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
- Performed optimization tasks such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Used Hive and Impala to query the data in HBase.
- Worked on Spark Data streaming process to parse the XML data files into Hadoop Hive tables as per the requirement.
- Developed PySpark code for saving data in Avro and Parquet formats and building Hive tables on top of them.
- Created Hive tables, loaded data, and wrote Hive UDFs, including UDFs for rating aggregation.
- Migrated the on-premises database structure to an Amazon Redshift data warehouse.
- Developed numerous MapReduce jobs in Scala for data cleansing, and analyzed data in Impala.
- Created a data pipeline for ingestion and aggregation events that loads consumer response data from an AWS S3 bucket into Hive external tables at an HDFS location, serving as a feed for Tableau dashboards.
- Used EMR (Elastic MapReduce) to perform Big Data operations in AWS.
- Developed a UI application using AngularJS, integrated with Elasticsearch to consume REST services.
- Provided cluster coordination services through ZooKeeper.
- Used Sqoop to extract data from Oracle, SQL Server, and MySQL databases into HDFS.
- Used the Spring data access framework to automatically acquire and release database resources, and its data access exception hierarchy for better handling of database connections with JDBC.
- Worked with a team of developers and testers to resolve server timeout and database connection pooling issues.
- Created procedures and macros in Teradata.
- Implemented several JUnit test cases.
- Implemented application logging with Log4j to better trace data flow on the application server.
- Responded to requests from technical team members to prepare TAR and configuration files for production migration.
Environment: Hadoop, Spark, Scala, Kafka, Linux, Pig, HiveQL, MySQL, MapR, HDFS, Impala, Python, Java, AWS Tools (EMR, EC2, S3), ZooKeeper, Sqoop, JUnit, Log4j.
- Analyzed functional specifications based on project requirements.
- Ingested data from various data sources into Hadoop HDFS/Hive tables using Sqoop, Flume, and Kafka.
- Extended Hive core functionality by writing custom UDFs in Java.
- Developed Hive queries to meet user requirements.
- Worked on multiple POCs implementing a data lake for multiple data sources, including Teamcenter, SAP, Workday, and machine logs.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Worked on the MS SQL Server PDW migration for the MSBI warehouse.
- Planned, scheduled, and implemented Oracle to MS SQL Server migrations for AMAT in-house applications and tools.
- Worked on the Solr search engine to index incident report data and developed dashboards in the Banana reporting tool.
- Integrated Tableau with Hadoop data source for building dashboard to provide various insights on sales of the organization.
- Built BI reports in Tableau on top of Spark, integrating Tableau with Spark through Spark SQL.
- Developed workflows in LiveCompare to analyze SAP data and reporting.
- Worked on Java development, support, and tooling support for in-house applications.
- Participated in daily scrum meetings and iterative development.
- Developed search functionality for searching through millions of files belonging to logistics groups.
Environment: Hadoop, Hive, Sqoop, Spark, Kafka, Scala, MS SQL Server PDW, TFS, Java.
- Attended iteration pre-planning and planning meetings to understand the stories, discuss acceptance criteria, and provide estimates.
- Designed and developed various modules using Java/J2EE technologies.
- Developed REST services using the Spring framework.
- Performed JAR upgrades in the applications to maintain application standards.
- Extensively used SQL queries, PL/SQL stored procedures & triggers in data retrieval and updating of information in the Oracle database using JDBC.
- Used Log4j to create log files for debugging and tracing the application.
- Prepared unit test cases and performed unit testing using JUnit.
- Reported daily status in Scrum calls.
- Worked on production support, resolving production issues.
- Coordinated with the Configuration Management team on environment challenges.
- Designed and developed new automation projects and frameworks using Java technologies.
- Integrated new automation projects with Jenkins by writing Ant compilation scripts and build and deploy scripts, deploying to a shared NAS, and creating Jenkins jobs to run test scripts on a testing slave.
- Performed deployments to the Dev, QA, Staging, and Production environments.
Environment: Java, J2EE, REST, Spring, SQL Server, TFS, Jenkins, Tomcat, Linux, ServiceNow.