- 8+ years of IT experience in software analysis, design, development, testing and implementation of Big Data, Hadoop, NoSQL and Java/J2EE technologies.
- In-depth experience with and good knowledge of Hadoop ecosystem tools such as MapReduce, HDFS, Pig, Hive, Kafka, YARN, Sqoop, Storm, Spark, Oozie and ZooKeeper.
- Excellent understanding and extensive knowledge of Hadoop architecture and its ecosystem components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode and the MapReduce programming paradigm.
- Good experience with Apache Hadoop along with the enterprise versions from Cloudera and Hortonworks.
- Good knowledge of the MapR distribution and Amazon EMR.
- Good knowledge of data modeling, use case design and object-oriented concepts.
- Well versed in installing, configuring, supporting and managing Big Data systems and the underlying infrastructure of Hadoop clusters.
- Experience working with AWS cloud services (VPC, EC2, S3, Redshift, Data Pipeline, EMR, DynamoDB, Lambda and SQS).
- Good knowledge of Spark components such as Spark SQL, MLlib, Spark Streaming and GraphX.
- Worked extensively with Spark Streaming and Apache Kafka to fetch live stream data.
- Experience in converting Hive/SQL queries into RDD transformations using Apache Spark, Scala and Python.
- Good knowledge of using the Databricks platform, Cloudera Manager and the Hortonworks distribution to monitor and manage clusters.
- Implemented dynamic partitions and buckets in Hive for efficient data access.
- Experience in data processing tasks such as collecting, aggregating and moving data from various sources using Apache Flume and Kafka.
- Integrated Hive queries into the Spark environment using Spark SQL.
- Hands-on experience in performing real-time analytics on big data using HBase and Cassandra in Kubernetes and Hadoop clusters.
- Experience in using Flume to stream data into HDFS.
- Good working experience using Sqoop to import data into HDFS from RDBMS and vice-versa.
- Good knowledge of developing data pipelines using Flume, Sqoop and Pig to extract data from weblogs and store it in HDFS.
- Created User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) in Pig and Hive.
- Good knowledge of using job scheduling and monitoring tools such as Oozie and ZooKeeper.
- Hands-on experience working with NoSQL databases including HBase, Cassandra and MongoDB, and their integration with Hadoop and Kubernetes clusters.
- Proficient in cluster management and in configuring Cassandra databases.
- Extensive experience in developing Pig Latin Scripts and using Hive Query Language for data analytics.
- Good working experience with different file formats (Parquet, text file, Avro, ORC) and compression codecs (Gzip, Snappy, LZO).
- Built secure AWS solutions by creating VPCs with private and public subnets.
- Expertise in configuring Amazon Relational Database Service (RDS).
- Experience working with Java/J2EE, JDBC, ODBC, JSP, Eclipse, JavaBeans, EJB and Servlets.
- Expert in developing web page interfaces using JSP, Java Swing and HTML.
- Experience working with the Spring and Hibernate frameworks for Java.
- Experience in using IDEs such as Eclipse, NetBeans and IntelliJ.
- Proficient in using version control tools such as Git, VSS, SVN and PVCS.
- Development experience with database systems such as Oracle, MS SQL Server, Teradata and MySQL.
- Developed stored procedures and queries using PL/SQL.
- Hands-on experience with best practices for web services development and integration (both REST and SOAP).
- Experience in working with build tools such as Ant, Maven, SBT and Gradle to build and deploy applications to servers.
- Expertise in Object Oriented Analysis and Design (OOAD) and knowledge in Unified Modeling Language (UML).
- Expertise in the complete Software Development Life Cycle (SDLC) in Waterfall and Agile/Scrum models.
Hadoop Ecosystem: Hadoop, HDFS, MapReduce, Hive, Impala, Pig, Sqoop, Oozie, Zena, Zeke Scheduling, ZooKeeper, Flume, Kafka, Spark Core, Spark SQL, Spark Streaming, AWS, Azure Data Lake
NoSQL Databases: HBase, Cassandra, MongoDB
Cloud: AWS, EC2, EC3, ELK, Azure, Azure SQL Database, Azure SQL Data Warehouse, Azure Analysis Services, Azure Data Lake and Data Factory
Build Management Tools: Maven, Apache Ant
Java & J2EE Technologies: Core Java, Servlets, JSP, JDBC, JNDI, Java Beans
Languages: C, C++, JAVA, SQL, PL/SQL, PIG Latin, HiveQL, UNIX shell scripting
Frameworks: MVC, Spring, Hibernate, Struts 1/2, EJB, JMS, JUnit, MR-Unit
Version Control and CI: GitHub, Jenkins
IDE and Tools: Eclipse 4.6, NetBeans 8.2, BlueJ
Databases: Oracle 12c/11g, MS SQL Server 2016/2014, DB2 & MySQL 4.x/5.x
Methodologies: Software Development Life Cycle (SDLC), Waterfall, Agile, Software Testing Life Cycle (STLC), UML, Design Patterns (Core Java and J2EE)
Lead Hadoop Developer
- Responsible for building scalable distributed data solutions using Hadoop.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which reads data from Kafka in near real time and persists it to Cassandra.
- Configured AWS Lambda functions to log changes in AWS resources.
- Used AWS Lambda to run code without managing servers, triggered by S3 and SQS events.
- Used the Spark framework to transform data for final consumption by analytical applications.
- Used DataStax Spark-Cassandra connector to load data into Cassandra and used CQL to analyze data from Cassandra tables for quick searching, sorting and grouping.
- Loaded data into Spark RDDs and performed advanced procedures such as text analytics using Spark's in-memory computation capabilities to generate the output response.
- Developed the statistics graph using JSP, Custom tag libraries, Applets and Swing in a multi-threaded architecture.
- Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.
- Developed batch scripts to fetch data from AWS S3 and apply the required transformations in Scala using the Spark framework.
- Developed a data pipeline using Pig and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
- Migrated data into the data pipeline using Databricks, Spark SQL and Scala.
- Worked extensively on importing metadata into Hive using Scala, and migrated existing tables and applications to Hive and the AWS cloud.
- Executed many performance tests using the Cassandra-stress tool to measure and improve the read and write performance of the cluster.
- Handled large datasets during ingestion using Spark partitioning, broadcast variables, efficient joins and other transformations.
- Configured Spark Streaming to consume Kafka streams and store the data in HDFS.
- Migrated an existing on-premises application to AWS, using EC2 and S3 for processing and storing small data sets, and maintained the Hadoop cluster on AWS EMR.
- Migrated Hive and MapReduce jobs from on-premises MapR to the AWS cloud using EMR.
- Partitioned data streams using Kafka, and designed and used Kafka producer APIs to produce messages.
- Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
- Used AWS SQS to consume data from S3 buckets.
- Tuned Spark applications by setting the right batch interval, level of parallelism and memory configuration.
- Ingested data from RDBMS into Hive to perform data transformations, and then exported the transformed data to Cassandra for data access and analysis.
- Implemented Spark scripts using Scala and Spark SQL to load Hive tables into Spark for faster data processing.
- Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL databases for large data volumes.
- Integrated visualizations into a Spark application using Databricks.
- Participated in daily Scrum meetings to discuss development progress and helped make them more productive.
- Implemented Informatica Procedures and Standards while developing and testing the Informatica objects.
Environment: Hadoop 3.0, Spark 2.1, Cassandra 1.1, Databricks, Kafka 0.9, JSP, HDFS, AWS, EC2, Hive 1.9, MapReduce, MapR, Java, MVC, Scala, NoSQL
Big Data Developer
- As a Big Data Developer, followed Agile Scrum methodology to help manage and organize a team of 4 developers with regular code review sessions.
- Participated in code reviews, enhancement discussions, maintenance of existing pipelines and systems, and testing and bug-fix activities on an ongoing basis.
- Worked closely with business analysts to convert business requirements into technical requirements and prepared low- and high-level documentation.
- Worked with the ETL team to understand data ingestion from ETL into Azure Data Lake in order to develop predictive analytics.
- Built a prototype Azure Data Lake application that accesses third-party data services via web services.
- Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data from Kafka to HDFS.
- Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
- Created various documents such as the Source-to-Target Data Mapping Document, Unit Test Cases and Data Migration Document.
- Imported data from structured data sources into HDFS using Sqoop incremental imports.
- Created Hive tables, partitions and implemented incremental imports to perform ad-hoc queries on structured data.
- Worked with Azure ExpressRoute to create private connections between Azure datacenters and infrastructure on premises or in a co-location environment.
- Improved the performance of existing algorithms in Hadoop using SparkContext, Spark SQL and Spark on YARN.
- Built a Data Sync job on Windows Azure to synchronize data from SQL Server 2012 databases to SQL Azure.
- Developed SQL scripts using Spark to handle different data sets and compared their performance against MapReduce jobs.
- Converted MapReduce programs into Spark transformations on RDDs using Scala and Python.
- Supported MapReduce programs running on the cluster and wrote MapReduce jobs using the Java API.
- Wrote complex SQL and PL/SQL queries for stored procedures.
- Used Cloudera Manager for installation and management of Hadoop Cluster.
- Developed a data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
- Converted HiveQL queries into Spark transformations on RDDs using Scala.
- Integrated Kafka with Spark Streaming for high throughput and reliability.
- Tuned Hive and Pig scripts to improve performance and resolved performance issues in both.
- Created Azure Event Hubs for application instrumentation and for user experience or workflow processing.
- Implemented security in web applications using Azure and deployed web applications to Azure.
- Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
Environment: Agile, Hive, MS SQL Server 2012, Sqoop, Azure Data Lake, Databricks, Storm, Kafka, HDFS, AWS, Data mapping, Hadoop, YARN, MapReduce, RDBMS, Data Lake, Python, Scala, DynamoDB, Flume, Pig
Confidential - Redmond, WA
Big Data/Hadoop Developer
- Involved in Agile methodologies, daily scrum meetings and sprint planning.
- Ingested data into HDFS using Sqoop, wrote custom input adapters (Network Adapter, FTP Adapter and S3 Adapter), and analyzed the data using Spark (DataFrames and Spark SQL) and a series of Hive scripts to deliver summarized results from Hadoop to downstream systems.
- Responsible for data extraction and data ingestion from different data sources into Hadoop Data Lake by creating ETL pipelines using Hive.
- Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Used Spark DataFrames, Spark SQL and Spark MLlib extensively.
- Developed RDDs/DataFrames in Spark using Scala and Python and applied several transformations to load data from the Hadoop Data Lake into Cassandra.
- Worked on Hive partitioning and bucketing, performed joins on Hive tables, and used Hive SerDes such as Regex, JSON and Avro.
- Integrated Kafka with Spark Streaming for real-time data processing.
- Worked with the NoSQL database HBase to perform real-time data analytics using Apache Spark with both Scala and Python.
- Worked closely with the data science team to build Spark MLlib applications for various predictive models.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Uploaded streaming data from Kafka to HDFS, HBase and Hive by integrating with Storm.
- Analyzed web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration and the most visited pages on the website.
- Supported data analysis projects using Elastic MapReduce on the Amazon Web Services (AWS) cloud, and exported and imported data to and from S3.
- Worked with MongoDB using CRUD (Create, Read, Update and Delete) operations, indexing, replication and sharding features.
- Designed the HBase row key to store text and JSON as key-values in HBase tables so that get/scan operations return data in sorted order.
- Integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Pig, Hive and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
- Worked on custom Talend jobs to ingest, enrich and distribute data in Cloudera Hadoop ecosystem.
- Involved in PL/SQL query optimization to reduce the overall run time of stored procedures.
- Created Hive tables and queried them using HiveQL.
- Designed and implemented partitioning (static and dynamic) and bucketing in Hive.
- Developed multiple POCs using PySpark and deployed them on the YARN cluster, compared the performance of Spark with Hive and SQL, and took part in the end-to-end implementation of ETL logic.
- Developed syllabus/curriculum data pipelines from Syllabus/Curriculum web services to HBase and Hive tables.
- Worked on cluster coordination services through ZooKeeper.
- Monitored workload, job performance and capacity planning using Cloudera Manager.
- Built applications using Maven and integrated them with CI servers such as Jenkins to run build jobs.
- Exported the analyzed data to the RDBMS using Sqoop to generate reports for the BI team.
- Worked collaboratively with all levels of business stakeholders to architect, implement and test Big Data based analytical solutions from disparate sources.
- Created cubes in Talend to produce different types of aggregations of the data and to visualize them.
Environment: Hadoop, HDFS, Spark, AWS, S3, Scala, ZooKeeper, MapReduce, Hive, Pig, Sqoop, HBase, Cassandra, MongoDB, Tableau, Java, Maven, UNIX Shell Scripting.