- 8 years of total IT experience, including 6+ years in Hadoop and Big Data, with hands-on experience in Java/J2EE technologies.
- Experienced in designing and implementing solutions using Apache Hadoop 2.4.0, HDFS 2.7, MapReduce2, HBase 1.1, Hive 1.2, Oozie 4.2.0, Tez 0.7.0, YARN 2.7.0, Sqoop 1.4.6, and MongoDB.
- Knowledge of implementing Hortonworks (HDP 2.3 and HDP 2.1) and Cloudera (CDH3, CDH4, CDH5) distributions on Linux.
- Configured NameNode high availability and NameNode federation, along with disaster recovery and backup activities.
- Improved the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Performed multi-node setup of Hadoop clusters, along with performance tuning and benchmarking.
- Handled security integration, monitoring, maintenance, and troubleshooting of Hadoop clusters.
- Good knowledge of Kerberos security; successfully maintained clusters by adding and removing nodes.
- Set up and integrated Hadoop ecosystem tools: HBase, Hive, Pig, Sqoop, etc.
- Familiar with writing Oozie workflows and job controllers for job automation, including Hive automation.
- Experience developing Spark jobs using Scala in a test environment for faster data processing, and used Spark SQL for querying.
- Well versed in writing Hive queries and in Hive query optimization by assigning jobs to different queues.
- Experience importing and exporting data between relational databases such as MySQL and HDFS/HBase using Sqoop.
- Strong knowledge of configuring high availability for NameNode, HBase, Hive, and ResourceManager on clusters using Apache, Cloudera, and MapR distributions.
- Experience in deploying and managing multi-node development and production Hadoop clusters with different Hadoop components (Hive, Pig, Sqoop, Oozie, Flume, HCatalog, HBase, ZooKeeper) using Hortonworks Ambari.
- Expertise in core Java, J2EE, multithreading, JDBC, Hibernate, shell scripting, Servlets, JSP, Spring, Struts, EJBs, and Web Services, and proficient in using Java APIs for application development.
- Achieved optimum performance in HBase with data compression, region splits, and manually managed compactions.
- Upgraded clusters from HDP 2.1 to HDP 2.2 and then to HDP 2.3, with good knowledge of cluster monitoring tools such as Ganglia and Nagios.
- Working experience with the MapReduce programming model and HDFS.
- In-depth understanding of Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce concepts.
- Hands-on experience in Unix/Linux environments, including software installations/upgrades, shell scripting for job automation, and other maintenance activities.
- Sound knowledge of Oracle 9i, Core Java, JSP, and Servlets, and experience with SQL and PL/SQL concepts: database stored procedures, functions, and triggers.
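The Sqoop-based import/export work above follows a standard CLI invocation; as a minimal sketch (the connection URL, table, target directory, and user name are hypothetical placeholders, not details from any actual engagement), the argument list can be assembled like this:

```python
def build_sqoop_import(jdbc_url, table, target_dir, username, num_mappers=4):
    """Assemble the argument list for a `sqoop import` invocation.

    All connection details passed in are illustrative placeholders.
    """
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]

# Example: import a MySQL table into HDFS (hypothetical names).
cmd = build_sqoop_import(
    "jdbc:mysql://dbhost:3306/sales", "orders", "/data/raw/orders", "etl_user"
)
```

In practice the list would be handed to `subprocess.run` on an edge node where the Sqoop client is installed.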
Big data/Hadoop: Hadoop 2.7/2.5, HDFS 1.2.4, MapReduce, Hive, Pig, Sqoop, Oozie, Hue
NoSQL Databases: HBase, MongoDB 3.2 & Cassandra
Java/J2EE Technologies: Servlets, JSP, JDBC, JSTL, EJB, JAXB, JAXP, JMS, JAX-RPC, JAX-WS
Programming Languages: Java, Python, SQL, PL/SQL, AWS, HiveQL, UNIX Shell Scripting, Scala
IDE and Tools: Eclipse 4.6, Netbeans 8.2, BlueJ
Database: Oracle 12c/11g, MYSQL, SQL Server 2016/2014
Application Server: Apache Tomcat, JBoss, IBM WebSphere, WebLogic
Operating Systems: Windows 8/7, UNIX/Linux and Mac OS.
Other Tools: Maven, ANT, WSDL, SOAP, REST.
Methodologies: Software Development Lifecycle (SDLC), Waterfall, Agile, UML, Design Patterns (Core Java and J2EE)
Senior Bigdata/Hadoop Developer
Confidential, Seattle, WA
- Analyzed weblog data using HiveQL; integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
- Transferred purchase transaction details from legacy systems to HDFS.
- Developed Java MapReduce programs to transform log data into a structured form for deriving user location, age group, and time spent.
- Developed Pig UDFs for manipulating data as per business requirements, and worked on developing custom Pig loaders.
- Collected and aggregated large amounts of weblog data from different sources such as web servers, mobile devices, and network devices using Apache Flume, and stored the data in HDFS for analysis.
- Worked on the ingestion of files into HDFS from remote systems using MFT (Managed File Transfer).
- Experience in monitoring and managing Cassandra cluster.
- Developed analytical components using Scala, Spark, Apache Mesos and Spark Stream.
- Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
- Wrote the MapReduce jobs to parse the weblogs which are stored in HDFS.
- Developed the services to run MapReduce jobs on an as-needed basis.
- Used R to prototype on sample data, exploring it to identify the best algorithmic approach, then wrote Scala scripts using Spark's machine learning module (MLlib).
- Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS; implemented a Python-based distributed random forest via Python streaming.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, Caffe, TensorFlow, MLlib, and Python, with a broad variety of machine learning methods including classification, regression, and dimensionality reduction.
- Worked on querying data using Spark SQL on top of the PySpark engine.
- Imported and exported data between HDFS and an Oracle 10.2 database using Sqoop.
- Extracted files from the Cassandra NoSQL database through Sqoop and placed them in HDFS for processing.
- Responsible for managing data coming from different sources.
- Analyzed the data using Pig to extract the number of unique patients per day and the most purchased medicines.
- Wrote UDFs for Hive and Pig that helped spot market trends.
- Good knowledge of running Hadoop streaming jobs to process terabytes of XML-format data.
- Analyzed the Functional Specifications.
- Implemented the workflows using Apache Oozie framework to automate tasks.
Environment: Hadoop, HDFS, Pig, Hive, Tez, Accumulo, Flume, Spark SQL, Sqoop, Oozie, Cassandra.
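The weblog-parsing MapReduce jobs above can be simulated locally in Python in the Hadoop streaming style; this is an illustrative simplification (the log layout and the status-code field position are assumed for the example, not taken from the actual logs):

```python
from collections import defaultdict

def mapper(lines):
    """Map phase: emit (status_code, 1) for each weblog line.

    Assumes a simplified common-log-style format where the HTTP status
    code is the second-to-last field; real log layouts vary.
    """
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[-2], 1

def reducer(pairs):
    """Reduce phase: sum the counts per status code."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Local simulation of the streaming job on two sample records.
sample = [
    '127.0.0.1 - - [10/Oct/2023] "GET /index.html HTTP/1.1" 200 1043',
    '127.0.0.1 - - [10/Oct/2023] "GET /missing HTTP/1.1" 404 512',
]
result = reducer(mapper(sample))
```

In a real streaming job, the mapper and reducer would read stdin and write tab-separated key/value pairs to stdout, with Hadoop handling the shuffle between them.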
Confidential, Dover, NH
- Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Worked on installing and configuring Hortonworks HDP 2.x clusters in development and production environments.
- Built a Kafka-Spark-Cassandra simulator in Scala for Met stream, a big data consultancy, along with Kafka-Spark-Cassandra prototypes.
- Worked on Capacity planning for the Production Cluster.
- Installed HUE Browser.
- Involved in loading data from the UNIX file system into HDFS, creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Worked on installing Hortonworks 2.1 on AWS Linux servers and configuring Oozie jobs.
- Created a complete processing engine based on the Hortonworks distribution, tuned for performance.
- Performed a cluster upgrade from HDP 2.1 to HDP 2.3.
- Configured queues in the Capacity Scheduler and took snapshot backups of HBase tables.
- Worked on fixing cluster issues and configuring high availability for the NameNode in HDP 2.1.
- Involved in Cluster Monitoring backup, restore and troubleshooting activities.
- Responsible for implementation and ongoing administration of the Hadoop infrastructure.
- Managed and reviewed Hadoop log files.
- Imported and exported data between relational databases such as MySQL and HDFS/HBase using Sqoop.
- Worked on configuring Kerberos authentication in the cluster.
- Very good experience with the Hadoop ecosystem components in a UNIX environment.
- Experience with UNIX administration.
- Worked on installing and configuring Solr 5.2.1 in Hadoop cluster.
- Hands-on experience in installation, configuration, management, and development of big data solutions using Hortonworks distributions.
- Worked on indexing HBase tables using Solr, including indexing JSON and nested data.
- Hands-on experience installing and configuring Spark and Impala.
- Successfully installed and configured queues in the Capacity Scheduler and the Oozie scheduler.
- Worked on configuring queues and on performance optimization of Hive queries, performing tuning at the cluster level and adding users to the clusters.
- Responsible for cluster maintenance, monitoring, commissioning and decommissioning of data nodes, troubleshooting, and managing and reviewing data backups and log files.
- Day-to-day responsibilities included solving developer issues, handling deployments (moving code from one environment to another), providing access to new users, and providing prompt solutions to reduce impact, documenting them to prevent future issues.
- Added/installed new components and removed them through Ambari.
- Collaborating with application teams to install operating system and Hadoop updates, patches, version upgrades.
- Monitored workload and job performance, and performed capacity planning.
- Involved in analyzing system failures, identifying root causes, and recommending courses of action.
- Created and deployed a corresponding SolrCloud collection.
- Created collections and configurations, and registered a Lily HBase Indexer configuration with the Lily HBase Indexer Service.
- Created and managed cron jobs.
Environment: Hadoop, MapReduce, Yarn, Hive, HDFS, PIG, Sqoop, Solr, Oozie, Impala, Spark, Hortonworks, Flume, HBase, Zookeeper, Unix, Hue (Beeswax), AWS.
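The Capacity Scheduler queue configuration mentioned above boils down to a handful of `yarn.scheduler.capacity.*` properties; a minimal sketch (the queue names and percentages here are made up for illustration, not the production layout):

```python
def capacity_scheduler_props(queues):
    """Render YARN Capacity Scheduler properties for a set of root queues.

    `queues` maps queue name -> percent capacity; values must sum to 100.
    The queue names and percentages used below are illustrative only.
    """
    if sum(queues.values()) != 100:
        raise ValueError("queue capacities must sum to 100")
    props = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for name, pct in queues.items():
        props[f"yarn.scheduler.capacity.root.{name}.capacity"] = str(pct)
    return props

# Hypothetical split between a default, an ETL, and an ad-hoc queue.
props = capacity_scheduler_props({"default": 60, "etl": 30, "adhoc": 10})
```

These key/value pairs correspond to entries in `capacity-scheduler.xml`, which on an Ambari-managed cluster would be edited through the Ambari UI rather than by hand.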
- Gathered User requirements and designed technical and functional specifications.
- Installed, Configured and Maintained Hadoop clusters for application development and Hadoop tools like Hive, PIG, HBase, Zookeeper and Sqoop.
- Loaded data from different sources (Teradata and DB2) into HDFS using Sqoop and into partitioned Hive tables.
- Worked on creating Hive tables and writing Hive queries for data analysis to meet business requirements; experienced with Sqoop for importing and exporting data from Oracle and MySQL.
- Extended Hive and Pig core functionality with custom User Defined Functions (UDFs), User Defined Table-Generating Functions (UDTFs), and User Defined Aggregating Functions (UDAFs) written in Python.
- Imported and exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Worked on importing and exporting data into HDFS and Hive using Sqoop.
- Used Flume to handle streaming data and loaded the data into Hadoop cluster.
- Developed and executed Hive queries for de-normalizing the data.
- Developed an Apache Storm, Kafka, and HDFS integration project for real-time data analysis.
- Responsible for executing Hive queries using the Hive command line, the Hue web GUI, and Impala to read, write, and query data in HBase.
- Developed Oozie workflows for daily incremental loads that pull data from Teradata and import it into Hive tables.
- Worked on a POC comparing the processing time of Impala with Apache Hive for batch applications, with a view to adopting the former in the project.
- Worked on a cluster of 130 nodes.
- Designed an Apache Airflow entity-resolution module for data ingestion into Microsoft SQL Server.
- Developed a batch processing pipeline to process data using Python and Airflow, and scheduled Spark jobs using Airflow.
- Involved in writing, testing, and running MapReduce pipelines using Apache Crunch.
- Managed and reviewed Hadoop log files, and worked on analyzing SQL scripts and designing the solution for the process using Spark.
- Created reports in Tableau for visualization of the data sets created, and tested native Drill, Impala, and Spark connectors.
- Developed various Python scripts to find vulnerabilities in SQL queries via SQL injection, permission checks, and performance analysis.
- Handled importing data from different data sources into HDFS using Sqoop, performing transformations using Hive and MapReduce, and then loading the data into HDFS.
- Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
- Collected and aggregated large amounts of log data using Flume, staging the data in HDFS for further analysis.
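The Python UDFs for Hive described above are typically plugged in through Hive's `TRANSFORM` clause, which streams tab-separated rows through a script; a minimal sketch (the three-column layout below is a hypothetical example, not the project's actual schema):

```python
def transform_rows(lines):
    """Streaming-style transform suitable for Hive's TRANSFORM clause.

    Each input line is tab-separated (patient_id, medicine, quantity);
    this column layout is an assumed example. Normalizes the medicine
    name and re-emits the row tab-separated.
    """
    for line in lines:
        patient_id, medicine, qty = line.rstrip("\n").split("\t")
        yield f"{patient_id}\t{medicine.strip().upper()}\t{int(qty)}"

# Local check on two sample rows.
rows = list(transform_rows(["p1\t aspirin \t2\n", "p2\tIbuprofen\t1\n"]))
```

From Hive, such a script would be registered with `ADD FILE transform.py;` and invoked via `SELECT TRANSFORM(...) USING 'python transform.py' ...`, with the script reading stdin and printing to stdout instead of taking an in-memory list.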