- Apache Hadoop and Big Data Consultant with 6+ years of industry experience, including more than 2 years of experience in the Hadoop ecosystem.
- Background includes extensive Hadoop consulting experience, automation tools, multi-node Hadoop cluster installation, upgrades, management and troubleshooting.
- Exposure to Apache Spark, Apache Kafka and other Apache open source tools.
- Proficient in designing complete end-to-end Hadoop infrastructure solutions, from requirements gathering and analysis through proof-of-concept implementation, production deployment and data analysis.
- Proficient in installing Cloudera and HortonWorks tools for Big Data analysis in a production cluster, with a good understanding of the Hadoop ecosystem.
- Experience in planning and executing a clean upgrade process within HortonWorks and Cloudera platforms.
- Extensive experience in capacity planning, performance tuning and optimizing the Hadoop environment.
- Enabling and managing various components in the Hadoop ecosystem, such as HDFS, YARN, MapReduce, Hive, Pig, Sqoop, Oozie, Sentry, Impala, Spark, HUE and ZooKeeper.
- Proficient with both MRv1 and MRv2 (YARN) framework configuration, management and troubleshooting.
- Implemented quotas and Access Control Lists on job queues in the Hadoop cluster.
- Hands-on experience with AD (Active Directory), Kerberos and other security tools.
- Proficient in implementing both the Fair Scheduler and the Capacity Scheduler on the cluster as required for maximum cluster utilization.
- Proficient in troubleshooting user-submitted jobs and providing feedback to cluster users for job optimization and maximizing cluster utilization.
- Experience installing and managing Hadoop in a public cloud environment - Amazon Web Services (AWS).
- Using Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
- Extensive experience in Hive and Pig for analyzing data in HDFS. Writing custom functions to load data with complex schemas into HDFS. Optimizing Hive and Pig queries to leverage the parallelization of the MapReduce framework.
- Experience in providing ad-hoc queries and data metrics on large data sets using Hive and Pig and proficient in writing user-defined functions (Eval, Filter, Load and Store) and macros.
- Implemented Flume agents for collecting, aggregating and moving large amounts of server logs and streaming data to HDFS.
- Familiar with writing Oozie workflows and Job Controllers for job automation.
- Experience with Oozie bundles and coordinators to run jobs that meet Service Level Agreements (SLAs).
- Proficient in programming with Resilient Distributed Datasets (RDDs).
- Experience with Spark Streaming, Spark SQL, MLlib, GraphX and integrating Spark with HDFS, Cassandra, S3 and HBase.
- Experience in tuning and debugging running Spark applications.
- Experience integrating Kafka with Spark for real-time data processing.
- Experience using Kafka to build messaging systems for data processing pipelines.
- Developed custom Kafka producers and consumers for publishing and subscribing to Kafka brokers.
- Experience integrating Spark Streaming and Kafka with the Hadoop cluster and other Hadoop ecosystem components.
- Expertise in maintaining a cluster infrastructure using Puppet.
- Experience in writing Puppet modules for Hadoop ecosystem tools and other application management.
- Well experienced in building DHCP, PXE (with Kickstart), DNS and NFS servers and using them to build infrastructure in a Linux environment.
- Experienced in Linux Administration tasks like IP Management (IP Addressing, Subnetting, Ethernet Bonding, Static IP).
- Experienced in writing advanced shell scripts.
- Experience in managing LDAP servers.
- Experience in working with AWS services like EC2, S3, Redshift, RDS.
- Proposed and Implemented a solution to connect Hive to AWS Redshift to perform ETL operations.
- Experience in deploying a CDH cluster in an Amazon VPC, exposing only the necessary endpoints to users.
Big Data Ecosystem: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, Oozie, Flume, ZooKeeper, Spark, Kafka, HBase
Hadoop management tools: Cloudera Manager, Apache Ambari, Ganglia, Nagios
Databases: MySQL, SQL Server
Scripting languages: Shell Scripting, Python, Ruby
Software Development Tools: Eclipse, IntelliJ
Operating Systems: Windows, Linux (Red Hat, CentOS), Mac OS X
Build Tools: Maven, SBT, Gradle
AWS services: EC2, S3, Redshift, RDS
- Currently working as a Hadoop administrator on the Cloudera distribution with 5 clusters, including POC and PROD clusters.
- Responsible for cluster maintenance, commissioning and decommissioning DataNodes, cluster monitoring, troubleshooting, and managing and reviewing data backups and Hadoop log files.
- Worked on Hive and its components, troubleshooting any Hive issues that arose.
- Responsible for cluster availability and experienced in on-call support.
- Configured the Failover Controller to promote the standby NameNode to active when the active NameNode goes down.
- Experienced in production support, resolving user incidents ranging from sev1 to sev5.
- Performed file system checks (fsck) from time to time to check for over-replicated, under-replicated, mis-replicated and corrupt blocks and missing replicas.
- Used Cloudera Manager to manage and monitor the Cluster performance.
- Worked on developing rapid deployment scripts for deploying Hadoop ecosystems using Puppet. These scripts installed Hadoop, Hive, Pig, Oozie, Flume, ZooKeeper and other Hadoop ecosystem components, along with monitoring components for Nagios and Ganglia.
- Wrote shell scripts to dynamically scale the Hadoop data nodes up or down on Rackspace infrastructure using APIs.
- Implemented High Availability for the NameNode in the CDH5 cluster.
- Upgraded the Hadoop cluster from CDH4.7 to CDH5.2.
- Monitored and provided support for development and production clusters.
- Integrated Kerberos security into the CDH cluster.
- Implemented performance optimizations for Hadoop ecosystem components like Hive, Pig, Sqoop and Oozie with respect to server infrastructure.
- Ran benchmarking tools, identified bottlenecks and tuned the cluster to improve performance.
- Implemented High Availability for the ResourceManager in the CDH5 cluster.
- Implemented schedulers on the JobTracker to share cluster resources among the MapReduce jobs submitted by users.
- Provided L1 and L2 support for the internal team.
- Extensively involved in cluster capacity planning, hardware planning, installation and performance tuning of the Hadoop cluster.
- Benchmarked the cluster using TeraSort and TestDFSIO, and tuned Hadoop configuration parameters and the Java Virtual Machine (JVM).
- Developed data pipelines with combinations of Hive, Pig and Sqoop jobs scheduled with Oozie.
- Worked on transferring data between databases and HDFS using Sqoop.
- Worked with Hive data warehouse to analyze the historic data in HDFS to identify issues and behavioral patterns.
- Created internal and external Hive tables as required, using static and dynamic partitions to improve efficiency.
- Implemented UDFs in Java for Hive to perform processing that the built-in Hive functions cannot handle.
- Worked with the Avro, RegEx and JSON serializers and deserializers packaged with Hive to parse the content of streamed log data, and implemented custom Hive UDFs.
- Worked with Pig scripts for advanced analytics.
- Developed Pig UDFs in Java for custom data, at various levels of optimization.
- Worked with Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs.
- Designed and implemented custom Writables, custom input formats, custom partitioners and custom comparators.
- Performed Installation and Management of Solr.
- Deployed an Apache Solr search engine which was used to index and query data stored in XML documents.
- Wrote an XML and JSON parsing library used to parse various documents and generate configuration files for Solr.
- Wrote custom shell scripts to run incremental indexing to reduce index time for Solr.
- Built a Custom Request Handler and multiple Search Components in Solr for query analysis and searching multiple indexes.
- Improved the relevancy of the search results by customizing Lucene's scoring model for the Solr instance.
- Experience in developing custom query parsers and embedding them in Solr.
- Experience in using Stanford NLP to extract entities and parts of speech from various forms of sentences.
- Wrote a regular expression library to parse Wikipedia data and extract information.
- Provided support for various HTML clients written by others to connect to Solr.
- Responsible for building a Linux bare metal server-provisioning infrastructure and maintaining the Linux servers.
- Monitoring System Metrics and logs for any issues. Resolution of internal issues faced by the users.
- Running cron jobs to back up data. Using Java JDBC to load data into MySQL.
- Adding, removing, or updating user account information, resetting passwords, etc.
- Maintaining the MySQL server and granting database access to required users.
- Creating and managing logical volumes.