- 8 years of professional experience across the full Software Development Life Cycle (SDLC) under Agile methodology, spanning analysis, design, development, testing, implementation, and maintenance in Hadoop, data warehousing, Linux, and Java.
- More than 5 years of experience delivering highly scalable Big Data solutions using Hadoop 2.x, HDFS, MapReduce 2, YARN, Kafka, Pig, Hive, Sqoop, HBase, Cloudera Manager, ZooKeeper, Oozie, and Hue.
- Hands-on experience installing and configuring Amazon EMR, Cloudera (CDH3, CDH4, and CDH5), and Hortonworks Hadoop distributions.
- Excellent understanding of Big Data, Hadoop architecture, and NoSQL, including components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, and the MapReduce 2/YARN programming paradigm.
- Good understanding of NoSQL databases such as MongoDB, HBase, and Cassandra, as well as Amazon Redshift.
- Hands-on experience delivering real-time data streaming solutions by building ETL pipelines with Apache Spark/Spark Streaming, Apache Storm, Kafka, Flume, and HDFS.
- Extensive experience with Spark modules, including core transformations, MLlib, Spark Streaming, and Spark SQL.
- Experience importing and exporting data between HDFS/Hive/HBase and relational database systems (RDBMS) using Sqoop (structured data) and Flume (log files and XML).
- Integrated relational sources such as Oracle, Teradata, MySQL, Sybase, SQL Server, and MS Access, along with non-relational sources such as flat files, into the staging area.
- Good experience writing custom UDFs to extend Pig scripts and Hive queries, incorporating complex business logic for high-level data analysis.
- Experience with Amazon AWS cloud services (EC2, EBS, S3) and with data migration between platforms, such as SQL Server to S3.
- Worked with Oozie for job workflow management and ZooKeeper for job coordination in the cluster.
- Hands-on experience with BI tools such as Splunk/Hunk and Tableau for data visualization.
- Worked on optimizing MapReduce code and Pig scripts, user interface analysis, and performance tuning.
- Experience assessing Hadoop security requirements and integrating with Kerberos authentication and authorization infrastructure.
- Worked extensively with dimensional modeling, data migration, data cleansing, data profiling, and ETL processes for data warehouses.
- Good working experience with Agile/Scrum methodologies, including daily scrum calls and technical discussions with clients covering project analysis, specifications, and development.
- Able to work independently or as part of a team, and to communicate effectively with customers, peers, and management at all levels inside and outside the organization.
Languages/Scripting: Java/J2EE, Scala, Python, C++, Pig Latin, HiveQL, SQL, PL/SQL, Linux shell scripts
Big Data Framework/Stack: Hadoop HDFS, MapReduce, YARN, Hive, Pig, Hue, Impala, Sqoop, HBase, Spark, Oozie, ZooKeeper, Drill, Solr
Hadoop Distributions: Apache, Cloudera CDH5, Hortonworks
Fast Data Technologies: Kafka, Flume, Apache Spark, Storm
RDBMS: Oracle, DB2, SQL Server, MySQL, Sybase, MS Access
NoSQL Databases: HBase, MarkLogic, Cassandra, MongoDB
Software Methodologies: SDLC (Waterfall, Agile/Scrum)
Operating Systems: Windows XP/NT/7/8, UNIX, LINUX, Mac
Java Technologies: Hibernate, JDBC, ORM, JNDI, JSP, JSON, XML, HTML, Web Services, Spring, Struts
File Formats: XML, Text, SequenceFile, RCFile, JSON, ORC, Avro, Parquet
Amazon Web Services: EMR, EC2, EBS, S3, Redshift, Elastic Beanstalk, CloudFront, Virtual Private Cloud (VPC)
- Responsible for building scalable distributed data solutions using Hadoop.
- Developed a data pipeline integrating Kafka and Flume to collect and aggregate data from different sources and push it to HDFS.
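As an illustration of the Kafka side of such a pipeline, below is a minimal Java producer sketch; the broker address, topic name, and payload are hypothetical, and the Flume leg that drains the topic to HDFS is configured separately.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch: publish source records to a Kafka topic that a
// Flume Kafka source later delivers to HDFS. Names are illustrative.
public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // wait for full replication before acking

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("ingest-events", "source-1", "payload"));
        }
    }
}
```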
- Configured Spark Streaming to ingest sensor data from Kafka into HDFS for near-real-time analytics using Scala/Python.
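A minimal sketch of that consuming side, written against the Spark Streaming Java API for consistency with the other examples here (the original work used Scala/Python); the broker, topic, and HDFS path are assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

// Minimal sketch: read sensor events from Kafka in 10-second micro-batches
// and persist each batch to HDFS for downstream analytics.
public class SensorIngest {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("SensorIngest");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "sensor-ingest");
        kafkaParams.put("auto.offset.reset", "latest");

        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Collections.singletonList("sensor-events"), kafkaParams)); // hypothetical topic

        // Write each micro-batch out as time-suffixed text files on HDFS.
        stream.map(ConsumerRecord::value)
              .dstream()
              .saveAsTextFiles("hdfs:///data/sensors/batch", "txt"); // hypothetical path

        jssc.start();
        jssc.awaitTermination();
    }
}
```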
- Performed real-time analytics on roughly 5 TB of Call Detail Records (CDRs), ingested onto HDFS with Apache Flume and processed with Spark to identify patterns behind network drops.
- Extracted data from web servers onto HDFS using Flume.
- Performed product analytics specific to local geographies and customer segments to gain better insights.
- Helped network administrators allocate bandwidth in real time by identifying spikes in call-center data.
- Good experience with Hive partitioning and bucketing, performing different types of joins on Hive tables, and implementing Hive SerDes such as RegexSerDe, JSON, and Avro.
- Worked with Hive file formats including RCFile, SequenceFile, ORC, and Parquet.
- Developed Hive user-defined functions (UDFs) in Java, compiled them into JARs, added the JARs to HDFS, and invoked the functions from Hive queries for data validation and processing.
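A minimal sketch of such a Hive UDF; the class name and normalization logic are illustrative, not the original project code.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Minimal sketch of a Hive UDF: normalizes a phone-number column so that
// downstream joins compare cleanly. Names are illustrative.
public final class NormalizePhone extends UDF {
    public Text evaluate(Text input) {
        if (input == null) return null;
        // Strip everything except digits.
        String digits = input.toString().replaceAll("[^0-9]", "");
        return digits.isEmpty() ? null : new Text(digits);
    }
}
```

Once the JAR is on HDFS, the function is registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION before being used in queries.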
- Developed MapReduce applications in Java and Python using the Hadoop MapReduce programming framework.
- Implemented different join strategies, such as map-side and reduce-side joins, to integrate data from different data sets.
- Developed code for importing and exporting data into HDFS and Hive using Sqoop.
- Loaded and transformed large sets of semi-structured data using Pig Latin operations.
- Imported data from open data sources into Amazon S3 and pre-processed large data sets in parallel across the Hadoop cluster.
- Defined Oozie job flows to schedule and manage Apache Hadoop jobs as directed acyclic graphs (DAGs) of actions with control flows.
- Implemented a variety of compression techniques, such as LZO and Snappy, to save storage and optimize data transfer over the network.
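A minimal Java sketch of how Snappy compression can be wired into a MapReduce job, both for intermediate map output (shuffle traffic) and for final output; the job name is illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Minimal sketch: enable Snappy on both map output and job output.
public class CompressionConfig {
    public static Job configure(Configuration conf) throws Exception {
        // Compress intermediate map output to cut shuffle traffic.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output"); // illustrative name
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        // Compress final output as block-compressed SequenceFiles.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
        return job;
    }
}
```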
- Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slot configuration.
- Assisted in monitoring the Hadoop cluster using Ganglia.
- Implemented test scripts to support test-driven development and continuous integration.
- Exported analyzed data to relational databases using Sqoop for visualization and report generation (Tableau, Splunk) for the BI team.
- Generated final reporting data in Tableau for testing by connecting to the corresponding Hive tables via the Hive ODBC connector.
- Moved crawl-data flat files generated by various retailers into HDFS for further processing.
- Imported/exported data between Teradata and HDFS using Sqoop.
- Optimized Pig scripts and Hive queries to increase efficiency, and added new features to existing code.
- Wrote MapReduce code that parses raw log files and structures them into tabular format to facilitate effective querying of the log data.
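A minimal sketch of such a map-only parsing job; the three-field log layout (timestamp, user, URL) and class names are assumptions for illustration.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal sketch: parse raw access-log lines into tab-separated rows
// (hypothetical layout: timestamp, user, URL) ready for Hive querying.
public class LogParseJob {

    public static class ParseMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\\s+");
            if (parts.length < 3) return; // skip malformed lines
            String row = parts[0] + "\t" + parts[1] + "\t" + parts[2];
            ctx.write(new Text(row), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log-parse");
        job.setJarByClass(LogParseJob.class);
        job.setMapperClass(ParseMapper.class);
        job.setNumReduceTasks(0); // map-only: parsing needs no aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```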
- Created external Hive tables on top of the parsed data.
- Developed, monitored, and optimized MapReduce jobs for data cleaning and preprocessing.
- Developed Sqoop scripts to enable interaction between Pig and the MySQL database.
- Extracted feeds from social media sites such as Facebook and Twitter using Python scripts.
- Worked with Hadoop administrator in rebalancing blocks and decommissioning nodes in the cluster.
- Implemented Hibernate for O/R mapping and persistence.
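A minimal sketch of Hibernate O/R mapping via JPA annotations; the entity, table, and column names are illustrative rather than the original domain model.

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Table;

// Minimal sketch of an entity mapped to a table; Hibernate persists and
// loads instances of this class without hand-written SQL.
@Entity
@Table(name = "orders") // illustrative table name
public class Order {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "customer_name", nullable = false)
    private String customerName;

    // Hibernate requires a no-arg constructor for instantiation.
    protected Order() {}

    public Order(String customerName) { this.customerName = customerName; }

    public Long getId() { return id; }
    public String getCustomerName() { return customerName; }
}
```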
- Involved in requirements gathering, design, development, and testing.
- Wrote script files for processing data and loading it into HDFS.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Developed UNIX shell scripts to create reports from Hive data.
- Created two separate users: hduser for HDFS operations and mapred for MapReduce operations only.
- Managed and reviewed Hadoop log files.
- Set up Hive with MySQL as a remote metastore.
- Generated aggregations, groupings, and visualizations using Tableau.
- Moved log/text files generated by various products into HDFS.
- Wrote MapReduce jobs using the Java API.
- Wrote shell scripts to monitor the health of Hadoop daemon services and respond to any warning or failure conditions.
- Managed and scheduled jobs on the Hadoop cluster.
- Deployed Hadoop clusters in different modes: standalone, pseudo-distributed, and fully distributed.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Developed scripts and batch jobs to schedule various Hadoop programs.
- Installed and maintained Apache Hadoop clusters for application development, along with Hadoop tools such as Hive, Pig, HBase, and Sqoop.
- Installed and configured Pig and wrote Pig Latin scripts.
- Developed Pig UDFs to pre-process data for analysis.
- Developed Hive queries for analysts and for data analysis to meet business requirements.
- Developed Oozie workflows to automate loading data into HDFS and pre-processing it with Pig.
- Implemented the Fair Scheduler on the JobTracker to share cluster resources among users' MapReduce jobs.
- Took part in monitoring, troubleshooting, and managing Hadoop log files.