- Overall, 6+ years of professional IT experience in analysis, design, development, deployment, and maintenance of critical software and big data applications.
- 4+ years of hands-on experience across the Hadoop ecosystem, including extensive experience with Big Data technologies like MapReduce, YARN, HDFS, Apache Cassandra, HBase, Oozie, Hive, Sqoop, Pig, ZooKeeper, and Flume.
- In-depth knowledge of HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce programming.
- Expertise in converting MapReduce programs into Spark transformations using Spark RDDs.
- Expertise in Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, and Spark MLlib.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Scala.
- Experience in implementing real-time event processing and analytics using stream-processing systems like Spark Streaming.
- Experience in using Kafka and Kafka brokers to initiate the Spark context and process live streaming data with RDDs.
- Good knowledge of AWS services like EMR and EC2, which provide fast and efficient processing of Big Data.
- Experience with various flavors of Hadoop distributions, including Cloudera, Hortonworks, MapR, and Apache.
- Experience in installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera (5.x) distributions and on Amazon Web Services (AWS).
- Expertise in implementing Spark applications in Scala using higher-order functions for both batch and interactive analysis requirements.
- Extensive experience working with Spark tools like RDD transformations, Spark MLlib, and Spark SQL.
- Hands-on experience in writing Hadoop jobs for analyzing data using HiveQL (queries), Pig Latin (data flow language), and custom MapReduce programs in Java.
- Experienced in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing, and internal/external tables.
- Extensive experience in collecting and storing stream data like log data in HDFS using Apache Flume.
- Experienced in using Pig scripts to perform transformations, event joins, filters, and some pre-aggregations before storing the data in HDFS.
- Experience in creating custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HiveQL (HQL).
- Good experience with NoSQL databases like HBase, MongoDB, and Cassandra.
- Experience in using Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
- Hands-on experience in querying and analyzing data from Cassandra for quick searching, sorting, and grouping through CQL.
- Experience working with MongoDB for distributed storage and processing.
- Good knowledge of and experience in extracting files from MongoDB through Sqoop, placing them in HDFS, and processing them.
- Worked on importing data into HBase using the HBase Shell and HBase Client API.
- Experience in designing and developing tables in HBase and storing aggregated data from Hive tables.
- Good knowledge of scheduling jobs in Hadoop using the FIFO, Fair, and Capacity schedulers.
- Experienced in designing both time-driven and data-driven automated workflows using Oozie and ZooKeeper.
- Experience working with Solr to develop a search engine over unstructured data in HDFS.
- Extensively used Solr indexing to enable searching on non-primary-key columns in Cassandra keyspaces.
- Experience in writing stored procedures and complex SQL queries using relational databases like Oracle, SQL Server, and MySQL.
- Experience in Extraction, Transformation and Loading (ETL) of data from multiple sources like flat files, XML files, and databases.
- Supported various reporting teams; experienced with the data visualization tool Tableau.
- Implemented data quality in the ETL tool Talend; good knowledge of data warehousing and ETL tools like IBM DataStage, Informatica, and Talend.
- Experience with and in-depth knowledge of cloud integration with AWS using Elastic MapReduce (EMR), Simple Storage Service (S3), EC2, and Redshift, as well as Microsoft Azure.
- Detailed understanding of the Software Development Life Cycle (SDLC) and strong knowledge of project implementation methodologies like Waterfall and Agile.
Languages: C, C++, Python, R, PL/SQL, Java, HiveQL, Pig Latin, Scala, UNIX shell scripting.
Hadoop Ecosystem: HDFS, YARN, Scala, MapReduce, Hive, Pig, ZooKeeper, Sqoop, Oozie, Bedrock, Flume, Kafka, Impala, NiFi, MongoDB, HBase.
Databases: Oracle, MS-SQL Server, MySQL, PostgreSQL, NoSQL (HBase, Cassandra, MongoDB), Teradata.
Tools: Eclipse, NetBeans, Informatica, IBM DataStage, Talend, Maven, Jenkins.
Hadoop Platforms: Hortonworks, Cloudera, Azure, Amazon Web Services (AWS).
Operating Systems: Windows XP/2000/NT, Linux, UNIX.
Amazon Web Services: Redshift, EMR, EC2, S3, RDS, CloudSearch, Data Pipeline, Lambda.
Version Control: GitHub, SVN, CVS.
Packages: MS Office Suite, MS Visio, MS Project Professional.
- Involved in loading data from UNIX file system to HDFS using Shell Scripting.
- Hands-on experience with Linux shell scripting.
- Imported and exported data between an Oracle database and HDFS using NiFi.
- Used Apache NiFi to copy data from the local file system to HDFS.
- Worked on a NiFi data pipeline to process large data sets and configured lookups for data validation and integrity.
- Worked with different file formats like JSON, Avro, and Parquet.
- Experienced in using Hue and Apache Ambari to manage and monitor Hadoop clusters.
- Experienced in using version control systems like SVN and Git, the build tool Maven, and the continuous integration tool Jenkins.
- Good experience in using the relational databases Oracle, SQL Server, and PostgreSQL.
- Worked with Agile, Scrum, and Confidential software development frameworks for managing product development.
- Used Ambari to monitor node health and job status in Hadoop clusters.
- Implemented Kerberos for strong authentication to provide data security.
- Involved in creating Hive tables and in loading and analyzing data using Hive queries.
- Experience in creating dashboards and generating reports using Tableau by connecting to tables in Hive and HBase.
- Created Sqoop jobs to load data from relational databases into Hive tables.
- Experience in importing and exporting data using Sqoop from HDFS/Hive/HBase to relational database systems and vice versa. Skilled in data migration and data generation in the Big Data ecosystem.
- Experienced in building highly scalable Big Data solutions using Hadoop distributions (Cloudera, Hortonworks) and NoSQL platforms (HBase).
- Implemented Big Data batch processes using Hadoop, MapReduce, YARN, Pig, and Hive.
- Developed an Oozie workflow to orchestrate a series of Pig scripts that cleanse data in the data preparation stage, for example by removing personal information or merging many small files into a handful of very large, compressed files.
- Captured log data from web servers and Elasticsearch into HDFS using Flume for analysis.
- Managed and reviewed Hadoop log files.
- Hands-on experience with Confidential Big Data product offerings such as Confidential InfoSphere BigInsights, Confidential InfoSphere Streams, and Confidential BigSQL.
- Loaded and transformed large sets of structured and semi-structured data using Hive and Impala with Elasticsearch.
Environment: Hadoop, HDFS, Pig, Hive, Sqoop, Oozie, Cloudera, ZooKeeper, Oracle, Shell Scripting, NiFi, Unix, Linux, BigSQL.
Confidential, Houston, Texas
- Responsible for building scalable distributed data solutions using Hadoop.
- Created Kafka producers and consumers for Spark Streaming to receive data from the patients' different learning systems.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Scala.
- Used Spark Streaming to divide streaming data into batches as input to the Spark engine for batch processing.
- Evaluated the performance of Apache Spark in analyzing genomic data.
- Performed advanced procedures like text analytics and processing using the in-memory computing capabilities of Spark.
- Implemented Spark RDD transformations to map business analysis logic and applied actions on top of the transformations.
- Wrote a Storm topology to accept events from a Kafka producer and emit them into Cassandra.
- Created a POC using the Spark SQL and MLlib libraries.
- Experienced in managing and reviewing Hadoop log files.
- Worked closely with EC2 infrastructure and Trifacta to troubleshoot complex issues.
- Worked with the AWS cloud and created EMR clusters with Spark to process and analyze raw data accessed from S3 buckets.
- Involved in installing EMR clusters on AWS.
- Used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
- Designed the NiFi/HBase pipeline to collect the processed customer data into HBase tables.
- Applied transformation rules on top of DataFrames.
- Worked with different file formats like TEXTFILE, TRIFACTA, AVROFILE, ORC, and PARQUET for Hive querying and processing.
- Developed Hive UDFs and UDAFs for rating aggregation.
- Developed a Java client API for CRUD and analytical operations by building a RESTful server and exposing data from NoSQL databases like Cassandra over REST.
- Created Hive tables and was involved in data loading and writing Hive UDFs.
- Worked extensively with Sqoop to move data from DB2 and Teradata to HDFS.
- Collected log data from web servers and integrated it into HDFS using Kafka.
- Provided ad-hoc queries and data metrics to business users using Hive and Impala.
- Worked on various performance optimizations, such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
- Implemented row-level updates and real-time analytics using CQL on Cassandra data.
- Used Cassandra (CQL) with Java APIs to retrieve data from Cassandra tables.
- Worked on analyzing and examining customer behavioral data using Cassandra.
- Worked on Solr configuration and customizations based on requirements.
- Indexed documents using Apache Solr.
- Extensively used ZooKeeper as a job scheduler for Spark jobs.
- Worked with BI teams to generate reports in Tableau.
- Used JIRA for bug tracking and CVS for version control.
- Met with business/user groups to understand the requirements for the new Data Lake project.
- Worked in Agile iterative sessions to create a Hadoop Data Lake for the client.
- Defined the reference architecture for Big Data Hadoop to maintain structured and unstructured data within the enterprise.
- Led the efforts to develop and deliver the data architecture plan and data models for the multiple data warehouses and data marts attached to the Data Lake project.
Environment: Hadoop, MapReduce, HDFS, Pig, Hive, Data Robot, Sqoop, Oozie, Storm, Kafka, Spark, Spark Streaming, Scala, Cassandra, Cloudera, ZooKeeper, AWS, Solr, MySQL, Shell Scripting, Java, Tableau.
Confidential, Stamford, CT
- Developed solutions to ingest data into HDFS (Hadoop Distributed File System), process it within Hadoop, and emit summary results from Hadoop to downstream systems.
- Developed a wrapper script around the Teradata Connector for Hadoop (TDCH) to support option parameters.
- Used Sqoop extensively to ingest data from various source systems into HDFS.
- Used Hive to quickly produce results for requested reports.
- Played a major role in working with the team to leverage Sqoop for extracting data from Teradata.
- Imported data from different relational data sources like Oracle and Teradata into HDFS using Sqoop.
- Integrated HiveServer2 with Tableau using the Hortonworks Hive ODBC driver to auto-generate Hive queries for non-technical business users.
- Integrated data from multiple sources (SQL Server, DB2, Teradata) into the Hadoop cluster and analyzed it through Hive-HBase integration.
- Involved in Hive-HBase integration by creating Hive external tables with HBase specified as the storage format.
- Implemented data validation using MapReduce programs to remove unnecessary records before moving data into Hive tables.
- Developed Pig UDFs for needed functionality, such as a custom Pig loader known as the timestamp loader.
- Involved in writing optimized Pig scripts and in developing and testing Pig Latin scripts.
- Worked on custom Pig loaders and storage classes to work with a variety of data formats such as JSON and compressed CSV.
- Used Oozie to automate the flow of jobs and ZooKeeper for coordination in the cluster.
- Involved in moving log files generated from various sources to HDFS for further processing through Flume.
- Worked on different file formats like text files, Parquet, SequenceFiles, Avro, and Record Columnar (RC) files.
- Developed several shell scripts that act as wrappers to start these Hadoop jobs and set the configuration parameters.
- Implemented Kerberos security to safeguard the cluster.
- Worked on both stand-alone and distributed Hadoop applications.
- Tested the performance of the data sets on various NoSQL databases.
- Analyzed complex data structures of different types (structured, semi-structured) and denormalized them for storage in Hadoop.
Environment: Hadoop, HDFS, Pig, Flume, Hive, MapReduce, Sqoop, Oozie, ZooKeeper, HBase, Java, Eclipse, SQL Server, Shell Scripting.
Confidential, Minneapolis, MN
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Migrated existing SQL queries to HiveQL queries as part of the move to a big data analytics platform.
- Integrated the Cassandra File System with Hadoop using MapReduce to perform analytics on Cassandra data.
- Installed and configured Cassandra DSE multi-node, multi-data center cluster.
- Designed and implemented a 24-node Cassandra cluster for a single-point inventory application.
- Analyzed the performance of the Cassandra cluster using nodetool tpstats and cfstats for thread analysis and latency analysis.
- Implemented real-time analytics on Cassandra data using the Thrift API.
- Responsible for managing data coming from different sources.
- Supported MapReduce programs running on the cluster.
- Involved in loading data from the UNIX file system to HDFS.
- Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery, capacity planning, and slot configuration.
- Loaded and transformed large data sets into HDFS using hadoop fs commands.
- Scheduled the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
- Implemented UDFs and UDAFs in Java and Python for Hive to handle processing that Hive's built-in functions cannot perform.
- Performed various performance optimizations, such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Worked on importing and exporting data between Oracle/DB2 and HDFS/Hive using Sqoop for analysis, visualization, and report generation.
- Involved in writing optimized Pig scripts and in developing and testing Pig Latin scripts.
- Supported setting up and updating configurations for implementing scripts with Pig and Sqoop.
- Designed the logical and physical data models and wrote DML scripts for an Oracle 9i database.
- Used Hibernate ORM framework with Spring framework for data persistence.
- Wrote test cases in JUnit for unit testing of classes.