Senior Big Data Consultant Resume
SUMMARY
- Open source contributor to Apache Sqoop, Apache Kudu, and the StreamSets Data Collector project.
- Strong experience creating real-time data streaming solutions using Apache Spark Core, Spark SQL and DataFrames, Spark Streaming, Apache Storm, and Kafka.
- Experience in building data pipelines using big data technologies.
- Hands-on experience in writing MapReduce programs and user-defined functions for Hive and Pig
- Experience in NoSQL technologies such as HBase and Cassandra.
- Excellent understanding and knowledge of Hadoop (Gen-1 and Gen-2) and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and ResourceManager (YARN).
- Experience in importing and exporting data between HDFS and relational database systems (RDBMS) using Sqoop.
- Proficient at using Spark APIs to cleanse, explore, aggregate, transform, and store machine sensor data (see the sketch at the end of this summary).
- Configured 20-30 node (Amazon EC2 spot instance) Hadoop clusters to transfer data between Amazon S3 and HDFS and to direct input and output to the Hadoop MapReduce framework.
- Hands-on experience with systems-building languages such as Scala and Java.
- Hands-on experience with message brokers such as Apache Kafka and RabbitMQ.
- Worked extensively with dimensional modeling, data migration, data cleansing, data profiling, and ETL processes for data warehouses.
- Implemented Hadoop-based data warehouses and integrated Hadoop with enterprise data warehouse systems.
- Built real-time big data solutions using HBase, handling billions of records.
- Involved in designing the data model in Hive for migrating the ETL process into Hadoop, and wrote Pig scripts to load data into the Hadoop environment.
- Expertise in writing Hive UDFs and generic UDFs to incorporate complex business logic into Hive queries when performing high-level data analysis.
- Worked on the Spark machine learning library for recommendations, coupon recommendations, and a rules engine.
- Experience working with various Cloudera distributions (CDH4/CDH5) and knowledge of Confidential and Amazon EMR Hadoop distributions.
- Experience in administering large-scale Hadoop environments, including design, configuration, installation, performance tuning, and monitoring of the cluster using Cloudera Manager and Ganglia.
- Experience in Object-Oriented Analysis and Design (OOAD) and software development using UML methodology; good knowledge of J2EE and Core Java design patterns.
- Experience in designing both time-driven and data-driven automated workflows using Oozie.
- Experience in writing UNIX shell scripts.
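The Spark sketch referenced above: a minimal DataFrame job that cleanses, aggregates, and stores machine sensor data. The schema, paths, and validity thresholds are hypothetical placeholders, not details from any specific engagement.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SensorAggregationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SensorAggregationSketch").getOrCreate()

    // Hypothetical sensor readings with columns: sensor_id, event_time, temperature
    val readings = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/sensors/readings.csv")

    // Cleanse: drop rows missing key fields and readings outside a plausible range
    val cleansed = readings
      .na.drop(Seq("sensor_id", "temperature"))
      .filter(col("temperature").between(-40, 150))

    // Aggregate: average and peak temperature per sensor
    val perSensor = cleansed
      .groupBy("sensor_id")
      .agg(avg("temperature").as("avg_temp"), max("temperature").as("max_temp"))

    // Store the summary back to HDFS as Parquet
    perSensor.write.mode("overwrite").parquet("hdfs:///data/sensors/summary")

    spark.stop()
  }
}
```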
TECHNICAL SKILLS
Big Data Technologies: Hadoop (Confidential, Cloudera, MapR), Spark, Spark Streaming, Spark SQL, Spark ML, MapReduce, HDFS, Cassandra, Storm, Apache Kafka, StreamSets, Flume, Oozie, Solr, ZooKeeper, Tez, Data Modelling, Pig, Hive, Impala, Drill, Sqoop, and RabbitMQ.
NoSQL Databases: HBase, Cassandra
Query Engines: Hive, Pig, PrestoDB, Impala, Spark SQL
Search: HSearch, Apache Blur, Lucene, Elasticsearch, Nutch
Programming Languages: Java, Scala, Python, basics of Clojure
Cloud Platforms: Amazon Web Services (EC2, Amazon Elastic MapReduce, Amazon S3), Google Cloud Platform (BigQuery, App Engine, Compute Engine, Cloud SQL), Rackspace (CDN, Servers, Storage), Linode Manager
Monitoring and Reporting: Ganglia, Nagios, Custom Shell scripts, Tableau, D3.js, Google Charts
Data: E-Commerce, Social Media, Logs and click events data, Next Generation Genomic Data, Oil & Gas, Healthcare, Travel
Other: HTML, JavaScript, Ext JS, CSS, jQuery
PROFESSIONAL EXPERIENCE
Senior Big Data Consultant
Confidential
Responsibilities:
- Wrote a Spark Core RDD application to read 1 billion auto-generated records and compare it with IgniteRDD in the Yardstick framework, measuring the performance of Apache Ignite RDDs against Apache Spark RDDs (a simplified sketch of the Spark side appears after this list).
- Wrote a Spark DataFrame application to read 10 million Twitter records from HDFS and analyze them, using the Yardstick framework to measure the performance of Apache Ignite SQL against Spark DataFrames.
- Wrote a Spark Streaming application to read streaming Twitter data and analyze the records in real time, using the Yardstick framework to measure the performance of Apache Ignite Streaming against Spark Streaming.
- Implemented test cases for Spark and Ignite functions in Scala.
- Hands-on experience setting up a 10-node Spark cluster on Amazon Web Services using the Spark EC2 script.
- Implemented D3.js and Tableau charts to show the performance difference between Apache Ignite and Apache Spark.
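A simplified sketch of the Spark side of the RDD benchmark described above. The record generator and timing logic stand in for the Yardstick harness and the Ignite comparison, which are not shown; record count and key cardinality are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBenchmarkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddBenchmarkSketch"))

    // Auto-generate records (1 billion in the real run; fewer here for illustration)
    val records = sc.range(0L, 10000000L, numSlices = 200)
      .map(i => (i % 1000, 1L)) // (key, count) pairs

    // Time one representative aggregation, analogous to a single Yardstick probe
    val start = System.nanoTime()
    val distinctKeys = records.reduceByKey(_ + _).count()
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(s"Aggregated $distinctKeys keys in $elapsedMs ms")
    sc.stop()
  }
}
```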
Environment: Spark (Core, DataFrames, Streaming), Scala, HDFS, Apache Ignite, Yardstick, D3.js, Tableau, AWS; 10 million Twitter records and 1 billion auto-generated records.
Senior Big Data Consultant
Confidential
Responsibilities:
- Crawled data from 100+ sites using Nutch.
- Maintained a fashion-based ontology.
- Used Scala, Spark, and its ecosystem to enrich the given data against the fashion ontology, validating and normalizing the data.
- Designed the schema and data model, and wrote the logic to store all validated data in Cassandra using Spring Data Cassandra REST.
- Wrote programs for validation, normalization, and enrichment, plus a REST API backing a UI for manual QA validation; used Spark SQL and Scala to run QA SQL queries.
- Standardized the input merchant data.
- Uploaded images to the Rackspace CDN.
- Indexed the given data sets into HSearch.
- Wrote MapReduce programs over HBase to extract color information, including density, from images.
- Wrote MapReduce programs to persist the extracted data into HBase tables; the above MR jobs run on timed, bucketed schedules.
Color-Obsessed:
- Set up the Spark Streaming and Kafka clusters.
- Developed a Spark Streaming + Kafka app to process Hadoop job logs (see the sketch after this list).
- Wrote a Kafka producer to send logs from all slave nodes to the Spark Streaming app.
- The Spark Streaming app processes the logs against the given rules and produces the bad images, bad records, missed records, etc.
- Built a Spark Streaming app to collect user-action data from the front end.
- Built a Kafka-producer-based REST API to collect user events and send them to the Spark Streaming app.
- Wrote Hive queries to generate stock alerts, price alerts, popular-product alerts, and new-arrival alerts for each user based on like, favorite, and share counts.
- Worked on the Spark ML library for recommendations, coupon recommendations, and a rules engine.
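A minimal sketch of the Spark Streaming + Kafka log-processing pattern referenced above, assuming a receiver-based Kafka stream; the ZooKeeper address, topic name, and "bad image" rule are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LogStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("LogStreamSketch"), Seconds(10))

    // Consume job logs from a hypothetical "hadoop-job-logs" topic via ZooKeeper
    val logLines = KafkaUtils.createStream(
      ssc, "zk-host:2181", "log-processor", Map("hadoop-job-logs" -> 1)
    ).map(_._2)

    // Illustrative rule: lines flagged with a download failure are "bad images"
    val badImages = logLines.filter(_.contains("IMAGE_DOWNLOAD_FAILED"))

    badImages.foreachRDD { rdd =>
      // In the real pipeline these records would be persisted (e.g. to HBase/HDFS)
      rdd.take(10).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```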
Environment: HSearch (HBase + Lucene), Cassandra, Hive, Spark (Core, SQL, ML, Streaming), Hadoop, MapReduce, Amazon Web Services, Linode, CDN, Scala, Java; affiliate feeds from Rakuten, CJ, Affiliate Window, and Webgains.
Senior Big Data Consultant
Confidential
Responsibilities:
- Wrote MapReduce programs to validate the data.
- Wrote more than 50 Spring Data HBase REST APIs in Java.
- Designed the HBase schema and cleaned the data.
- Wrote Hive queries for analytics on user data.
Environment: Hadoop MapReduce, HBase, Spring Data REST web services, CDH; user payment data.
Senior Big Data Consultant
Confidential
Responsibilities:
- Wrote a simulator to emit events based on the NYC DOT data file.
- Wrote a Kafka producer to accept events and send them to Kafka, where they are consumed by a Storm spout.
- Wrote a Storm topology to accept events from the Kafka spout and process them (see the sketch after this list).
- Wrote Storm bolts to emit data into HBase, HDFS, and RabbitMQ Web STOMP.
- Wrote Hive queries to correlate truck event data, weather data, and traffic data.
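A minimal sketch of a Kafka-spout-to-bolt Storm topology of the kind described above, written in Scala; the ZooKeeper host, topic name, and event format are placeholders, and the HBase/HDFS/RabbitMQ writer bolts are omitted.

```scala
import org.apache.storm.{Config, StormSubmitter}
import org.apache.storm.kafka.{KafkaSpout, SpoutConfig, StringScheme, ZkHosts}
import org.apache.storm.spout.SchemeAsMultiScheme
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

// Parses a raw CSV truck event and emits fields for downstream writer bolts.
class ParseEventBolt extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit = {
    val parts = input.getString(0).split(",") // hypothetical "truckId,eventType,..." format
    if (parts.length >= 2) collector.emit(new Values(parts(0), parts(1)))
  }
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("truckId", "eventType"))
}

object TruckEventTopologySketch {
  def main(args: Array[String]): Unit = {
    // Kafka spout reading the "truck-events" topic through ZooKeeper
    val spoutConfig = new SpoutConfig(new ZkHosts("zk-host:2181"), "truck-events", "/truck-events", "event-reader")
    spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme())

    val builder = new TopologyBuilder()
    builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1)
    builder.setBolt("parse-events", new ParseEventBolt(), 2).shuffleGrouping("kafka-spout")

    StormSubmitter.submitTopology("truck-event-topology", new Config(), builder.createTopology())
  }
}
```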
Environment: Hadoop, HDFS, Hive, HBase, Kafka, Storm, RabbitMQ Web STOMP, Google Maps; New York City truck routes from NYC DOT; truck events data generated using a custom simulator; weather data collected using APIs from Forecast.io; traffic data collected using APIs from MapQuest.
Senior Big Data Consultant
Confidential
Responsibilities:
- Set up and configured Hive, Hive on Tez, Impala, Spark SQL, Apache Drill, BigQuery, PrestoDB, Hadoop, Cloudera CDH, and Confidential HDP.
- Designed schemas for the data sets on Hive, Hive on Tez, Impala, Spark SQL, Apache Drill, BigQuery, and PrestoDB.
- Designed queries for the given data set.
- Debugged issues on Hive, Hive on Tez, Impala, Spark SQL, Apache Drill, BigQuery, PrestoDB, Hadoop, Cloudera CDH, and Confidential HDP.
- Compared query times across Hive, Hive on Tez, Impala, Spark SQL, Apache Drill, BigQuery, and PrestoDB (see the timing sketch after this list).
- Compared query times across different cloud platforms.
- Designed web-based visualizations of the timing metrics using Google Charts.
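A minimal sketch of how such per-engine timings can be collected over JDBC; the connection URLs, drivers, and test query are placeholders (BigQuery would be timed through its own client rather than JDBC).

```scala
import java.sql.DriverManager

object QueryTimingSketch {
  // Hypothetical JDBC endpoints for the engines under comparison
  val engines = Map(
    "Hive"     -> "jdbc:hive2://hive-host:10000/default",
    "Impala"   -> "jdbc:hive2://impala-host:21050/;auth=noSasl",
    "PrestoDB" -> "jdbc:presto://presto-host:8080/hive/default"
  )

  // Runs the query once, drains the result set, and returns elapsed milliseconds
  def timeQuery(url: String, sql: String): Long = {
    val conn = DriverManager.getConnection(url)
    try {
      val start = System.nanoTime()
      val rs = conn.createStatement().executeQuery(sql)
      while (rs.next()) {}
      (System.nanoTime() - start) / 1000000
    } finally conn.close()
  }

  def main(args: Array[String]): Unit = {
    val sql = "SELECT category, COUNT(*) FROM events GROUP BY category" // placeholder query
    engines.foreach { case (name, url) => println(s"$name: ${timeQuery(url, sql)} ms") }
  }
}
```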
Environment: Hive, Hive on Tez, Impala, Spark SQL, Apache Drill, BigQuery, PrestoDB, Hadoop, Cloudera
Senior Big Data Consultant
Confidential
Responsibilities:
- Developed a Hadoop MapReduce program to perform sequence alignment on NGS data.
- The MapReduce program implements algorithms such as the Burrows-Wheeler Transform (BWT), the Ferragina-Manzini Index (FMI), and the Smith-Waterman dynamic programming algorithm, using the Hadoop distributed cache (a standalone sketch of Smith-Waterman scoring appears after this list).
- Designed and developed software for bioinformatics and Next-Generation Sequencing (NGS) on the Hadoop MapReduce framework and Cassandra, using Amazon S3, Amazon EC2, and Amazon Elastic MapReduce (EMR).
- Developed a Hadoop MapReduce program to perform custom quality checks on genomic data. Novel features of the program included the ability to handle file-format/sequencing-machine errors, automatic detection of the baseline PHRED score, and being platform-agnostic (Illumina, 454 Roche, Complete Genomics, ABI SOLiD input format data).
- Configured and ran all MapReduce programs on a 20-30 node cluster (Amazon EC2 spot instances) with Apache Hadoop 1.4.0 to handle 600 GB/sample of NGS genomics data.
- Configured a 20-30 node (Amazon EC2 spot instance) Hadoop cluster to transfer data between Amazon S3 and HDFS and to direct input and output to the Hadoop MapReduce framework.
- Successfully ran all Hadoop MapReduce programs on the Amazon Elastic MapReduce framework, using Amazon S3 for input and output.
- Developed Java RESTful web services to upload data from local storage to Amazon S3, list S3 objects, and perform file-manipulation operations.
- Developed MapReduce programs to perform quality checks, sequence alignment, SNP calling, and SV/CNV detection on single-end/paired-end NGS data.
- Designed and migrated an RDBMS (SQL) database to a NoSQL Cassandra database.
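Standalone sketch of the Smith-Waterman local-alignment scoring referenced above; the match/mismatch/gap parameters are illustrative, and the production version ran inside MapReduce with reference data in the distributed cache.

```scala
object SmithWatermanSketch {
  // Returns the best local-alignment score between two sequences using the
  // Smith-Waterman dynamic programming recurrence (scoring values are illustrative).
  def score(a: String, b: String, matchScore: Int = 2, mismatch: Int = -1, gap: Int = -1): Int = {
    val h = Array.ofDim[Int](a.length + 1, b.length + 1) // DP matrix, initialized to 0
    var best = 0
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val diag = h(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) matchScore else mismatch)
      h(i)(j) = Seq(0, diag, h(i - 1)(j) + gap, h(i)(j - 1) + gap).max
      if (h(i)(j) > best) best = h(i)(j)
    }
    best
  }

  def main(args: Array[String]): Unit = {
    // Example: score a short read against a reference fragment
    println(score("ACACACTA", "AGCACACA"))
  }
}
```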