- Hadoop/Spark developer with 8+ years of experience in software development, including 4+ years with Big Data and Hadoop ecosystem components and 4 years in Java development.
- Solid knowledge of and hands-on experience with Apache Hadoop ecosystem components such as HDFS, MapReduce, HBase, Pig, Hive (HiveQL), Sqoop, Oozie, Cassandra, Flume, and Apache Spark.
- Currently working extensively with Spark and Spark Streaming, using Scala as the main programming language.
- Experienced in working with Spark Streaming, Spark SQL, and Kafka for real-time data processing.
- Extensive experience working with enterprise Hadoop distributions, including Cloudera (CDH4/CDH5) and Hortonworks, and good knowledge of Amazon EMR (Elastic MapReduce).
- Designed and implemented complete end-to-end Hadoop infrastructure, including Pig, Hive, Sqoop, Oozie, Flume, and ZooKeeper.
- Hands-on expertise in row key and schema design with NoSQL databases such as MongoDB 3.0.1, HBase, Cassandra, and DynamoDB (AWS).
- Designed and created Hive external tables using a shared metastore instead of Derby, with partitioning, dynamic partitioning, and bucketing.
- Extensively used Spark DataFrames, Spark SQL, and the Spark RDD API for data transformations and dataset building.
- Experience in importing and exporting data using Sqoop from Relational Databases to HDFS and vice-versa.
- Exposure to data lake implementation using Apache Spark, including developing data pipelines.
- Worked extensively with Spark using Scala on a cluster for computational analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle.
- Extensive experience in importing and exporting streaming data into HDFS using stream processing platforms like Flume and Kafka messaging system.
- Strong experience in and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume.
- Experience in developing data pipelines using Sqoop and Flume to extract data from weblogs and store it in HDFS. Accomplished in developing Pig Latin scripts and using Hive Query Language for data analytics.
- Working knowledge of Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as a storage mechanism.
- Experienced in migrating data from various sources using the publish-subscribe model in Apache Kafka with Kafka producers and consumers, and in preprocessing data using Storm topologies.
- Developed customized UDFs and UDAFs in Java to extend Pig and Hive core functionality.
- Excellent understanding and knowledge of NoSQL databases like HBase and Cassandra.
- Experienced in developing Cassandra data models and administering the Cassandra Hadoop cluster along with Pig and Hive.
- Expertise in designing column families in Cassandra and writing CQL queries to analyze data from Cassandra tables.
- Experience in using DataStax Spark-Cassandra connectors to get data from Cassandra tables and process them using Apache Spark.
- Expert knowledge in setting up MongoDB clusters and handling service requests for MongoDB.
- Monitored document growth and estimated storage size for large MongoDB clusters.
- Experienced in creating HBase tables and column families to store user event data, and in writing automated HBase test cases for data quality checks using HBase command-line tools.
- Experienced in developing end-to-end data processing pipelines, from receiving data through distributed messaging systems such as Kafka to persisting it in HBase.
- Experienced working with Apache NiFi for building and automating dataflow from data source to HDFS and HDFS to Teradata.
- Extensive experience in ETL Architecture, Development, enhancement, maintenance, Production support, Data Modeling, Data profiling, Reporting including Business requirement, system requirement gathering.
- Hands-on experience in shell scripting. Knowledge of Amazon Web Services (AWS) cloud services.
- Proficient in using RDBMS concepts with Oracle, SQL Server, and MySQL.
- Experience in processing different file formats like XML, JSON and sequence file formats.
- Good knowledge of AWS services such as EMR and EC2, which provide fast and efficient processing of Big Data.
- Good Experience in creating Business Intelligence solutions and designing ETL workflows using Tableau.
- Knowledge of Enterprise Data Warehouse (EDW) architecture, data modeling concepts such as star schema and snowflake schema, and Teradata.
- Good working experience on different operating systems, including UNIX/Linux, Apple Mac OS X, and Windows.
- Experience working both independently and collaboratively to solve problems and deliver high quality results in a fast-paced, unstructured environment.
Sr. Hadoop / Spark Developer
Confidential, Woburn, MA
- Worked with Hortonworks distribution of Hadoop for setting up the cluster and monitored it using Ambari.
- Created ODBC connection through Sqoop between Hortonworks and SQL Server.
- Worked with ELK Stack cluster for importing logs into Logstash, sending them to Elasticsearch nodes and creating visualizations in Kibana.
- Used Apache Spark with the ELK cluster to produce specific visualizations that require more complex data processing and querying.
- Developed Spark applications using Scala and Java, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Used Spark to improve the performance of and optimize existing Hadoop algorithms, working with SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Experience in implementing Spark RDDs in Scala.
- Implemented Spark SQL for faster data processing and handled skewed data for real-time analysis in Spark.
- Implemented Spark applications in Scala, using DataFrames and the Spark SQL API for faster data processing.
- Wrote transformations in Apache Spark using DataFrames, Scala, and Spark SQL.
- Configured Spark Streaming to receive streaming data from Kafka and store it in HDFS.
- Migrated Flume-based real-time data ingestion to Spark and developed a Spark Streaming application in Java to consume data from Kafka and push it into Hive.
- Used Kafka features such as distribution, partitioning, and the replicated commit log service to maintain message feeds.
- Developed a NiFi workflow to pick up data from the data lake and an SFTP server and send it to a Kafka broker.
- Involved in loading data from REST endpoints into Kafka producers and transferring the data to Kafka brokers.
- Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering.
- Used Apache Zookeeper for configuration management and cluster coordination services.
- Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files.
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
- Involved in performance tuning of Spark jobs by using caching and taking full advantage of the cluster environment.
- Designed column families in Cassandra, ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra per business requirements.
- Tested cluster performance using the cassandra-stress tool to measure and improve reads and writes.
- Developed Sqoop jobs to load data from RDBMS and external systems into HDFS and Hive.
- Developed Oozie coordinators to schedule Pig and Hive scripts to create Data pipelines.
- Wrote several MapReduce jobs using the Java API and used Jenkins for continuous integration.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on them.
- Implemented Elasticsearch on the Hive data warehouse platform.
- Used HiveQL to analyze partitioned and bucketed data, and executed Hive queries on Parquet tables to perform data analysis that meets the business specification logic.
- Implemented ETL standards using proven data processing patterns and migrated tools from Informatica to Talend.
Environment: Hadoop, Spark, Spark Streaming, Spark SQL, AWS, HDFS, Hive, Pig, Apache Kafka, Sqoop, Java (JDK SE 6, 7), Scala, Shell scripting, Linux, MySQL, Jenkins, Oracle, Oozie, NiFi, Cassandra.
Sr. Hadoop Developer
Confidential, Norwalk, CT
- Involved in installing, configuring, and managing a Hadoop cluster using the Cloudera distribution CDH 5.0, and continuously monitored the cluster using Cloudera Manager.
- Using Kafka and Kafka brokers, initiated the Spark context and processed live streaming data with RDDs.
- Stored schemas of incoming data sources in the Kafka Schema Registry for use by downstream applications.
- Performed real-time processing of raw data stored in Kafka and stored the processed data in Hadoop using Spark Streaming (Streams).
- Responsible for loading data pipelines from web servers using Sqoop with Kafka and the Spark Streaming API.
- In the pre-processing phase, used Spark RDD transformations to remove missing data and create new features.
- Developed Spark SQL queries to generate statistical summaries and perform filtering and aggregation for specific use cases, working with Spark RDDs on a distributed cluster running Apache Spark.
- Involved in converting SQL queries into Apache Spark transformations using Apache Spark DataFrames.
- As part of data acquisition, used Sqoop and Flume to ingest data from servers into Hadoop using incremental imports.
- Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
- Installed, configured, and managed a Cassandra database and performed reads and writes using Java JDBC connectivity.
- Ingested data from relational databases such as MySQL, Oracle, and DB2 into HDFS using Sqoop, performed data transformations, and exported the transformed data to Cassandra per business requirements.
- Used Sqoop to import the data from databases to Hadoop Distributed File System (HDFS) and performed automated data auditing to validate the accuracy of the loads.
- Developed Oozie Scheduler jobs for performing daily imports using Sqoop incremental imports from Relational databases that store data from upstream servers.
- Configured, implemented, and supported high availability (replication) and load balancing (sharding) for a MongoDB cluster holding terabytes of data.
- Used Solr Search engine for performing full text searches with MongoDB as a data store.
- Created near-real-time Solr indexing on MongoDB (using Mongo connector from Mongo Labs) and on HDFS using the Solr Hadoop connector.
Environment: Apache Hadoop, HDFS, Cloudera, Sqoop, Apache Kafka, Oozie, SQL, Scala, Spark, Cassandra, MongoDB, Solr.
Confidential, Gulfport, MS
- Worked on analyzing the Hadoop cluster using big data analytic tools including Pig, Hive, Sqoop, Kafka, and Impala with the Cloudera distribution.
- Involved in complete Implementation lifecycle, specialized in writing custom Pig and Hive queries.
- Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
- Experienced in using Impala for faster processing of large datasets from Hadoop clusters and integrated it with BI tools to run ad-hoc queries directly on Hadoop.
- Involved in developing Impala scripts for extraction, transformation, loading of data into data warehouse.
- Experience in developing customized UDFs in Java to extend Hive and Pig Latin functionality.
- Created HBase tables to store various data formats of data coming from different sources.
- Managed and scheduled jobs using Oozie to remove duplicate log data files in HDFS.
- Developed Flume ETL job for ingesting log data from HTTP Source to HDFS sink.
- Used Flume extensively in gathering and moving log data files from Application Servers to HDFS.
- Implemented test scripts to support test driven development and continuous integration.
- Transferred data between HDFS and a MySQL database in both directions using Sqoop.
- Used the HDFS file system check utility (fsck) to check the health of files in HDFS.
- Developed the UNIX shell scripts for creating the reports from Hive data.
- Involved in the pilot of Hadoop cluster hosted on Amazon Web Services (AWS).
- Extensively used Sqoop to get data from RDBMS sources like Teradata and Netezza.
- Extracted files from MongoDB through Sqoop and placed in HDFS for processing.
- Ingested data from RDBMS, performed data transformations, and exported the transformed data to MongoDB per business requirements.
- Experience in creating scripts for data modeling and data import and export. Extensive experience in deploying, managing, and developing MongoDB clusters.
Environment: Apache Hadoop, Map Reduce, HDFS, Ambari, Hive, Sqoop, Apache Kafka, Oozie, SQL, Flume, Scala, Spark, Java, AWS, GitHub.
Confidential, Mobile, AL
- Installed and configured Hadoop ecosystem components and Cloudera Manager using the CDH distribution.
- Developed multiple MapReduce jobs in core Java for data cleansing and preprocessing.
- Developed Sqoop scripts to import/export data from Oracle to HDFS and into Hive tables.
- Worked on collecting and aggregating substantial amounts of log data using Flume and staging data in HDFS.
- Worked on analyzing Hadoop clusters using Big Data analytic tools including MapReduce, Pig, and Hive.
- Involved in creating tables in Hive and writing Hive queries (HQL) to load data into Hive tables from HDFS.
- Optimized the Hive tables using partitions and bucketing to give better performance for Hive QL queries.
- Worked on comparing Hive/HBase with RDBMS; imported data into Hive and created internal and external tables, partitions, indexes, views, queries, and reports for BI data analysis.
- Developed custom Java record readers, partitioners, and serialization techniques.
- Developed interactive shell scripts for scheduling various data cleansing and data loading processes.
- Used different data formats (Text format and Avro format) while loading the data into HDFS.
- Created tables in HBase and loaded data into them.
- Developed scripts to load data from HBase to Hive Meta store and perform Map Reduce jobs.
- Developed custom loaders and storage classes in Pig to work with data formats such as JSON, XML, and CSV, and generated bags for processing in Pig.
- Created custom UDFs in Pig and Hive.
- Created partitioned tables and loaded data using both static partition and dynamic partition methods.
- Installed the Oozie workflow engine and scheduled data- and time-dependent Hive and Pig jobs.
- Designed and developed dashboards for analytical purposes using Tableau.
- Analyzed Hadoop log files using Pig scripts to identify errors.
Environment: HDFS, Map Reduce, Hive, Sqoop, Pig, HBase, Oozie, CDH distribution, Java, Eclipse, Shell Scripts, Tableau, Windows, Linux.
Confidential, New York
- Involved in requirement gathering and business analysis, and translated business requirements into technical designs for Hadoop and Big Data.
- Configured and worked with a multi-node Hadoop cluster: installed Cloudera, Apache Hadoop, Hive, Pig, and Spark, and handled commissioning and decommissioning of data nodes, name node recovery, capacity planning, and slots configuration.
- Responsible for analyzing Hadoop cluster and different big data analytic tools including Pig, Hive and Spark.
- Imported and exported data between HDFS and various databases using Sqoop. Loaded data from the local file system to HDFS and from HDFS to the Linux file system.
- Performed architecture design, data modeling, and implementation of SQL and Big Data platforms and analytic applications for consumer products.
- Implemented test scripts to support test driven development and continuous integration.
- Working on troubleshooting, monitoring, tuning the performance of Map reduce Jobs.
- Responsible to manage data coming from different sources.
- Transformed the data by applying ETL processes using Hive with large sets of structured and semi structured data.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports using BIDS or SQL Server Data Tools.
- Managing different jobs using Fair scheduler.
- Used Pig built-in functions to convert fixed-width files to delimited files.
- Responsible for optimizing and tuning Hive, Pig, and Spark to improve performance, and for resolving performance issues in Hive and Pig scripts, drawing on a good understanding of joins, grouping, and aggregation.
- Wrote Spark code using Scala and Spark SQL for faster testing and data processing.
- Imported data from sources such as HDFS and HBase into Spark RDDs and exported data from Spark to other destinations.
- Experienced with SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs in Scala.
- Utilized Kafka for publish-subscribe messaging, where producers publish messages to a topic and consumers receive them by subscribing to that topic.
- Maintained various production databases and handled installation, configuration, and upgrades of SQL Server 2008 R2/2012/2014/2016, including service packs and hotfixes for MS SQL Server 2008/2012.
- Responsible for maintaining SQL Server high availability solutions at different levels, including SQL Server Failover Clustering, replication, database snapshots, log shipping, and database mirroring.
- Responsible for Capacity planning, immediate performance solution, Performance Tuning, Troubleshooting, Disaster Recovery, backup and restore procedures.
- Solid hands-on experience in performance monitoring with Activity Monitor, SQL Profiler, Performance Monitor, Database Engine Tuning Advisor, DMVs, and SQL Diagnostic Manager.
- Responsible for implementing different types of Replication Models such as Transactional, Snapshot, Merge and Peer to Peer.
- Extracted, transformed, and loaded (ETL) data from Big Data sources into SQL Server using SQL Server Integration Services (SSIS) packages in BIDS and SQL Server Data Tools.
- Tuned the performance of queries and stored procedures using graphical execution plans and Query Analyzer, and monitored performance using SQL Server Profiler and Database Engine Tuning Advisor.
- Conducted root cause analysis of application availability and narrowed down issues related to coding practices, database bottlenecks, or network latency.
- Created MS SSIS packages to execute the required tasks, and created and scheduled jobs to run daily.
- Experience in table design, indexing, partitioning, and full-text search, and in tuning the production server to improve performance.
- Kept DEV/QA/Test and production servers in sync. Installed and reviewed SQL Server patches and service packs.
Environment: Hadoop, HDFS, MR, Hive, Pig, Spark, Sqoop, HBase, Java, Scala, Shell Scripting, Linux Red Hat, multi-node Hadoop cluster, SQL Cluster, SQL Server 2012/2014 (SQL Server Management Studio).