- 8+ years of IT experience in software design, development, implementation, and support of business applications for the telecom, health, and insurance industries
- Experience in Big Data with Hadoop and Hadoop ecosystem components such as MapReduce, Sqoop, Flume, Kafka, Pig, Hive, Spark, Storm, HBase, Oozie, and Zookeeper
- Worked extensively on installing and configuring the Hadoop ecosystem components Hive, Sqoop, Pig, HBase, Zookeeper, and Flume
- Good knowledge of writing Spark applications in Scala and Python (PySpark)
- Hands-on experience with build and development tools such as Maven, Ant, Log4j, and JUnit
- Worked on data extraction, transformation, and loading using Hive, Pig, and HBase
- Hands-on experience designing and developing Spark applications in Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Experience in understanding Hadoop security requirements and integrating with Kerberos authentication and authorization infrastructure.
- Good understanding of Cassandra and MongoDB implementation.
- Implemented ETL operations on a Big Data platform
- Hands-on experience with streaming data ingestion and processing
- Experienced in designing different time-driven and data-driven automated workflows using Oozie.
- Used Conviva and Spark MLlib for predictive intelligence, customer segmentation, and smooth maintenance in Spark Streaming.
- Expertise in writing real-time processing applications using spouts and bolts in Storm.
- Experience in configuring Storm topologies to ingest and process data on the fly from multiple sources and aggregate it into a central Hadoop repository.
- Strong acumen in choosing efficient Hadoop ecosystem components and providing the best solutions to Big Data problems.
- Well versed in design and architecture principles for implementing Big Data systems.
- Experience in configuring Zookeeper to coordinate servers in clusters and maintain data consistency
- Skilled in data migration from relational databases to the Hadoop platform using Sqoop.
- Experienced in migrating ETL transformations using Pig Latin scripts, including transformations and join operations.
- Good understanding of MPP databases such as HP Vertica and Impala.
- Hands-on experience configuring and working with Flume to load data from multiple sources directly into HDFS
- Expertise in relational databases such as Oracle, MySQL, and SQL Server.
- Experienced in implementing projects in Agile and Waterfall methodologies.
- Well versed in the sprint ceremonies practiced in Agile methodology.
- Strong experience with data warehousing and ETL concepts using Informatica PowerCenter, OLAP, OLTP, and AutoSys.
- Highly involved in all phases of the SDLC, including analysis, design, development, integration, implementation, debugging, and testing of software applications in client-server, object-oriented, and web-based environments.
- Strong analytical and problem-solving skills; highly motivated team player with very good communication and interpersonal skills
Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, Spark, Ambari, Mahout, MongoDB, Cassandra, Avro, Storm, Parquet and Snappy.
Hadoop Distributions: Cloudera (CDH3, CDH4, and CDH5), Hortonworks, MapR and Apache
NoSQL Databases: Cassandra, MongoDB and HBase
Java Technologies: Servlets, JavaBeans, JSP, JDBC, JNDI, EJB and Struts
XML Technologies: XML, XSD, DTD, JAXP (SAX, DOM), JAXB
Methodology: Agile, Waterfall
Development / Build Tools: Eclipse, Ant, Maven, IntelliJ, JUnit and Log4j.
Frameworks: Struts, Spring and Hibernate
App/Web servers: WebSphere, WebLogic, JBoss and Tomcat
DB Languages: MySQL, PL/SQL, PostgreSQL and Oracle
RDBMS: Teradata, Oracle 9i, 10g, 11g, MS SQL Server, MySQL and DB2
Operating systems: UNIX, Linux, Mac OS and Windows variants
Data analytical tools: R and MATLAB
ETL Tools: Talend, Informatica, Pentaho
Sr. Hadoop/Spark Developer
Confidential, North Carolina
- Expertise in designing and deploying Hadoop clusters and various Big Data analytic tools, including Pig, Hive, HBase, Oozie, Sqoop, Flume, Spark, and Impala.
- Ingested data from relational databases into HDFS using Sqoop
- Designed column families in Cassandra.
- Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
- Worked with Spark to create structured data from the pool of unstructured data received.
- Implemented intermediate functionality, such as event and record counts from Flume sinks or Kafka topics, by writing Spark programs in Java and Scala.
- Wrote a Storm topology to accept events from a Kafka producer and emit them into Cassandra
- Understanding of Kerberos authentication in Oozie workflows for Hive and Cassandra
- Documented the requirements, including the existing code to be implemented using Spark, Hive, HDFS, and Elasticsearch.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
- Worked on creating real-time data streaming solutions using Apache Spark/Spark Streaming and Kafka.
- Worked with Kafka and Flume to build a robust, fault-tolerant data ingestion pipeline for transporting streaming data into HDFS.
- Experienced in transferring streaming data and data from different sources into HDFS and NoSQL databases
- Worked with the Talend ETL tool, using features such as context variables and components such as tOracleInput, tOracleOutput, tFileCompare, tFileCopy, and tOracleClose
- Created ETL mappings with Talend Integration Suite to pull data from the source, apply transformations, and load data into the target database.
- Prepared a Teradata ETL design with Teradata utilities and Informatica.
- Well versed in database and data warehouse concepts such as OLTP, OLAP, and star schema
- Developed Java code that creates mappings in Elasticsearch before data is indexed into it
- Maintained the ELK stack (Elasticsearch, Logstash, Kibana) and wrote Spark scripts using the Scala shell
- Used the secure global infrastructure and security features of AWS to protect data in the cloud
- Good experience with AWS Elastic Block Store (EBS), including choosing among EBS volume types based on requirements.
- Worked extensively with Amazon Web Services (AWS) cloud services such as EC2, S3, EBS, RDS, and VPC.
- Used a variety of AWS computing and networking services to meet the needs of applications
- Ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra per business requirements.
- Developed multiple Kafka producers and consumers from scratch per the software requirement specifications.
- Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
- Worked with Apache Spark, a fast, general-purpose engine for large-scale data processing, together with the functional programming language Scala.
- Implemented Spark applications in Scala using DataFrames and the Spark SQL API for faster data processing.
- Worked as Cassandra architect, with expertise in DataStax Enterprise and Apache Cassandra, Cassandra data modeling, installation, multi-data-center cluster setup with Solr and Spark enabled, and automation using Ansible.
- Exposure to Mesos (DC/OS) and Docker for automation
- Used container technology such as Docker along with Mesos and Aurora to manage a whole cluster of hosts
- Implemented Docker containers, created client-specific Docker images, and leveraged Apache Mesos to manage cluster hosts for applications
- Implemented custom apps in Docker on Mesos, managed by Marathon
- Experience in using Solr and Zookeeper technologies.
- Expert in benchmarking and load testing a Cassandra cluster using cassandra-stress, a Java-based stress testing utility.
- Streamed data in real time using Spark with Kafka.
- In-depth understanding of Cassandra cluster topology, including virtual and manual nodes.
- Performed Cassandra cluster planning, including data size estimation and identification of hardware requirements based on the estimated data size and transaction volume
- Prepared Solr and CQL queries to support microservices integration with the Cassandra cluster.
- Good knowledge of using NiFi to automate data movement between different Hadoop systems.
- Designed and implemented custom NiFi processors that reacted to and processed data for the data pipeline
- Experience designing and executing time-driven and data-driven Oozie workflows.
- Experience processing Avro data files using Avro tools and MapReduce programs.
- Created Cassandra tables to load large sets of semi-structured data coming from various sources.
- Worked on different file formats (ORCFILE, TEXTFILE) and different Compression Codecs (GZIP, SNAPPY, LZO).
- Processed input from multiple data sources in the same reducer using GenericWritable and MultipleInputs.
- Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.
Environment: Hadoop, Hive, Flume, MapReduce, Sqoop, Kafka, Spark, YARN, Pig, Cassandra, Oozie, Shell Scripting, Scala, Maven, Java, JUnit, Agile methodologies, MySQL
- Worked extensively on Hadoop components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN, Spark, and MapReduce programming
- Converted the existing relational database model to the Hadoop ecosystem.
- Worked with Linux systems and RDBMS on a regular basis to ingest data using Sqoop.
- Developed schedulers that communicated with cloud-based services (AWS) to retrieve data.
- Strong experience in working with Elastic MapReduce (EMR) and setting up environments on AWS EC2 instances.
- Ability to spin up different AWS instances, including EC2-Classic and EC2-VPC, using CloudFormation templates.
- Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS
- Imported data from different sources, such as AWS S3 and the local file system (LFS), into Spark RDDs.
- Experienced in working with Amazon Web Services (AWS) EC2 and S3 with Spark RDDs
- Managed and reviewed Hadoop and HBase log files.
- Worked extensively on importing metadata into Hive and migrating existing tables and applications to work on Hive and the AWS cloud.
- Designed and implemented Hive queries and functions for evaluation, filtering, loading, and storing of data.
- Successfully deployed a single Hadoop cluster and integrated it with Teradata using Teradata-Hadoop QueryGrid as a PoC.
- Loaded data from large data files into Hive tables and retrieved data from both Teradata and Hive by running queries from BTEQ.
- Analyzed table data and implemented compression techniques such as Teradata multi-value compression
- Involved in the ETL process from design and development through testing and migration to production environments.
- Involved in writing the ETL test scripts and guided the testing team in executing the test scripts.
- Involved in performance tuning of the ETL process by addressing various performance issues at the extraction and transformation stages.
- Wrote Hadoop MapReduce jobs to run on Amazon EMR clusters and created workflows for running jobs
- Generated analytics reports on probe data by writing Elastic MapReduce (EMR) jobs to run on an Amazon VPC cluster, using AWS Data Pipeline for automation.
- Worked with Elastic MapReduce (EMR) on Amazon Web Services (AWS).
- Good understanding of Teradata MPP architecture, including partitioning and primary indexes
- Good knowledge of Teradata Unity, Teradata Data Mover, OS PDE kernel internals, and backup and recovery
- Worked on huge datasets from Hive/Presto to understand and visualize data for sales analysis for top executives within the company.
- Created HBase tables to store data in variable formats coming from different portfolios.
- Created partitions and buckets based on state for further processing using bucket-based Hive joins.
- Involved in transferring data from mainframe tables to HDFS and HBase tables using Sqoop
- Created Hive tables and worked on them using HiveQL.
- Created and truncated HBase tables in Hue and took a backup of submitter ID
- Developed a data pipeline using Kafka to store data in HDFS.
- Used the Spark API over Hadoop YARN as the execution engine for data analytics with Hive.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Involved in review of functional and non-functional requirements.
- Developed an ETL process using Hive and HBase.
- Worked as an ETL Architect/ETL Technical Lead and provided the ETL framework solution for the delta process, hierarchy build, and XML generation.
- Prepared the technical specification document for ETL job development.
- Performed impact analysis on the existing ETL jobs as part of the FDW enhancements.
- Conducted performance testing for the enhancements and SLA improvement of ETL jobs.
- Experience configuring Storm to load data from MySQL into HBase using JMS
- Responsible for managing data coming from different sources.
- Loaded CDRs into the Hadoop cluster from relational databases using Sqoop and from other sources using Flume.
- Experience in processing large volumes of data and in parallel execution of processes using Talend functionality.
- Created column families in HBase
- Involved in loading data from UNIX file system and FTP to HDFS.
- Developed Hive queries to analyze the output data.
- Developed an Oozie workflow to automate loading data into HDFS and pre-processing it with Pig.
- Designed cluster coordination services through Zookeeper.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Used Hive to perform transformations, event joins, and some pre-aggregations before storing the data in HDFS.
- Prepared ad hoc Phoenix queries on HBase.
- Created secondary index tables using Phoenix on HBase tables
- Built near-real-time Solr indexes on HBase and HDFS
- Performed data analysis on HBase using Hive external tables
- Developed UDFs in Java to enhance the functionality of Pig and Hive scripts.
- Wrote a MapReduce program for data validation
- Designed and implemented Spark jobs to support distributed data processing.
- Supported the existing MapReduce programs running on the cluster.
- Wrote shell scripts to monitor the health of Hadoop daemon services and respond to any warning or failure conditions.
- Wrote Java code to format XML documents and upload them to the Solr server for indexing.
- Involved in Hadoop cluster tasks such as adding and removing nodes without affecting running jobs or data.
- Followed agile methodology for the entire project.
- Installed and configured Apache Hadoop, Hive and Pig environment.
Environment: Hadoop, HDFS, Pig, Hive, Flume, Sqoop, Oozie, Python, Shell Scripting, SQL, Talend, Spark, HBase, Elasticsearch, Linux (Ubuntu), Kafka.
- Worked on importing data from various sources and performed transformations using MapReduce and Hive to load data into HDFS
- Responsible for building scalable distributed data solutions using Hadoop. Wrote various Hive and Pig scripts
- Created HBase tables to store data in variable formats coming from different portfolios. Performed real-time analytics on HBase using the Java API and REST API.
- Deployed Hadoop clusters in standalone, pseudo-distributed, and fully distributed modes
- Implemented NameNode backup using NFS for high availability
- Integrated NoSQL databases such as HBase with MapReduce to move bulk data into HBase.
- Logical implementation and interaction with HBase
- Assisted in creating large HBase tables using large data sets from various portfolios
- Efficiently put and fetched data to/from HBase by writing MapReduce jobs
- Experienced with scripting languages such as Python and shell.
- Experienced in data analytics, web scraping, and data extraction in Python
- Developed various Python scripts to find vulnerabilities in SQL queries through SQL injection testing, permission checks, and performance analysis.
- Worked as POC for the initial MongoDB clusters, Hadoop clusters, and various Teradata servers; successfully tested and onboarded them and performed basic admin and DBA tasks.
- Experience handling service requests through MongoDB Jira to resolve issues.
- Monitored document growth and estimated storage size for large MongoDB clusters based on data life cycle management.
- Querying the MongoDB database using JSON.
- Expert knowledge of MongoDB NoSQL data modeling, tuning, disaster recovery, and backup.
- Designed and implemented database cloning using Python and built backend support for applications using shell scripts
- Worked with the Ansible Python API to control nodes and extend various Python events
- Implemented frameworks using Java and Python to automate the ingestion flow.
- Implemented Hive UDFs to validate against business rules before data moves into Hive tables.
- Experienced in joining different data sets with Pig join operations to run queries using Pig scripts.
- Experienced with Pig Latin operations and writing Pig UDFs to perform analytics.
- Cluster coordination services through Zookeeper
- Implemented Unix shell scripts to perform cluster admin operations.
- Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports
- Worked on the Ingestion of Files into HDFS from remote systems using MFT (Managed File Transfer).
- Experience in monitoring and managing Cassandra cluster.
- Analyzed weblog data using HiveQL and integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
- Wrote Pig scripts for sorting, joining, and grouping data
- Experienced in working with Avro data files using the Avro serialization system.
- Solved the small-files problem using SequenceFile processing in MapReduce.
- Developed Shell scripts to automate routine DBA tasks
Environment: HDFS, MapReduce, Pig, Hive, Sqoop, Flume, HBase, Java, Maven, Avro, Cloudera, Eclipse and Shell Scripting
- Designed and implemented the training and reports modules of the application using Servlets, JSP, and Ajax
- Interacted with business users and developed custom reports based on the defined criteria.
- Gathered requirements and collected information, then analyzed it to prepare a detailed work plan and task breakdown structure
- Developed custom JSP tags for the application
- Involved in the phases of the SDLC (Software Development Life Cycle), including requirement collection, design and analysis of customer specifications, and development and customization of the application
- Used Quartz schedulers to run jobs sequentially within the given time
- Strong hands-on experience using Teradata utilities (FastExport, MultiLoad, FastLoad, TPump, BTEQ and QueryMan).
- Familiar with creating secondary indexes and join indexes in Teradata.
- Implemented physical modeling on Teradata, including creation of tables, indexes, and views, and normalization.
- Implemented the reports module applications using JasperReports for business intelligence
- Deployed the business application on a Tomcat server at the client location
- End-to-end system development and testing, including unit integration and system integration
- Coordinated with onshore and offshore teams of 10+ members
- Responsible for effort estimation and timely production deliveries
- Created and executed half-yearly and yearly load jobs that update rates, discounts, and similar values for claim calculations in the database and files
- Extensively used Java multi-threading to implement batch jobs with JDK 1.5 features
- Configured the project on WebLogic 10.3 application servers
- Implemented the online application using Core Java, JDBC, JSP, Servlets, Spring, Hibernate, Web Services, SOAP, and WSDL
- Involved in the analysis, design, implementation, and testing of the project.
- Implemented Singleton, Factory, and DAO design patterns based on the application requirements
- Used SAX and DOM parsers to parse the raw XML documents
- Used RAD as the development IDE for web applications
- Developed mappings in Informatica to load data, including facts and dimensions, from various sources into the data warehouse, using transformations such as Source Qualifier, Java, Expression, Lookup, Aggregator, Update Strategy, and Joiner.
- Scheduled sessions to extract, transform, and load data into the warehouse database based on business requirements.
- Deployed applications on JBoss 4.0 server
- Extensively configured the build files for different environments
- Developed Session Beans that encapsulate the workflow logic
- Used Entity Beans to persist data to the database
- Used JMS to establish message communication
- Responsible for the performance of PL/SQL procedures and SQL queries
- Used CVS for concurrent team development and as the code repository
- Developed web components using JSP, Servlets and JDBC.
Environment: Java, J2EE, Servlets, Struts 1.1, Spring, JSP, JMS, JBoss 4.0, SQL Server 2000, Ant, CVS, PL/SQL, Hibernate, Eclipse, Linux