- 9+ years of experience in the IT industry, including Big Data environments, the Hadoop ecosystem, and the design, development, and maintenance of various applications.
- Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL).
- Expertise in core Java and JDBC, and proficient in using Java APIs for application development.
- Good experience in Tableau for Data Visualization and analysis on large data sets, drawing various conclusions.
- Leveraged and integrated Google Cloud Storage and BigQuery applications, which connected to Tableau for end-user web-based dashboards and reports.
- Good working experience with application and web servers such as JBoss and Apache Tomcat.
- Good knowledge of Amazon Web Services (AWS) concepts such as EMR and EC2, which provide fast and efficient processing for Teradata Big Data analytics.
- Expertise in Big Data architectures such as Hadoop distributed systems (Azure, Hortonworks, Cloudera), MongoDB, and NoSQL.
- Hands-on experience with Hadoop/Big Data technologies for storage, querying, processing, and analysis of data.
- Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.
- Experience in development of Big Data projects using Hadoop, Hive, HDP, Pig, Flume, Storm, and MapReduce open source tools.
- Experience in installation, configuration, supporting and managing Hadoop clusters.
- Experience working with MapReduce programs using Apache Hadoop to process Big Data.
- Experience in installation, configuration, supporting and monitoring Hadoop clusters using Apache, Cloudera distributions and AWS.
- Strong hands-on experience with AWS services, including EMR, S3, EC2, Route 53, RDS, ELB, DynamoDB, and CloudFormation.
- Hands-on experience with the Hadoop ecosystem, including Spark, Kafka, HBase, Scala, Pig, Impala, Sqoop, Oozie, Flume, Storm, and other big data technologies.
- Worked with Spark SQL, Spark Streaming, and the core Spark API to build data pipelines.
- Successfully loaded files to HDFS from Oracle, SQL Server, Teradata, and Netezza using Sqoop.
- Excellent understanding of Big Data infrastructure: distributed file systems (HDFS) and parallel processing (MapReduce framework).
- Extensive knowledge of IDE tools such as MyEclipse, RAD, IntelliJ, and NetBeans.
- Expert in Amazon EMR, Spark, Kinesis, S3, ECS, ElastiCache, DynamoDB, and Redshift.
- Experience in installation, configuration, supporting and managing the Cloudera Hadoop platform, including CDH4 and CDH5 clusters.
- Experience in working with different data sources like Flat files, XML files and Databases.
- Experience in database design, entity relationships, database analysis, SQL programming, and PL/SQL stored procedures, packages, and triggers in Oracle.
Big Data Ecosystem: MapReduce, HDFS, Hive 2.3, HBase 1.2, Pig, Sqoop, Flume 1.8, HDP, Oozie, Zookeeper, Spark, Kafka, Storm, Hue
Hadoop Distributions: Cloudera (CDH3, CDH4, CDH5), Hortonworks
Cloud Platform: Amazon AWS, EC2, Redshift
Databases: Oracle 12c/11g, MySQL, MS SQL Server 2016/2014
Java/J2EE Technologies: Servlets, JSP, JDBC, JSTL, EJB, JAXB, JAXP, JMS, JAX-RPC, JAX-WS
NoSQL Databases: HBase and MongoDB
Programming Languages: Java, Python, SQL, PL/SQL, HiveQL, UNIX Shell Scripting, Scala.
Methodologies: Software Development Lifecycle (SDLC), Waterfall Model and Agile, STLC (Software Testing Life cycle) & UML, Design Patterns (Core Java and J2EE)
Operating Systems: Windows, UNIX/Linux and Mac OS.
Build Management Tools: Maven, Ant.
IDE & Command line tools: Eclipse, IntelliJ, Toad and NetBeans.
Confidential - Sun Prairie, WI
Sr. Big Data Developer
- Analyzed large and critical datasets using Cloudera, HDFS, HBase, MapReduce, Hive, Hive UDF, Pig, Sqoop, Zookeeper and Spark.
- Developed Spark applications in Scala and Java, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Used Spark to improve the performance of existing Hadoop algorithms, working with SparkContext, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build a common learner data model that receives data from Kafka in near real time and persists it to Cassandra.
- Developed a Kafka consumer API in Scala for consuming data from Kafka topics.
- Consumed XML messages from Kafka and processed them using Spark Streaming to capture UI updates.
- Developed a preprocessing job using Spark DataFrames to flatten JSON documents into flat files.
- Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response.
- Wrote real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline.
- Worked with AWS cloud services such as EC2, S3, EBS, RDS, and VPC.
- Migrated an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for processing and storing small data sets, and maintained the Hadoop cluster on AWS EMR.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Gathered business requirements from business partners and subject matter experts.
- Developed an environmental search engine using Java, Apache Solr, and Cassandra.
- Managed Solr search engine work including indexing data, tuning relevance, developing custom tokenizers and filters, and adding functionality such as playlists, custom sorting, and regionalization.
- Ingested data from RDBMS, performed data transformations, and exported the transformed data to Cassandra as per the business requirements.
- Wrote multiple MapReduce programs for data extraction, transformation, and aggregation from multiple file formats including XML, JSON, CSV, and other compressed file formats.
- Developed automated processes for flattening the upstream data from Cassandra, which is in JSON format. Used Hive UDFs to flatten the JSON data.
- Optimized MapReduce jobs to use HDFS efficiently by using various compression mechanisms.
- Developed Pig UDFs to manipulate data according to business requirements, developed custom Pig loaders, and implemented various requirements using Pig scripts.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data.
- Created a POC using Spark SQL and MLlib libraries.
- Developed a Spark Streaming module for consumption of Avro messages from Kafka.
- Implemented different machine learning techniques in Scala using the Scala machine learning library.
- Queried data using Spark SQL on top of the Spark engine and implemented Spark RDDs in Scala.
- Wrote Scala code using higher-order functions for iterative algorithms in Spark, with performance in mind.
- Managed and reviewed Hadoop log files.
- Worked with different file formats such as TEXTFILE, AVROFILE, ORC, and PARQUET for Hive querying and processing.
- Created and maintained Teradata tables, views, macros, triggers, and stored procedures.
- Monitored workload and job performance and performed capacity planning using the Cloudera distribution.
- Worked on loading data into Hive for data ingestion history and data content summary.
- Involved in developing Impala scripts for extraction, transformation, loading of data into data warehouse.
- Used Hive and Impala to query the data in HBase.
- Created Impala tables and SFTP scripts and Shell scripts to import data into Hadoop.
- Developed HBase Java client API code for CRUD operations.
- Created Hive tables, loaded data, and wrote Hive UDFs, including Hive UDFs for rating aggregation.
- Developed Java APIs for retrieval and analysis on NoSQL databases such as HBase and Cassandra.
- Provided ad-hoc queries and data metrics to business users using Hive and Pig.
- Applied performance optimizations such as using the distributed cache for small datasets, partitioning and bucketing in Hive, and map-side joins.
- Worked on importing and exporting data from Oracle and DB2 into HDFS and HIVE using Sqoop for analysis, visualization and to generate reports.
- Handled file movements between HDFS and AWS S3 and worked extensively with S3 buckets in AWS.
- Used AWS services to manage the application in the cloud and to create or modify instances.
- Created a data pipeline for ingestion and aggregation events that loads consumer response data from an AWS S3 bucket into Hive external tables in HDFS to feed Tableau dashboards.
- Used EMR (Elastic MapReduce) to perform big data operations in AWS.
- Wrote Python applications on Apache Spark to parse and convert txt and xls files.
- Developed Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into RDBMS through Sqoop.
- Installed the application on AWS EC2 instances and configured the storage on S3 buckets.
- Loaded data from different sources (databases and files) into Hive using Talend.
- Implemented Spark using Python/Scala, utilizing Spark Core, Spark Streaming, and Spark SQL for faster data processing than MapReduce in Java.
- Integrated Apache Kafka with Apache Spark for real-time processing.
- Used Apache Kafka to develop a data pipeline that carries logs as a stream of messages using producers and consumers.
- Scheduled Oozie workflows to run multiple Hive and Pig jobs, which run independently based on time and data availability.
- Worked on custom Pig loaders and storage classes to handle a variety of data formats such as JSON and compressed CSV.
- Ran Hadoop Streaming jobs to process terabytes of data.
- Used JIRA for bug tracking and CVS for version control.
Confidential, Austin, Texas
Sr. Big Data Developer
- Contributed to the development of key data integration and advanced analytics solutions leveraging Apache Hadoop and other big data technologies for leading organizations, using major Hadoop distributions such as Hortonworks.
- Participated in Agile practices including daily Scrum meetings and sprint planning.
- Performed data transformations in Hive and used partitions and buckets for performance improvements.
- Created Hive external tables on the MapReduce output, then applied partitioning and bucketing on top of them.
- Developed business-specific custom UDFs in Hive and Pig.
- Developed end-to-end architecture designs for big data solutions based on a variety of business use cases.
- Worked as a Spark expert and performance optimizer.
- Member of the Spark COE (Center of Excellence) in the Data Simplification project at Cisco.
- Worked with SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Handled data skew in Spark SQL.
- Implemented Spark jobs using Scala and Java, utilizing DataFrames and the Spark SQL API for faster processing of data.
- Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
- Developed a data pipeline using Kafka, HBase, Spark, and Hive to ingest, transform, and analyze customer behavioral data; also developed Spark and Hive jobs to summarize and transform data.
- Worked extensively on importing metadata into Hive and migrated existing tables and applications to work on Hive and Spark.
- Implemented Sqoop imports from Oracle to Hadoop and loaded the data back in Parquet format.
- Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for large volumes of data; worked with the MapR distribution and HDFS.
- Imported data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and loaded the results into HDFS.
- Collected and aggregated large amounts of log data using Flume and staged it in HDFS for further analysis.
- Designed and maintained Tez workflows to manage the flow of jobs in the cluster.
- Worked with the testing teams to fix bugs and ensure smooth, error-free code.
- Prepared documents such as functional specification and deployment instruction documents.
- Fixed defects during the QA phase, supported QA testing, troubleshot defects, and identified their sources.
- Involved in installing Hadoop Ecosystem components (Hadoop, MapReduce, Spark, Pig, Hive, Sqoop, Flume, Zookeeper and HBase).
- Worked collaboratively with all levels of business stakeholders to architect, implement and test Big Data based analytical solution from disparate sources.
Environment: AWS S3, RDS, EC2, Redshift, Hadoop 3.0, Hive 2.3, Pig, Sqoop 1.4.6, Oozie, Hbase 1.2, Flume 1.8, Hortonworks, MapReduce, Kafka, HDFS, Oracle 12c, Microsoft, Java, GIS, Spark 2.2, Zookeeper
Confidential, Rocky Hill, CT
Sr. Spark/Hadoop Developer
- Implemented solutions for ingesting data from various sources and processing the data at rest utilizing Big Data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive.
- Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.
- Developed the full SDLC of an AWS Hadoop cluster based on the client's business needs.
- Loaded and transformed large sets of structured, semi-structured, and unstructured data from relational databases into HDFS using Sqoop imports.
- Implemented an enterprise-grade platform (MarkLogic) for ETL from mainframe to NoSQL (Cassandra).
- Responsible for importing log files from various sources into HDFS using Flume.
- Analyzed data using HiveQL to generate per-payer reports from payment summaries for transmission to payers.
- Imported millions of structured records from relational databases using Sqoop, processed them with Spark, and stored the data in HDFS in CSV format.
- Used the DataFrame API in Scala to convert distributed collections of data into DataFrames organized by named columns.
- Performed data profiling and transformation on the raw data using Pig, Python, and Java.
- Developed predictive analytics using the Apache Spark Scala APIs.
- Performed big data analysis using Pig and user-defined functions (UDFs).
- Created Hive external tables, loaded data into them, and queried the data using HQL.
- Implemented Spark GraphX application to analyze guest behavior for data science segments.
- Enhanced a traditional data warehouse based on a star schema, updated data models, and performed data analytics and reporting using Tableau.
- Migrated data from existing RDBMS (Oracle and SQL Server) to Hadoop using Sqoop for processing.
- Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
- Developed a prototype for big data analysis using Spark, RDDs, DataFrames, and the Hadoop ecosystem with CSV, JSON, Parquet, and HDFS files.
- Developed HiveQL scripts for performing transformation logic and also loading the data from staging zone to landing zone and Semantic zone.
- Created Oozie workflow and coordinator jobs to kick off Hive jobs on time, based on data availability.
- Used the Oozie scheduler to automate the pipeline workflow and orchestrate the Sqoop, Hive, and Pig jobs that extract the data in a timely manner.
- Exported the generated results to Tableau for testing by connecting to the corresponding Hive tables using Hive ODBC connector.
- Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization and user report generation.
- Managed and led the development effort with the help of a diverse internal and overseas group.
Confidential, Tampa, FL
- Developed Sqoop scripts for the extractions of data from various RDBMS databases into HDFS.
- Developed scripts to automate the workflow of various processes using python and shell scripting.
- Collected and aggregated large amounts of log data using Apache Flume and staged it in HDFS for further analysis.
- Wrote Hive join queries to fetch information from multiple tables and wrote multiple MapReduce jobs to collect output from Hive.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
- Used AWS Cloud and On-Premise environments with Infrastructure Provisioning/ Configuration.
- Wrote Perl scripts covering data feed handling, implementing MarkLogic, and communicating with web services through the SOAP::Lite module and WSDL.
- Developed data pipelines using Pig and Hive from Teradata and DB2 data sources. These pipelines used customized UDFs to extend the ETL functionality.
- Used UDFs to implement business logic in Hadoop, using Hive to read, write, and query the Hadoop data in HBase.
- Installed and configured Hadoop ecosystem components such as Hive, Oozie, and Sqoop on a Cloudera Hadoop cluster, and helped with performance tuning and monitoring.
- Used the Oozie workflow engine to run multiple Hive and Pig scripts, used Kafka for real-time processing of data, and loaded log file data directly into HDFS using Flume.
- Developed an end-to-end workflow to build a real-time dashboard using Kibana, Elasticsearch, Hive, and Flume.
- Used Hive to analyze data ingested into HBase by using Hive-HBase integration and computed various metrics for reporting on the dashboard.
- Developed MapReduce programs, wrote queries, and scheduled MapReduce jobs.
- Developed code for importing and exporting data into HDFS and Hive using Sqoop.
- Used Oozie to design workflows and schedule various jobs in the Hadoop ecosystem.
- Developed MapReduce programs in Java to apply business rules to the data and optimized them using various compression formats and combiners.
- Used Spark SQL to create DataFrames by loading JSON data and analyzed it.
- Developed Spark code using Scala and Spark-SQL for faster testing and data processing.
- Installed and configured Hadoop; maintained the cluster and managed and reviewed Hadoop log files.
- Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
Environment: Hadoop, Hive, Zookeeper, MapReduce, Sqoop, Pig 0.10 and 0.11, JDK 1.6, HDFS, Flume, Oozie, DB2, HBase, Mahout, Scala.
Confidential - Paso Robles, CA
Sr. Java/J2EE Developer
- Involved in a full life cycle Object Oriented application development - Object Modeling, Database Mapping, GUI Design.
- Developed the J2EE application based on the Service Oriented Architecture.
- Used Design Patterns like Singleton, Factory, Session Facade and DAO.
- Handled importing of data from various data sources, performed transformations using Hive and loaded data into HDFS.
- Developed applications in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Orchestrated hundreds of Sqoop scripts, python scripts, Hive queries using Oozie workflows and sub-workflows.
- Moved all crawl data flat files generated from various retailers to HDFS for further processing.
- Wrote script files for processing data and loading it into HDFS.
- Worked on requirement gathering, analysis and translated business requirements into technical design with Hadoop Ecosystem.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Created Java classes and interfaces to implement the system.
- Created Hive tables to store the processed results in a tabular format.
- Involved in gathering the requirements, designing, development and testing.
- Fully involved in the requirements analysis phase.
- Created External Hive Table on top of parsed data.
- Developed various complex Hive queries as per business logic.
- Developed complex Hive queries using joins and partitions for huge data sets as per business requirements, loaded the filtered data from source to edge-node Hive tables, and validated the data.
- Performed bucketing and partitioning of data using Apache Hive, which reduced processing time and produced proper sample insights.
- Optimized the Hive tables using optimization techniques like partitions and bucketing to provide better performance with Hive queries.
- Used Log4j utility to generate run-time logs.
- Wrote SAX and DOM XML parsers and used SOAP for sending and getting data from the external interface.
- Deployed business components into WebSphere Application Server.
- Developed Functional Requirement Document based on users' requirement.
Environment: Core Java, J2EE, JDK 1.6, Spring 3.0, Hibernate 3.2, Tiles, AJAX, JSP 2.1, Eclipse 3.6, IBM WebSphere 7.0, XML, XSLT, SAX, DOM Parser, HTML, UML, Oracle 10g, PL/SQL, JUnit.
- Implemented Spring MVC architecture and Spring Bean Factory using IOC, AOP concepts.
- Gathered the requirements and designed the application flow for the application.
- Wrote Maven build scripts for building and configuring the application.
- Developed Action classes for the system as a feature of Struts.
- Performed both server-side and client-side validations.
- Developed EJB components to implement business logic using session and message-driven beans.
- Used the Spring Framework to integrate with the Struts web framework and Hibernate.
- Extensively worked with Hibernate to connect to database for data persistence.
- Integrated Activate Catalog to get parts using JMS.
- Used Log4j to log both user interface and domain-level messages.
- Extensively worked with Struts for middle tier development with Hibernate as ORM and Spring IOC for Dependency Injection for the application based on MVC design paradigm.
- Created the struts-config.xml file to manage the page flow.
- Developed HTML views with HTML, CSS, and JavaScript.
- Performed Unit testing for modules using JUnit.
- Played an active role in preparing documentation for future reference and upgrades.
- Responsible for delivering potentially shippable product increments at the end of each Sprint.
- Involved in Scrum meetings that allow clusters of teams to discuss their work, focusing especially on areas of overlap and integration.
Environment: Java 1.4, JSP, Servlets, JavaScript, HTML 5, AJAX, JDBC, JMS, EJB, Struts 2.0, Spring 2.0, Hibernate 2.0, Eclipse 3.x, WebLogic 9, Oracle 9i, JUnit, Log4j