- 7+ years of professional experience in IT, including Big Data and Hadoop. Well versed in installing, configuring, supporting, and managing Big Data workloads and the underlying infrastructure of Hadoop clusters.
- 4+ years of work experience in ingestion, storage, querying, processing, and analysis of Big Data, with hands-on experience in Hadoop ecosystem technologies such as MapReduce, Hive, Spark, Cloudera Navigator, Mahout, HBase, Pig, ZooKeeper, Sqoop, Flume, Oozie, and HDFS.
- Good experience in writing MapReduce jobs using native Java, Pig, and Hive for various business use cases.
- Excellent understanding of Hadoop architecture and the different daemons of a Hadoop cluster, including JobTracker, TaskTracker, NameNode, and DataNode.
- Worked with Amazon Web Services (AWS), using Amazon S3 to download ad hoc feed data.
- Experience in developing scalable solutions using NoSQL databases including HBase, Cassandra, MongoDB, and CouchDB.
- Responsible for performing reads and writes in Cassandra from a web application using Java JDBC connectivity.
- Extracted files from NoSQL databases like CouchDB and HBase through Flume and placed them in HDFS for processing.
- Handled data movement, transformation, analysis, and visualization across the data lake by integrating it with various tools.
- Strong command of Informatica PowerCenter, Oracle, Vertica, Hive, SQL Server, shell scripting, and QlikView.
- Very good understanding and working knowledge of object-oriented programming (OOP), Python, and Scala.
- In-depth knowledge of Hadoop architecture and components such as HDFS, MapReduce, HDFS Federation (Hadoop 2), High Availability, and the YARN architecture, with a good understanding of workload management, scalability, and distributed platform architectures.
- Have good experience in using Pig Latin operators such as LOAD, STORE, DUMP, FILTER, DISTINCT, FOREACH, GENERATE, GROUP, COGROUP, ORDER, LIMIT, UNION, SPLIT to extract data from data files to load into HDFS.
- Good experience working with different Hadoop file formats like SequenceFile, RCFile, ORC, Avro, and Parquet.
- Extensively used Java and J2EE technologies including Core Java, JavaBeans, Servlets, JSP, Spring, Hibernate, JDBC, JSON objects, and design patterns.
- Good experience working on Tableau & Spotfire & enabled the JDBC/ODBC data connectivity from those to Hive tables.
- Deep understanding of Tableau features such as site and server administration, calculated fields, table calculations, parameters, filters (normal and quick), highlighting, level of detail, granularity, aggregation, reference lines, and more.
- Working knowledge of Scrum, Agile, and Waterfall methodologies.
- Extensive experience working with IDEs like Eclipse, NetBeans, and EditPlus.
- Good working experience in designing Oozie workflows for cleaning data and storing into Hive tables for quick analysis.
- Integrated Oozie with Hue and scheduled workflows for multiple Hive, Pig, and Spark jobs.
- In-depth knowledge of Scala and experience building Spark applications using Scala.
- Imported data from different sources like HDFS and HBase into Spark RDDs.
- Knowledge of processing and analyzing real-time data streams using Kafka and HBase.
- Experience in working with Apache Sqoop to import and export data to and from HDFS and Hive.
- Responsible for performing extensive data validation using Hive dynamic partitioning and bucketing.
- Experience in writing UDFs (user-defined functions) to enhance the functionality of Hive and Pig.
- Possess strong communication and Interpersonal skills. Proven success in initiating, promoting and maintaining strong interpersonal relations. Can quickly master and work on new concepts and applications with minimal supervision.
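The MapReduce experience summarized above can be illustrated with a minimal, framework-independent sketch: the mapper and reducer below follow the Hadoop Streaming word-count pattern, with the shuffle/sort phase simulated locally. All function names are illustrative, not from any specific project.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (word, 1) pairs for each token in an input line."""
    for word in line.strip().lower().split():
        yield word, 1

def reducer(word, counts):
    """Sum all counts for one key; the shuffle guarantees grouping."""
    return word, sum(counts)

def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce on a local iterable."""
    mapped = [kv for line in lines for kv in mapper(line)]
    mapped.sort(key=itemgetter(0))  # stands in for the shuffle/sort phase
    return dict(reducer(k, (c for _, c in g))
                for k, g in groupby(mapped, key=itemgetter(0)))

result = run_job(["big data hadoop", "hadoop hive hadoop"])
```

In a real Hadoop Streaming job the mapper and reducer would read stdin and write tab-separated key/value lines; the logic per record is the same.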
Big Data Technologies: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Impala, Oozie, Zookeeper, Spark, Storm, Drill, Ambari, Mahout, MongoDB, Cassandra, Avro, Parquet and Snappy.
Programming Languages: Java, Python, Scala, C, C++, MATLAB, SAS, PHP, SQL, PL/SQL.
Databases: Oracle 11g/10g, DB2, MS SQL Server, MySQL, Teradata.
NoSQL/ETL Tools: Cassandra, HBase, Elasticsearch, Alteryx.
Operating Systems: Windows, UNIX, Linux, Mac OS.
Software Life Cycles: SDLC, Waterfall and Agile models
Office Tools: MS Office, MS Project, Risk Analysis tools, Visio
Utilities/Tools: Eclipse, Tomcat, NetBeans, JUnit, SQL, SOAP UI, Ant, Maven, automation tools, and MRUnit
Cloud Platforms: Amazon EC2
Version Control: CVS, Tortoise SVN
Reporting Tools: Tableau
App/Web servers: WebSphere, WebLogic, and Tomcat
Senior Hadoop Developer
Confidential, Santa Ana, CA
- Designed and developed the Data Lake enterprise gold conformed layer process, which is available to the consumption team and business users to perform analytics.
- Responsible for designing and developing a framework that automates the development process in the Data Lake.
- Integrated Talend with HBase for storing the processed Enterprise Data into separate column families and column qualifiers.
- Used crontab and Zena scheduling to schedule and trigger jobs in production.
- Worked with cross functional consulting teams within the data science and analytics team to design, develop, and execute solutions to derive business insights and solve clients' operational and strategic problems.
- Involved in migration of Teradata queries into Snowflake data warehouse queries.
- Worked in Agile Scrum model and involved in sprint activities.
- Gathered and analyzed business requirements.
- Worked on various Talend integrations with HBase (Avro format), Hive, Phoenix, and Pig components.
- Worked with GitHub, Zena, Jira, and Jenkins, and deployed the projects into production environments.
- Involved in Cluster coordination services through Zookeeper.
- Worked on integration with Phoenix thick and thin clients; also involved in installing and developing Phoenix-Hive and Hive-HBase integrations.
- Wrote UNIX Automated Shell scripts and developed an automation framework with Talend and UNIX.
- Created Merge, Update, Delete Scripts in Hive and worked on performance tuning Joins in Hive.
- Extensive Working knowledge of partitioned table, UDFs, performance tuning, compression-related properties, thrift server in Hive.
- Involved in writing, developing, and testing optimized Pig Latin scripts.
- Working knowledge in writing Pig's Load and Store functions.
- Good experience in writing data ingesters and complex MapReduce jobs in Java for data cleaning and pre-processing, and in fine-tuning them per data set.
Environment: Hadoop, HDFS, Hive, QlikView, UNIX shell scripting, Hue, Hbase, Avro Format, Phoenix, Talend, Snowflake
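The HBase storage work in this role hinges on row-key design. A common pattern is a salted composite row key that spreads sequential IDs across region servers; the sketch below shows that idea in plain Python. The key layout (`<salt>|<business_id>|<record_date>`) and all names are assumptions for illustration, not the project's actual schema.

```python
import hashlib

NUM_SALT_BUCKETS = 16  # illustrative bucket count

def make_row_key(business_id: str, record_date: str) -> str:
    """Prefix a hash-derived salt so sequential IDs spread across regions
    instead of hot-spotting a single region server."""
    digest = int(hashlib.md5(business_id.encode()).hexdigest(), 16)
    salt = digest % NUM_SALT_BUCKETS
    return f"{salt:02d}|{business_id}|{record_date}"

def split_row_key(row_key: str):
    """Recover the components; since the salt is recomputable from the
    business id, per-business scans enumerate all salt buckets."""
    salt, business_id, record_date = row_key.split("|")
    return int(salt), business_id, record_date

key = make_row_key("B10001", "2017-06-30")
```

The same salt computation would be applied on both the write path and any scan that needs to fan out across buckets.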
Senior Bigdata/Hadoop Developer
Confidential, Columbus, OH
- Key contributor in designing the batch-mode database build using an HBase table that serves as a unified source of information for all prospecting channels. The batch-mode database build is done once every month, per the campaign calendar.
- Developed various MapReduce and Hive jobs to load data from lead sources (called listcodes) into three HBase tables: PIN, EDB, and MDB. The PIN DB holds customers' personal data; EDB, the Executive Database, holds executive information at the business level; and MDB, the Master Database, holds complete information about the businesses.
- Developed MapReduce jobs to handle various arbitrations, which scan the HBase tables, perform business operations (such as finding top authorizing officers, top businesses, multiple-business owners, and time series of sales and employees), and generate arbitrated (calculated) columns that are loaded back into the HBase table.
- Developed a Hive extract job to create Hive tables from the HBase table. The business team and the data analysts use the Hive tables to perform ad-hoc queries on the data.
- Took an active part in minimizing the total build time of the DB by combining HBase loads from different listcodes into a single MapReduce job.
- Participated in various design discussions to move from batch loading of the HBase DB to Real time loading of the DB.
- Used Kafka as the streaming platform to hold the listcode data as and when the new file arrives.
- Used Spark Streaming with Scala to consume real-time data from the Kafka cluster and run several arbitrations. The base data and the arbitrated data are loaded into the HBase tables PIN, EDB, and MDB.
- Used Spark DataFrames to represent the listcode data.
- Used Spark SQL and joins to arbitrate the data as per the individual business logic.
- Downloaded feeds from Amazon S3 and ran QC programs to find the data statistics. Performed extensive data validation using Hive and wrote Hive UDFs.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Extensive scripting (Python and shell) to provision and spin up virtualized Hadoop clusters.
- Added, decommissioned, and rebalanced cluster nodes.
- Expert in optimizing performance in Hive using partitioning and bucketing.
Environment: MapR 5.2, Java, MapReduce, YARN, HBase, Hive, Kafka, Spark, Scala, ZooKeeper, AWS.
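The arbitration columns described in this role (e.g. "Top Authorizing Officers") are essentially grouped top-k aggregations. The sketch below shows that logic in plain Python with hypothetical records and field names; the real jobs computed this in MapReduce and later Spark.

```python
from collections import defaultdict

# Hypothetical listcode records; field names are illustrative only.
records = [
    {"business": "AcmeCo",  "officer": "Lee",   "sales": 120},
    {"business": "AcmeCo",  "officer": "Patel", "sales": 300},
    {"business": "AcmeCo",  "officer": "Kim",   "sales": 90},
    {"business": "BetaInc", "officer": "Diaz",  "sales": 50},
]

def top_officers(rows, k=2):
    """Group rows by business and keep the top-k officers by sales,
    mirroring a 'Top Authorizing Officers' arbitrated column."""
    by_business = defaultdict(list)
    for r in rows:
        by_business[r["business"]].append(r)
    return {
        b: [r["officer"] for r in sorted(rs, key=lambda r: -r["sales"])[:k]]
        for b, rs in by_business.items()
    }

arbitrated = top_officers(records)
```

In Spark the same shape is a `groupBy` on the business key followed by a per-group sort and truncation.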
Confidential, Columbus, GA
- In-depth knowledge of Hadoop architecture and its components, such as HDFS, ApplicationMaster, NodeManager, ResourceManager, NameNode, DataNode, and MapReduce concepts.
- Involved in moving all log files generated from various sources to HDFS for further processing through Flume.
- Imported required tables from RDBMS to HDFS using Sqoop, and used Storm and Kafka to get real-time streaming of data into HBase.
- Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
- Good experience with the NoSQL database HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Wrote MapReduce code that takes log files as input and parses them into a tabular format to facilitate effective querying of the log data.
- Developed Java code to generate, compare, and merge Avro schema files.
- Developed complex MapReduce streaming jobs in Java, along with jobs implemented using Hive and Pig.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Used Hive optimization techniques during joins and best practices in writing Hive scripts using HiveQL.
- Imported and exported data into HDFS and Hive using Sqoop.
- Wrote Hive queries to extract the processed data.
- Developed a data pipeline using Flume, Sqoop, Pig, and MapReduce to ingest customer behavioral data and purchase histories into HDFS for analysis.
- Implemented Spark applications in Scala, utilizing Spark Core, Spark Streaming, and the Spark SQL API for faster processing of data than Java MapReduce.
- Used Spark SQL to load JSON data, create a schema RDD, and load it into Hive tables; handled structured data using Spark SQL.
- Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
- Created HBase tables to store data in variable formats coming from different legacy systems.
- Used Hive to do transformations, event joins, and some pre-aggregations before storing the data in HDFS.
- Good understanding of Cassandra architecture, replication strategy, gossip, snitches etc.
- Expert knowledge on MongoDB NoSQL data modelling, tuning, disaster recovery and backup.
Environment: Hadoop, HDFS, MapReduce, Hive, Python, Pig, Java, Oozie, HBase, Sqoop, Flume, MySQL.
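Loading JSON with Spark SQL, as described in this role, begins with schema inference over the records. The sketch below mimics that idea in plain Python: it unions field names and value types across JSON lines, the way Spark widens a JSON dataset's schema. Field names are made up for illustration.

```python
import json

def infer_schema(json_lines):
    """Union field names/types across records, loosely mimicking how
    Spark SQL infers a schema from a JSON dataset."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            # first-seen type wins in this simplified sketch;
            # Spark would promote to a common wider type instead
            schema.setdefault(field, type(value).__name__)
    return schema

lines = [
    '{"user": "u1", "clicks": 3}',
    '{"user": "u2", "clicks": 5, "referrer": "search"}',
]
schema = infer_schema(lines)
```

Fields missing from some records (like `referrer` above) simply become nullable columns in the inferred schema.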
Confidential, Albany, NY
- Installed and configured fully distributed Hadoop cluster.
- Performed Hadoop cluster environment administration, including adding and removing cluster nodes, cluster capacity planning, performance tuning, cluster monitoring, and troubleshooting.
- Extensively used Cloudera Manager to manage the Hadoop cluster.
- Configured the Hive Metastore, which stores the metadata for Hive tables and partitions in a relational database.
- Responsible for developing a data pipeline using HDInsight, Flume, Sqoop, and Pig to extract the data from weblogs and store it in HDFS.
- Configured Flume for efficiently collecting, aggregating and moving large amounts of log data.
- Used Oozie to automate/schedule business workflows which invoke Sqoop, MapReduce & Pig jobs as per the requirements.
- Developed Sqoop scripts to import and export the data from relational sources.
- Worked with various HDFS file formats like Avro, Sequence File and various compression formats like Snappy, bzip2.
- Developed efficient MapReduce programs for filtering out the unstructured data.
- Developed Pig UDFs to pre-process the data for analysis.
- Developed Hive queries for data sampling and analysis for the analysts.
- Loaded data into the cluster from dynamically generated files using Flume and from relational database management systems using Sqoop.
- Developed custom UNIX SHELL scripts to do pre and post validations of master and slave nodes, before and after configuring the name node and data nodes respectively.
- Involved in HDFS maintenance and administering it through Hadoop-Java API.
- Supported MapReduce programs running on the cluster.
- Identified several PL/SQL batch applications in General Ledger processing and conducted performance comparison to demonstrate the benefits of migrating to Hadoop.
- Involved in implementing several POC's that demonstrate the advantages Businesses gain by migrating to Hadoop.
Environment: Red Hat Linux 5, MS SQL Server, MongoDB, Oracle, Hadoop CDH 3/4/5, Pig, Hive, ZooKeeper, HDFS, HBase, Sqoop, Python, Java, Oozie, Hue, Tez, UNIX Shell Scripting, PL/SQL, Maven, Ant
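The Sqoop import scripts mentioned in this role typically assemble a command line from documented flags (`--connect`, `--username`, `--table`, `--target-dir`, `--num-mappers`). A hedged sketch of such a script generator, with placeholder connection values, might look like this:

```python
def build_sqoop_import(jdbc_url, table, target_dir, username, num_mappers=4):
    """Assemble a `sqoop import` command using common, documented flags.
    All connection values passed in here are placeholders."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]

# Placeholder host/database/table names for illustration only.
cmd = build_sqoop_import(
    "jdbc:mysql://dbhost:3306/sales", "orders", "/data/raw/orders", "etl_user")
```

Returning the command as an argument list (rather than one string) keeps it safe to hand to `subprocess.run` without shell quoting issues.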
- Worked on analyzing the Hadoop cluster using different big data analytic tools, including Pig, Hive, and MapReduce.
- Collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
- Worked on debugging, performance tuning of Hive & Pig Jobs.
- Created HBase tables to store various data formats of data coming from different portfolios.
- Implemented test scripts to support test driven development and continuous integration.
- Worked on tuning the performance of Pig queries.
- Provided cluster coordination services through ZooKeeper.
- Experience in managing development time, bug tracking, project releases, development speed, release forecasting, and scheduling.
- Involved in loading data from LINUX file system to HDFS.
- Importing and exporting data into HDFS and Hive using Sqoop.
- Developed Java program to extract the values from XML using XPaths.
- Experience working on processing unstructured data using Pig and Hive.
- Supported MapReduce programs running on the cluster.
- Gained experience in managing and reviewing Hadoop log files.
- End-to-end performance tuning of Hadoop clusters and Hadoop Map/Reduce routines against very large data sets.
- Involved in scheduling Oozie workflow engine to run multiple Hive and pig jobs.
- Assisted in monitoring Hadoop cluster using tools like Cloudera Manager.
- Experience in optimizing MapReduce algorithms using combiners and partitioners to deliver the best results, and worked on application performance optimization for HDFS.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
Environment: Hadoop (CDH4), Map-Reduce, HBase, Hive, Sqoop, Oozie.
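The combiner optimization cited in this role cuts shuffle volume by pre-aggregating map output on the map side before it crosses the network. The toy example below shows the effect on a word-count workload; it is a local simulation of the idea, not Hadoop API code.

```python
from collections import Counter

def map_phase(lines):
    """Raw map output: one (word, 1) pair per token."""
    return [(w, 1) for line in lines for w in line.split()]

def combine(pairs):
    """Combiner: pre-sum pairs locally so fewer records are shuffled.
    Valid here because summation is associative and commutative."""
    return list(Counter(w for w, _ in pairs).items())

lines = ["a a b", "a b c"]
raw = map_phase(lines)    # without a combiner, all 6 pairs are shuffled
combined = combine(raw)   # with one, only 3 partial sums are shuffled
```

The reducer receives the partial sums and adds them, producing the same totals with less network traffic.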
- Used JSP pages through a Servlet controller for the client-side view.
- Always used Java/J2EE best practices to minimize unnecessary object creation.
- Implemented RESTful web services with the Struts framework.
- Experience in providing Logging, Error handling by using Event Handler, and Custom Logging for SSIS Packages.
- Resolved product complications at customer sites and funnelled the insights to the development and deployment teams to adopt long term product development strategy with minimal roadblocks.
- Verified them with the JUnit testing framework.
- Working experience using an Oracle 10g backend database.
- Used JMS Queues to develop Internal Messaging System.
- Developed the UML use case, activity, sequence, and class diagrams using Rational Rose.
- Developed Java, JDBC, and JavaBeans using the JBuilder IDE.
- Developed JSP pages and Servlets for customer maintenance.
- Apache Tomcat Server was used to deploy the application.
- Involved in Building the modules in Linux environment with ant script.
- Used Resource Manager to schedule the job on the UNIX server.
- Performed Unit testing, Integration testing for all the modules of the system.
- Developed JavaBean components utilizing AWT and Swing classes.
- Reverse engineered legacy systems, analyzed and documented various existing workflows.
- Collaborated with business teams to consolidate various new requirements.
- Assisted in creating various design documents like class diagrams and sequence diagrams. Used Confluence for creating design documents.
- Part of designing a modular application based on a microservices architecture.
- Implemented various backend modules that collaborate with each other using RESTful web services.
- Designed RESTful URLs for various modules and implemented the corresponding endpoints using Spring MVC.
- Defined XSDs for various payloads and created JAXB objects from the XSDs.
- Defined DAO interfaces and added documentation to define contracts.
- Using the DAO design pattern, coded DAO implementations against the DAO contract interfaces.
- Following a test-first methodology, wrote unit test cases for the DAOs using an in-memory database (H2).
- Implemented JMS components, sender and receiver, to achieve asynchronous communication and high throughput between sub-modules.
- As part of legacy system maintenance, fixed bugs, made enhancements, added new service APIs, and added new features to the UI.
- Deployed builds on Apache Tomcat.
- Used Eclipse as the IDE, Maven for build management, Jira for issue tracking, Confluence for documentation, Git for version control, ARC (Advanced Rest Client) for endpoint testing, Crucible for code review, and SQL Developer as the DB client.
- Used IBM MQ Series as the JMS provider.
- Responsible for writing SQL Queries and Procedures using DB2.
- Implemented connections to Oracle and MySQL databases using Hibernate ORM. Configured Hibernate and entities using annotations from scratch.
Environment: Core Java, Spring MVC, Spring Security, Spring JMS, Spring JDBC template, XML, Log4j, Apache Tomcat, ActiveMQ, HTML, CSS, Bootstrap, JavaScript, Jira, Confluence.
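The DAO pattern used in this role separates the contract interface from the datastore-backed implementation, which is what makes the H2 in-memory testing above possible: tests swap in a lightweight store behind the same interface. Sketched here in Python for brevity, with illustrative names (the project's actual DAOs were Java/JDBC).

```python
from abc import ABC, abstractmethod

class UserDao(ABC):
    """DAO contract: callers depend on this interface, never on a
    concrete datastore."""
    @abstractmethod
    def save(self, user_id, name): ...
    @abstractmethod
    def find(self, user_id): ...

class InMemoryUserDao(UserDao):
    """Test double standing in for the real JDBC/H2-backed DAO."""
    def __init__(self):
        self._rows = {}

    def save(self, user_id, name):
        self._rows[user_id] = name

    def find(self, user_id):
        return self._rows.get(user_id)  # None when the row is absent

dao = InMemoryUserDao()
dao.save(1, "alice")
found = dao.find(1)
```

Service-layer code written against `UserDao` runs unchanged whether it is wired to the in-memory double in tests or the database-backed implementation in production.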