- Currently working in a Big Data capacity with the Hadoop ecosystem across internal and cloud-based platforms.
- Over 8 years of experience as a Big Data/Hadoop developer with skills in analysis, design, development, testing and deployment of various software applications.
- Experience in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
- Good knowledge of Hibernate for mapping Java classes to database tables and of Hibernate Query Language (HQL).
- Hands-on experience configuring and working with Flume to load data from multiple sources directly into HDFS.
- Experience developing custom UDFs for Pig and Apache Hive to incorporate Java methods and functionality into Pig Latin and HiveQL.
- Good experience developing MapReduce jobs in J2EE/Java for data cleansing, transformation, pre-processing and analysis.
- Good knowledge of Amazon Web Services (AWS) concepts such as the EMR and EC2 web services, which provide fast and efficient processing for Teradata Big Data analytics.
- Experience collecting log data and JSON data into HDFS using Flume and processing the data using Hive/Pig.
- Experience working with cloud platforms, setting up environments and applications on AWS, and automating code and infrastructure (DevOps) using Chef and Jenkins.
- Extensive experience developing Spark Streaming jobs built on RDDs (Resilient Distributed Datasets), using Spark SQL as required.
- Experience developing Java MapReduce jobs for data cleaning and data manipulation as required for the business.
- Strong knowledge of the Hadoop ecosystem, including HDFS, Hive, Oozie, HBase, Pig, Sqoop, Zookeeper, etc.
- Extensive experience with advanced J2EE frameworks such as Spring, Struts, JSF and Hibernate.
- Installation, configuration and administration experience with Big Data platform tools: Cloudera Manager (Cloudera) and MCS (MapR).
- Extensive experience working with Oracle, MS SQL Server, DB2 and MySQL.
- Experience working with Hortonworks and Cloudera environments.
- Good knowledge of implementing various data processing techniques using Apache HBase for handling data and formatting it as required.
- Excellent experience installing and running various Oozie workflows and automating parallel job executions.
- Experience with Spark, including Spark SQL, Spark Streaming, Spark GraphX and Spark MLlib.
- Extensive development experience in IDEs such as Eclipse, NetBeans, IntelliJ and STS.
- Strong experience in core SQL and RESTful web services (RWS).
- Strong knowledge of NoSQL column-oriented databases such as HBase and their integration with Hadoop clusters.
- Good experience using Tableau for data visualization and analysis of large datasets, drawing various conclusions.
- Good knowledge of coding using SQL, SQLPlus, T-SQL, PL/SQL, Stored Procedures/Functions.
- Worked with Bootstrap, AngularJS, Node.js, Knockout, Ember and the Java Persistence API (JPA).
- Well versed in relational database management systems such as Oracle 12c, MS SQL and MySQL Server.
- Experience with all stages of the SDLC and the Agile development model, from requirements gathering through deployment and production support.
- Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.
Hadoop/Big Data Technologies: Hadoop 2.7/2.5, HDFS, MapReduce, HBase 1.2.4, Pig, Hive 2.0, Hue, Sqoop, Spark 2.0/2.0.2, Impala, Oozie, YARN, Flume 1.7, Kafka, Zookeeper
Hadoop Distributions: Cloudera 5.9, Hortonworks, MapR
Programming Language: Java, Scala, Python 3.5, SQL, PL/SQL, Shell Scripting, Storm, JSP, Servlets
Frameworks: Spring 4.3, Hibernate, Struts, JSF, EJB, JMS
Databases: Oracle 12c/11g, SQL Server 2016/2014, MySQL 5.7/5.4.16
Database Tools: TOAD, SQL PLUS, SQLite 3.15/3.15.2
Operating Systems: Linux, Unix, Windows 8/7
IDE and Tools: Eclipse 4.6, NetBeans 8.2, IntelliJ, Maven
NoSQL Databases: HBase, Cassandra, MongoDB, Accumulo
Web/Application Server: Apache Tomcat, JBoss, WebLogic, WebSphere
SDLC Methodologies: Agile, Waterfall
Version Control: Git, SVN, CVS
Confidential, New Jersey
Big Data Developer
- Worked as a Big Data Developer on Hadoop ecosystem components including Hive, HBase, Oozie, YARN, Spark Streaming and MCS (MapR Control System) on the MapR distribution.
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Primarily involved in the data migration process using Azure, integrating with a Bitbucket repository and Jenkins.
- Built code for real-time data ingestion using Java and MapR-Streams (Kafka).
- Involved in various phases of development; analyzed and developed the system following the Agile Scrum methodology.
- Involved in the development of the Hadoop system and in improving multi-node Hadoop cluster performance.
- Analyzed the Hadoop stack and various Big Data tools, including Pig, Hive, the HBase database and Sqoop.
- Developed a data pipeline using Flume, Sqoop and Pig to extract data from weblogs and store it in HDFS.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Worked with different data sources such as Avro data files, XML files, JSON files, SQL Server and Oracle to load data into Hive tables.
- Used J2EE design patterns such as the Factory and Singleton patterns.
- Used Spark to create structured data from large amounts of unstructured data from various sources.
- Implemented Amazon EMR for processing Big Data across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
- Performed transformations, cleaning and filtering on imported data using Hive, MapReduce and Impala, and loaded the final data into HDFS.
- Developed Python scripts to find vulnerabilities in SQL queries through SQL injection testing.
- Designed and developed POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
- Responsible for coding MapReduce programs and Hive queries, and for testing and debugging the MapReduce programs.
- Extracted a real-time feed using Spark Streaming, converted it to RDDs, processed the data into DataFrames and loaded it into Cassandra.
- Involved in data acquisition, data pre-processing and data exploration for a telecommunications project in Scala.
- Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
- Specified the cluster size, resource pool allocation and Hadoop distribution by writing the specifications in JSON file format.
- Imported weblogs and unstructured data using Apache Flume and stored the data in a Flume channel.
- Exported event weblogs to HDFS by creating an HDFS sink that deposits the weblogs directly in HDFS.
- Used RESTful web services with MVC for parsing and processing XML data.
- Utilized XML and XSL Transformation for dynamic web-content and database connectivity.
- Involved in loading data from the UNIX file system into HDFS. Involved in designing schemas, writing CQL queries and loading data using Cassandra.
- Built the automated build and deployment framework using Jenkins, Maven, etc.
Confidential, Atlanta, GA
Big Data/Hadoop Developer
- Involved in analysis, design, system architecture design, process interface design and design documentation.
- Responsible for developing prototypes of the selected solutions and implementing complex big data projects with a focus on collecting, parsing, managing, analysing and visualizing large sets of data using multiple platforms.
- Understood how to apply technologies to solve big data problems and to develop innovative big data solutions.
- Developed Spark applications using Scala and Java, and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.
- Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which gets data from Kafka in near real time and persists it to Cassandra.
- Responsible for analysing and cleansing raw data by running Hive queries and Pig scripts on the data.
- Developed Pig Latin scripts to extract data from the web server output files and load it into HDFS.
- Developed simple to complex MapReduce jobs using Hive and Pig.
- Imported data from various sources into the Cassandra cluster using Sqoop. Created data models for Cassandra from the existing Oracle data model.
- Used the Spark-Cassandra connector to load data to and from Cassandra.
- Worked in Spark and Scala for data analytics. Handled the ETL framework in Spark for writing data from HDFS to Hive.
- Used a framework written in Scala for ETL.
- Developed multiple Spark Streaming and Spark Core jobs with Kafka as the data pipeline system.
- Worked with and learned a great deal from AWS cloud services such as EC2, S3 and EBS.
- Migrated an existing on-premises application to AWS. Used AWS services such as EC2 and S3 for processing and storing small data sets.
- Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
- Extensively used Zookeeper as a job scheduler for Spark jobs.
- Worked on Talend with Hadoop. Worked on migrating jobs from Informatica to Talend.
- Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
- Developed Kafka producer and consumer components for real-time data processing.
- Worked on physical transformations of the data model, which involved creating tables, indexes, joins, views and partitions.
- Involved in Cassandra data modeling to create keyspaces and tables in a multi-data-centre DSE Cassandra DB.
Environment: Spark, HDFS, Kafka, MapReduce (MR1), Pig, Hive, Sqoop, Cassandra, AWS, Talend, Java, Linux Shell Scripting.
Confidential, Chicago, IL
Big Data/Hadoop Developer
- Performed data transformations such as filtering, sorting and aggregation using Pig.
- Created Sqoop jobs to import data from SQL, Oracle and Teradata into HDFS.
- Created Hive tables to push the data to MongoDB.
- Wrote complex aggregate queries in MongoDB for report generation.
- Developed scripts to run scheduled batch cycles using Oozie and present data for reports.
- Worked on a POC for building a movie recommendation engine based on Fandango ticket sales data using Scala and Spark Machine Learning library.
- Developed a big data ingestion framework to process multi-terabyte data, including data quality checks and transformation, stored the results in efficient storage formats such as Parquet and loaded them into Amazon S3 using the Spark Scala API.
- Implemented automation, traceability and transparency for every step of the process to build trust in data and streamline data science efforts using Python, Java, Hadoop Streaming, Apache Spark, Spark SQL, Scala, Hive and Pig.
- Designed a highly efficient data model for optimizing large-scale queries, utilizing Hive complex data types and the Parquet file format.
- Performed data validation and transformation using Python and Hadoop Streaming.
- Developed highly efficient Pig Java UDFs utilizing advanced concepts such as the Algebraic and Accumulator interfaces to populate ADP Benchmarks cube metrics.
- Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and then into partitioned Hive tables.
- Developed Bash scripts to fetch the TLOG file from the FTP server and process it for loading into Hive tables.
- Automated workflows using shell scripts and Control-M jobs to pull data from various databases into the Hadoop data lake.
- Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.
- Overwrote the Hive data with HBase data daily (insert overwrite) to get fresh data every day, and used Sqoop to load data from DB2 into the HBase environment.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala; gained good experience using the Spark shell and Spark Streaming.
- Designed, developed and maintained Big Data streaming and batch applications using Storm.
- Created Hive, Phoenix, HBase tables and HBase integrated Hive tables as per the design using ORC file format and Snappy compression.
- Developed Oozie workflows for daily incremental loads, which get data from Teradata and import it into Hive tables.
- Created Sqoop jobs and Pig and Hive scripts for data ingestion from relational databases to compare with historical data.
- Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
- Developed Pig scripts to transform the data into a structured format, automated through Oozie coordinators.
- Used Splunk to capture, index and correlate real-time data in a searchable repository from which reports and alerts can be generated.
Environment: Hadoop, HDFS, Spark, Storm, Kafka, MapReduce, Hive, Pig, Sqoop, Oozie, DB2, Java, Python, Splunk, UNIX Shell Scripting.
Confidential, Pittsburgh, PA
- Worked on SparkSQL to handle structured data in Hive.
- Involved in creating Hive tables, loading data, writing Hive queries, and generating partitions and buckets for optimization.
- Involved in migrating tables from RDBMS into Hive tables using Sqoop and later generated visualizations using Tableau.
- Worked on a complex MapReduce program to analyze data that exists on the cluster.
- Analyzed substantial data sets by running Hive queries and Pig scripts.
- Wrote Hive UDFs to sort struct fields and return complex data types.
- Worked in AWS environment for development and deployment of custom Hadoop applications.
- Involved in creating shell scripts to simplify the execution of all other scripts (Pig, Hive, Sqoop, Impala and MapReduce) and to move data into and out of HDFS.
- Created files and tuned SQL queries in Hive using HUE.
- Involved in collecting and aggregating large amounts of log data using Storm and staging data in HDFS for further analysis.
- Created Hive external tables using the Accumulo connector.
- Managed real-time data processing and real-time data ingestion into MongoDB and Hive using Storm.
- Created custom SOLR query segments to optimize search matching.
- Developed Spark scripts using Python shell commands.
- Stored the processed results in the data warehouse and maintained the data using Hive.
- Worked with the Spark ecosystem, using Spark SQL and Scala queries on formats such as text files and CSV files.
- Created Oozie workflow and Coordinator jobs to kick off the jobs on time for data availability.
- Worked with NoSQL databases such as MongoDB, creating MongoDB tables to load large sets of semi-structured data.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
- Worked with and learned a great deal from Amazon Web Services (AWS) cloud services such as EC2, S3 and EMR.
Environment: HDFS, MapReduce, Storm, Hive, Pig, Sqoop, MongoDB, Apache Spark, Python, Accumulo, Oozie Scheduler, Kerberos, AWS, Tableau, Java, UNIX Shell scripts, HUE, SOLR, GIT, Maven.