- 9+ years of professional Java development experience, including extensive experience with the Big Data ecosystem (Hadoop) and data engineering technologies.
- Experienced in installation, configuration, management, and deployment of Hadoop clusters, HDFS, MapReduce, Pig, Hive, Sqoop, Flume, Oozie, NiFi, HBase, and ZooKeeper.
- Expertise in importing data from various data sources, performing transformations, and hands-on development and debugging of YARN (MR2) jobs to process large data sets.
- Very good knowledge of and experience with Amazon Web Services (AWS) offerings such as EMR and EC2, which provide fast, efficient processing for big data analytics.
- Experienced in extending Pig and Hive core functionality by writing custom UDFs for data analysis, data transformation, file processing, and identifying user behavior by running Pig Latin scripts. Expertise in creating Hive internal/external tables and views using a shared metastore and writing scripts in HiveQL; developed Hive queries that help visualize business requirements.
- Excellent experience importing and exporting data between Teradata/RDBMS/mainframe and HDFS using Sqoop. Also worked on incremental imports by creating Sqoop metastore jobs.
- Experienced in using Apache Flume for collecting, aggregating, and moving large amounts of data from application servers, handling high-variety, high-velocity streaming data.
- Expertise in data development on the Hortonworks HDP platform and Hadoop ecosystem tools such as Hadoop, HDFS, Spark, Zeppelin, Spark MLlib, Hive, HBase, Sqoop, Flume, Atlas, Solr, Pig, Falcon, Oozie, Hue, Tez, Apache NiFi, and Kafka.
- Experienced in Extraction, Transformation, and Loading (ETL) of data from multiple sources such as flat files, XML files, and databases. Used Informatica for ETL processing based on business need, and extensively used the Oozie workflow engine to run multiple Hive and Pig jobs.
- Excellent understanding of ZooKeeper and Kafka for monitoring and managing Hadoop jobs; used Cloudera CDH 4.x and CDH 5.x for monitoring and managing Hadoop clusters.
- Experienced in working with HCatalog to share schemas across distributed applications; experienced in batch processing and in writing programs using Apache Spark for real-time analytics and real-time streaming of data.
- Experienced with NoSQL technologies such as HBase and Cassandra for data extraction and storing huge volumes of data. Also experienced with the data warehouse life cycle, methodologies, and tools for reporting and data analysis.
- Expertise in creating action filters, parameters, and calculated sets for preparing dashboards and worksheets in Tableau.
- Expertise in core Java, J2EE, multithreading, JDBC, Hibernate, Spring, and shell scripting, and proficient in using Java APIs for application development.
- Good exposure to Talend Open Studio for data integration.
- Extensive knowledge in developing ANT scripts to build and deploy applications, and experience with Maven for building and managing Java projects.
- Strong experience in developing enterprise and web applications on n-tier architectures using Java/J2EE technologies such as Servlets, JSP, Spring, Hibernate, Struts, EJBs, Web Services, XML, JPA, JMS, JNDI, and JDBC.
- Experienced in creating use case models and use case, class, and sequence diagrams using Microsoft Visio and Rational Rose. Experience in design and development of systems based on object-oriented analysis and design (OOAD) using Rational Rose.
- Experienced with RDBMS concepts; worked with Oracle 10g/11g and SQL Server, with good experience writing stored procedures, functions, and triggers using PL/SQL.
- Major strengths: familiarity with multiple software systems; the ability to quickly learn new technologies and adapt to new environments; a self-motivated, focused, and adaptive team player and quick learner with excellent interpersonal, technical, and communication skills.
- Strong oral and written communication, initiative, interpersonal, learning, and organizational skills, matched with the ability to manage time and people effectively.
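The Sqoop incremental-import pattern mentioned above (metastore jobs that remember the last imported value, as with `--incremental append --check-column id --last-value N`) can be sketched in plain Python against an in-memory SQLite table; the `orders` table and its columns are hypothetical stand-ins, not Sqoop itself:

```python
import sqlite3

def incremental_import(conn, last_value):
    """Pull only rows whose id exceeds the last imported value,
    mimicking Sqoop's --incremental append --check-column id."""
    rows = conn.execute(
        "SELECT id, payload FROM orders WHERE id > ? ORDER BY id",
        (last_value,),
    ).fetchall()
    # The new high-water mark plays the role of Sqoop's saved --last-value.
    new_last = rows[-1][0] if rows else last_value
    return rows, new_last

# Demo: seed a table and run two incremental passes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])

batch, last = incremental_import(conn, 0)      # first run imports all rows
conn.execute("INSERT INTO orders VALUES (4, 'd')")
batch2, last = incremental_import(conn, last)  # second run imports only id=4
```

A Sqoop metastore job persists that high-water mark between runs, which is what makes scheduled incremental imports idempotent.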
Hadoop Core Services: HDFS, MapReduce, Spark, YARN.
Hadoop Distributions: Hortonworks, Cloudera, Apache.
NoSQL Databases: HBase, Cassandra, MongoDB.
Hadoop Data Services: Hive, Pig, Impala, Sqoop, Flume, Spark, NiFi, Storm, and Kafka.
Hadoop Operational Services: ZooKeeper, Oozie.
Cloud Computing Tools: Amazon AWS, EC2, S3, EMR.
Languages: Java, Python, Scala, SQL, PL/SQL, Pig Latin, HiveQL, JavaScript, UNIX shell scripting.
Java & J2EE Technologies: Core Java, Servlets, Hibernate, Spring, Struts, JMS, EJB.
Application Servers: WebLogic, WebSphere, Tomcat.
Databases: Oracle, MySQL, Microsoft SQL Server, Teradata.
Business Intelligence Tools: Tableau, Talend (ETL).
Operating Systems: UNIX, Linux, Windows.
Build Tools: Jenkins, Maven, ANT.
Development Tools: Microsoft SQL Studio, Toad, Eclipse, NetBeans.
Development Methodologies: Agile/Scrum, Waterfall.
Confidential, Chicago, IL
Sr. Big Data Architect
- Installed and set up Hadoop CDH clusters for development and production environments; installed and configured Hive, Pig, Sqoop, Flume, Cloudera Manager, and Oozie on the Hadoop cluster.
- Planned production cluster hardware and software installation and communicated with multiple teams to get it done.
- Migrated data from Hadoop to AWS S3 buckets using DistCp; also migrated data between old and new clusters using DistCp.
- Developed multiple MapReduce jobs in Java for data cleaning and pre-processing, and analyzed data in Pig.
- Monitored multiple Hadoop cluster environments using Cloudera Manager; monitored workload and job performance and collected metrics for the Hadoop cluster when required.
- Worked on setting up and configuring AWS EMR clusters, used Amazon IAM to grant fine-grained access to AWS resources, and implemented solutions using services such as EC2, S3, RDS, Redshift, and VPC.
- Installed Hadoop patches, updates, and version upgrades when required.
- Installed and configured Cloudera Manager, Hive, Pig, Sqoop, and Oozie on the CDH4 cluster.
- Involved in implementing High Availability and automatic failover infrastructure to overcome the NameNode single point of failure, utilizing ZooKeeper services.
- Responsible for writing Hive queries for analyzing data in the Hive warehouse using Hive Query Language (HQL), along with Hive UDFs in Python.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
- Involved in installing EMR clusters on AWS and used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in Amazon S3 bucket.
- Implemented Apache NiFi flow topologies to perform cleansing operations before moving data into HDFS.
- Identified query duplication, complexity, and dependencies to minimize migration efforts. Technology stack: Oracle, Hortonworks HDP cluster, Attunity Visibility, Cloudera Navigator Optimizer, AWS Cloud, and DynamoDB.
- Performed an upgrade in the development environment from CDH 4.2 to CDH 4.6.
- Designed and developed ETL workflows using Oozie for business requirements, including automating the extraction of data from a MySQL database into HDFS using Sqoop scripts.
- Extensive experience in Spark Streaming through the core Spark API, running Scala and Java to transform raw data from several data sources into baseline data.
- Designed and created the complete end-to-end ETL process using Talend jobs, and created test cases for validating data in the data marts and the data warehouse. Captured data daily from OLTP systems and various XML, Excel, and CSV sources and loaded it using Talend ETL tools.
- Implemented a Python-based distributed random forest via Python streaming for machine learning and predictive analytics in Hadoop on AWS.
- Automated the end-to-end workflow from data preparation to the presentation layer for the Artist Dashboard project using shell scripting.
- Designed the schema, configured, and deployed AWS Redshift for optimal storage and fast retrieval of data; used Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib libraries.
- Developed MapReduce programs to extract and transform data sets and load the results into Cassandra; used the Kafka messaging system to move data between Cassandra and HDFS.
- Involved in querying data using Spark SQL on top of the Spark engine, and in managing and monitoring the Hadoop cluster using Cloudera Manager.
- Developed a data flow to pull data from a REST API using Apache NiFi with context configuration enabled, and used NiFi to provide real-time control over the movement of data between source and destination.
- Conducted RCA to find data issues and resolve production problems, and was proactively involved in ongoing maintenance, support, and improvements in the Hadoop cluster.
- Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near-real-time analysis.
- Performed data analytics in Hive and exported the metrics back to an Oracle database using Sqoop; used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Involved in designing and architecting data warehouses and data lakes on regular (Oracle, SQL Server), high-performance (Netezza and Teradata), and big data (Hadoop: MongoDB, Hive, Cassandra, and HBase) databases.
- Collaborating with business users/product owners/developers to contribute to the analysis of functional requirements.
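The "Hive UDFs in Python" bullet above typically means Hive's streaming interface (`SELECT TRANSFORM (...) USING 'python clean.py'`), where Hive pipes tab-separated rows through a script's stdin/stdout. A minimal sketch of such a row-cleaning script; the three-column layout (user id, email, amount) is a hypothetical example, not the original schema:

```python
def clean_row(line):
    """Normalize one tab-separated Hive row: trim fields,
    lowercase the email column, drop rows missing a user id."""
    user_id, email, amount = line.rstrip("\n").split("\t")
    if not user_id.strip():
        return None  # emitting nothing filters the row out
    return "\t".join([user_id.strip(), email.strip().lower(), amount.strip()])

# In production the script would loop over sys.stdin and print each
# cleaned row; here we feed it sample rows as Hive would stream them.
rows = ["42\tFoo@Bar.COM \t9.99\n", " \tmissing@id.com\t1.00\n"]
cleaned = [c for c in (clean_row(r) for r in rows) if c is not None]
```

Because the script only sees text on stdin, the same file works unchanged under Hive TRANSFORM and under Hadoop streaming as a mapper.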
Environment: Cloudera Hadoop, Talend Open Studio, MapReduce, Python, PySpark, HDFS, NiFi, Hive, Pig, Sqoop, Oozie, Flume, ZooKeeper, LDAP, MongoDB, HBase, Cassandra, Spark, Scala, AWS EMR, S3, Kafka, SQL, PL/SQL, Java, Tableau, XML, RDBMS.
Confidential, Stamford, CT
Sr. Big Data Architect
- Involved in installing and configuring the Hadoop ecosystem and Cloudera Manager using CDH3 and CDH4 distributions.
- Involved in the start-to-end process of Hadoop jobs using various technologies such as Sqoop, Pig, Hive, MapReduce, Spark, and shell scripts (for scheduling of a few jobs); extracted and loaded data into a data lake environment (Amazon S3) using Sqoop, where it was accessed by business users and data scientists.
- Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and unstructured data.
- Utilized Apache Spark with Python to develop and execute big data analytics and machine learning applications; executed machine learning use cases under Spark ML and MLlib.
- Configured and monitored a MongoDB cluster in AWS and established connections from Hadoop to MongoDB for data transfer.
- Used the Scala API for programming in Apache Spark and imported data from Teradata using Sqoop with the Teradata connector.
- Developed an export framework using Python, Sqoop, Oracle, and MySQL, and created a data pipeline of MapReduce programs using chained mappers.
- Implemented optimized joins on different data sets to get top claims by state using MapReduce.
- Visualized HDFS data for customers using a BI tool with the help of the Hive ODBC driver.
- Worked on a POC of Talend integration with Hadoop, creating Talend jobs to extract data from Hadoop.
- Installed Kafka on the Hadoop cluster and coded producers and consumers in Java to establish a connection from a Twitter source to HDFS.
- Involved in designing and deploying the Hadoop cluster and various big data analytics tools, including Pig, Hive, HBase, Oozie, ZooKeeper, Sqoop, Flume, Spark, Impala, and Cassandra, with the Hortonworks distribution.
- Imported data using Sqoop to load data from MySQL to HDFS on a regular basis.
- Worked on social media (Facebook, Twitter, etc.) data crawling using Java and the R language, with MongoDB for unstructured data storage.
- Optimized existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
- Implemented solutions for ingesting data from various sources and processing data at rest utilizing big data technologies such as Hadoop, MapReduce frameworks, HBase, Hive, Oozie, Flume, Sqoop, etc.
- Installed Hadoop, MapReduce, and HDFS on AWS, and developed multiple MapReduce jobs in Pig and Hive for data cleaning and pre-processing.
- Integrated the Quartz scheduler with Oozie workflows to get data from multiple data sources in parallel using fork.
- Created partitions and buckets based on state for further processing using bucket-based Hive joins.
- Experienced with different compression techniques such as LZO, GZip, and Snappy.
- Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in HDFS.
- Created Hive generic UDFs to process business logic that varies based on policy.
- Imported relational database data using Sqoop into Hive dynamic-partition tables via staging tables.
- Worked on custom Pig loaders and storage classes to work with a variety of data formats such as JSON and XML.
- Worked with teams in setting up AWS EC2 instances using different AWS services such as S3, EBS, Elastic Load Balancing, Auto Scaling groups, VPC subnets, and CloudWatch.
- Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
- Improved the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
- Imported data from different sources such as HDFS and HBase into Spark RDDs, and developed a data pipeline using Kafka and Storm to store data in HDFS.
- Used the Oozie workflow engine to manage interdependent Hadoop jobs and to automate several types of Hadoop jobs, such as Java MapReduce, Hive, Pig, and Sqoop.
- Used Spark as an ETL tool to remove duplicates, perform joins, and aggregate the input data before storing it in a blob.
- Developed code in Java that creates mappings in Elasticsearch before data is indexed into it.
- Experienced in monitoring the cluster using Cloudera Manager; developed unit test cases using JUnit, EasyMock, and the MRUnit testing framework.
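The Kafka bullets above (producer/consumer pairs moving records between a source and HDFS/Cassandra) reduce to a poll-transform-write loop. A sketch of that loop; the consumer and sink below are in-memory stand-ins for illustration, not the real kafka-python or Cassandra client APIs:

```python
import json
from collections import deque

class FakeConsumer:
    """In-memory stand-in for a Kafka consumer's poll loop."""
    def __init__(self, messages):
        self.queue = deque(messages)

    def poll(self, max_records=10):
        batch = []
        while self.queue and len(batch) < max_records:
            batch.append(self.queue.popleft())
        return batch

def drain_to_sink(consumer, sink):
    """Consume all pending messages and upsert them into the sink,
    keyed by user_id (mirrors an upsert by Cassandra partition key)."""
    while True:
        batch = consumer.poll()
        if not batch:
            break
        for msg in batch:
            record = json.loads(msg)
            sink[record["user_id"]] = record

consumer = FakeConsumer([
    '{"user_id": 1, "event": "click"}',
    '{"user_id": 2, "event": "view"}',
    '{"user_id": 1, "event": "buy"}',   # later event overwrites by key
])
sink = {}
drain_to_sink(consumer, sink)
```

With a real consumer the loop would also commit offsets after each successful write, so a crash replays at most the uncommitted batch.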
Environment: Hadoop, HDFS, HBase, Spark, MapReduce, Teradata, MySQL, Java, Python, Scala, Hive, Pig, Sqoop, Flume, Oozie, SQL, Cloudera Manager, MongoDB, Cassandra, AWS EMR, S3, EC2, RDBMS, XML, Elasticsearch, Kafka, Tableau, ETL.
Confidential, Nashville, TN
- Imported data from different relational data sources, such as RDBMS and Teradata, to HDFS using Sqoop.
- Imported bulk data into HBase using MapReduce programs and performed analytics on time-series data in HBase using the HBase API.
- Designed and implemented incremental imports into Hive tables and used the REST API to access HBase data to perform analytics.
- Installed, configured, and maintained Apache Hadoop clusters for application development, along with Hadoop tools such as Hive, Pig, HBase, Flume, Oozie, ZooKeeper, and Sqoop.
- Created a POC to store server log data in MongoDB to identify system alert metrics.
- Imported data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
- Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS.
- Worked on loading and transforming large sets of structured, semi-structured, and unstructured data.
- Involved in collecting, aggregating, and moving data from servers to HDFS using Apache Flume.
- Implemented end-to-end systems for data analytics and data automation, and integrated them with custom visualization tools using R, Hadoop, MongoDB, and Cassandra.
- Involved in installation and configuration of the Cloudera distribution of Hadoop: NameNode, Secondary NameNode, JobTracker, TaskTrackers, and DataNodes.
- Wrote Hive jobs to parse logs and structure them in tabular format to facilitate effective querying of the log data.
- Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, MLlib, and Python, along with a broad variety of machine learning methods, including classification, regression, dimensionality reduction, etc.
- Used an S3 bucket to store JARs and input data sets, and used DynamoDB to store the processed output.
- Worked with Cassandra for non-relational data storage and retrieval in enterprise use cases, and wrote MapReduce jobs using the Java API and Pig Latin.
- Improved the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, and Spark on YARN.
- Involved in creating a data lake by extracting customers' big data from various data sources into Hadoop HDFS. This included data from Excel, flat files, Oracle, SQL Server, MongoDB, Cassandra, HBase, Teradata, and Netezza, as well as log data from servers.
- Performed data synchronization between EC2 and S3, Hive stand-up, and AWS profiling.
- Created reports for the BI team, using Sqoop to export data into HDFS and Hive.
- Involved in creating Hive tables and loading data into dynamic-partition tables.
- Experienced in managing and reviewing the Hadoop log files.
- Migrated ETL jobs to Pig scripts to perform transformations, joins, and some pre-aggregations before storing the data in HDFS.
- Worked on NoSQL databases, including HBase and MongoDB. Configured a MySQL database to store Hive metadata.
- Deployed and tested the system on a Hadoop MapR cluster.
- Worked on different file formats, such as sequence files, XML files, and map files, using MapReduce programs.
- Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
- Imported data from RDBMS environments into HDFS using Sqoop for report generation and visualization using Tableau.
- Worked on the Oozie workflow engine for job scheduling.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Pig scripts.
Confidential, Little Rock, AR
- Involved in the design, development, and support phases of the Software Development Life Cycle (SDLC).
- Primarily responsible for design and development using Java, J2EE, XML, Oracle SQL, PL/SQL, and XSLT.
- Experienced in gathering data for requirements and use case development.
- Implemented DAOs using Spring JDBC support to interact with the RMA database; the Spring framework was used for transaction handling.
- Reviewed the functional, design, source code, and test specifications.
- Worked with Spring Core, Spring AOP, and the Spring Integration Framework with JDBC.
- Implemented backend configuration of the DAO and XML generation modules of DIS.
- Developed the persistence layer using Hibernate ORM to transparently store objects in the database.
- Implemented RESTful web services using Spring, supporting JSON data formats.
- Used JDBC for database access and applied the Data Transfer Object (DTO) design pattern.
- Performed unit testing and rigorous integration testing of the whole application.
- Implemented the user interface using the Struts2 MVC framework, the Struts tag library, HTML, CSS, and JSP.
- Wrote and executed test scripts using JUnit and was actively involved in system testing.
- Developed an XML parsing tool for regression testing.
- Worked on documentation that meets required compliance standards; also monitored end-to-end testing activities.
- Worked on Java Struts 1.0 synchronized with a SQL Server database to develop an internal application for ticket creation.
- Designed and developed the GUI using JSP, HTML, DHTML, and CSS.
- Mapped an internal tool to the ServiceNow ticket creation tool.
- Wrote the Hibernate configuration file and Hibernate mapping files, and defined persistence classes to persist the data into an Oracle database.
- Individually developed parser logic to decode the spawn file generated on the client side and to generate a ticket-based system per business requirements.
- Used XSL transforms on certain XML data.
- Developed an ANT script for compiling and deployment, and performed unit testing using JUnit.
- Built SQL queries for fetching the required data and columns from the production database.
- Implemented MVC architecture for front-end development using the Spring MVC framework.
- Implemented the user interface using HTML5, JSP, CSS3, and JavaScript/jQuery, and performed validations using JavaScript libraries.
- Used the Tomcat server for deployment.
- A modified Agile/Scrum methodology was used for development of this application.
- Used web services (SOAP) for transmission of large blocks of XML data over HTTP.
- Heavily involved in resolving database-related issues with stored procedures, triggers, and tables based on requirements.
- Prepared documentation and user guides to identify the various attributes and metrics needed by the business.
- Handled SVN version control as the code repository.
- Conducted Knowledge Transfer (KT) sessions for new recruits on the business value and technical functionality of the developed modules.
- Created a maintenance plan for the production database. Oracle Certified Java Programmer 6.
Environment: MS Windows 2000, OS/390, J2EE (JSP, Struts, Spring, Hibernate), RESTful, SOAP, SQL Server 2005, Eclipse, Tomcat 6, HTML, CSS, JSP, JSON, AJAX, JUnit, SQL, MySQL.