- 9+ years of experience in the IT industry, including big data environments, the Hadoop ecosystem, and the design, development and maintenance of various applications.
- Experienced in developing custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HiveQL (HQL).
- Good knowledge of and experience with Amazon Web Services (AWS) offerings such as EMR and EC2, which provide fast and efficient processing for Teradata Big Data Analytics.
- Expertise in Big Data architectures such as Hadoop distributed systems (Azure, Hortonworks, Cloudera) and NoSQL databases such as MongoDB.
- Experience writing build scripts with Maven and working with continuous integration (CI/CD) systems such as Jenkins.
- Experienced in installing, configuring, supporting and monitoring Hadoop clusters using Apache and Cloudera distributions and AWS.
- Good experience with Tableau for data visualization and analysis of large data sets; integrated Google Cloud Storage and BigQuery with Tableau to deliver web-based dashboards and reports for end users.
- Excellent hands-on experience with Hadoop/Big Data technologies for the storage, querying, processing and analysis of data.
- Expertise in synthesizing Machine learning, Predictive Analytics and Big data technologies into integrated solutions.
- Experienced in developing Big Data projects using open source tools such as Hadoop, Hive, HDP, Pig, Flume, Storm and MapReduce, and in installing, configuring, supporting and managing Hadoop clusters.
- Strong hands-on experience with AWS services, including but not limited to EMR, S3, EC2, Route 53, RDS, ELB, DynamoDB and CloudFormation.
- Hands-on experience with the Hadoop ecosystem and related big data technologies, including Spark, Kafka, HBase, Scala, Pig, Impala, Sqoop, Oozie, Flume and Storm.
- Worked with Spark SQL, Spark Streaming and the core Spark API to explore Spark features and build data pipelines.
- Experienced in data modeling and data analysis using dimensional and relational data modeling, star schema/snowflake modeling, fact and dimension tables, and physical and logical data modeling.
- Very good knowledge of and experience with Amazon Web Services (AWS) offerings such as EMR and EC2; successfully loaded files into HDFS from Oracle, SQL Server, Teradata and Netezza using Sqoop.
- Excellent understanding of Big Data infrastructure, distributed file systems (HDFS) and parallel processing (the MapReduce framework).
- Extensive knowledge of IDE tools such as MyEclipse, RAD, IntelliJ and NetBeans.
- Expert in Amazon EMR, Spark, Kinesis, S3, ECS, ElastiCache, DynamoDB and Redshift.
- Experience in installing, configuring, supporting and managing the Cloudera Hadoop platform, including CDH4 and CDH5 clusters.
- Experience working with different data sources such as flat files, XML files and databases.
- Experience in database design, entity relationships, database analysis and SQL programming, including writing PL/SQL stored procedures, functions, packages and triggers in Oracle.
Big Data Ecosystem: MapReduce, HDFS, Hive, Pig, Sqoop, Flume, HDP, Oozie, Zookeeper, Spark, Kafka, Storm, Hue
Hadoop Distributions: Cloudera (CDH3, CDH4, CDH5) and Hortonworks
SQL and NoSQL Databases: Oracle 12c/11g, MySQL, MS-SQL, Teradata, HBase, MongoDB, Cassandra.
Version Control: Git, GitLab, SVN
ETL and Data Modeling: Informatica, AWS Glue, Erwin and MS Visio.
Cloud Technologies: AWS S3, Redshift, EMR, EC2, REST APIs, MS Azure, Data Factory and Google Cloud
Java/J2EE Technologies: Servlets, JSP, JDBC, JSTL, EJB, JAXB, JAXP, JMS, JAX-RPC, JAX-WS
Programming Languages: Java, Python, SQL, PL/SQL, HiveQL, UNIX Shell Scripting, Scala.
Methodologies: Software Development Life Cycle (SDLC), Waterfall Model and Agile, Software Testing Life Cycle (STLC), UML, Design Patterns (Core Java and J2EE)
Operating Systems: Windows, UNIX/Linux and Mac OS.
Build Management Tools: Maven, Ant.
IDE & Command line tools: Eclipse, IntelliJ, Toad and Netbeans.
Sr. Big Data Architect
Confidential, New York, NY
- Implemented solutions for ingesting data from various sources and processing data at rest using Big Data technologies such as Hadoop, MapReduce frameworks, HBase and Hive.
- Managed and led the development effort with a diverse internal and overseas team; designed, architected and implemented highly complex projects dealing with considerable data volumes (GB to PB).
- Designed and deployed an AWS Hadoop cluster through the full SDLC based on the client's business needs; loaded and transformed large sets of structured, semi-structured and unstructured data from relational databases into HDFS using Sqoop imports.
- Designed AWS architecture and cloud migration using AWS EMR, DynamoDB and Redshift, with event processing using Lambda functions.
- Used Apache Spark with Python to develop and run Big Data analytics and machine learning applications; implemented machine learning use cases with Spark ML and MLlib.
- Performed data profiling and transformation on raw data using Pig, Python and Java, and developed predictive analytics using the Apache Spark Scala APIs.
- Implemented an enterprise-grade platform (MarkLogic) for ETL from mainframe to NoSQL (Cassandra); responsible for importing log files from various sources into HDFS using Flume.
- Analyzed data using HiveQL to generate per-payer reports from payment summaries for transmission to payers.
- Imported millions of structured records from relational databases using Sqoop import, processed them with Spark and stored the data in HDFS in CSV format.
- Implemented new statistical algorithms and operators on Hadoop and SQL platforms, using optimization techniques, linear regression, K-means clustering, Naive Bayes and other approaches.
- Used the DataFrame API in Scala to convert distributed collections of data into DataFrames organized into named columns.
- Worked on Teradata SQL queries, Teradata indexes and utilities such as MultiLoad, TPump, FastLoad and FastExport.
- Designed and developed a real-time stream processing application using Spark, Kafka, Scala and Hive to perform streaming ETL and apply machine learning.
- Explored Spark to improve the performance of existing Hadoop algorithms and optimize them using SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
- Explored DAGs, their dependencies and logs in Airflow pipelines for automation, and used Apache Airflow to schedule and run DAGs that execute code.
- Performed big data analysis using Pig and user-defined functions (UDFs); created Hive external tables, loaded data into them and queried the data using HQL.
- Implemented Spark GraphX application to analyze guest behavior for data science segments.
- Enhanced a traditional data warehouse based on a star schema, updated data models, and performed data analytics and reporting using Tableau.
- Involved in migrating data from existing RDBMSs (Oracle and SQL Server) to Hadoop using Sqoop for processing.
- Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
- Developed a prototype for Big Data analysis using Spark, RDDs, DataFrames and the Hadoop ecosystem with CSV, JSON, Parquet and HDFS files.
- Developed HiveQL scripts to perform transformation logic and load data from the staging zone to the landing zone and semantic zone.
- Maintained and worked with a data pipeline that transfers and processes several terabytes of data using Spark, Scala, Python, Apache Kafka, Pig/Hive and Impala.
- Created Oozie workflow and coordinator jobs to kick off Hive jobs on time for data availability; used the Oozie scheduler to automate the pipeline workflow and orchestrate the Sqoop, Hive and Pig jobs that extract data in a timely manner.
- Exported the generated results to Tableau for testing by connecting to the corresponding Hive tables using the Hive ODBC connector.
- Scheduled Airflow workflows to run multiple Hive and Pig jobs using Python.
- Developed Oozie workflows to automate loading data into HDFS and pre-processing it with Pig; translated high-level design specs into simple ETL coding and mapping standards.
- Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization and user report generation.
- Responsible for designing and deploying EDW application solutions, optimizing processes, and defining and implementing best practices.
Sr. Big Data/Hadoop Architect
Confidential, Dallas, TX
- Installed and configured a multi-node cluster in the cloud on Amazon Web Services (AWS) EC2, and used AWS Data Pipeline to schedule an Amazon EMR cluster to clean and process web server logs stored in an Amazon S3 bucket.
- Configured batch jobs to ingest source files into the data lake and developed Pig queries to load data into HBase.
- Evaluated deep learning algorithms for text summarization using Python, Keras and TensorFlow on a Cloudera Hadoop system.
- Developed Sqoop scripts to extract data from various RDBMS databases into HDFS, and developed Python and shell scripts to automate the workflow of various processes.
- Installed and configured Hadoop ecosystem components such as Hive, Oozie and Sqoop on a Cloudera Hadoop cluster, and helped with performance tuning and monitoring.
- Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
- Developed data pipelines using Pig and Hive from Teradata and DB2 data sources, with customized UDFs to extend ETL functionality; made extensive use of ETL methodology to support data extraction, transformation and loading using Hadoop.
- Created a data pipeline using processor groups and multiple processors in Apache NiFi for flat files and RDBMS sources as part of a POC on Amazon EC2.
- Developed Scala scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark 2.0.0 for data aggregation and queries, and wrote data back into an RDBMS through Sqoop.
- Wrote Hive join queries to fetch information from multiple tables and multiple MapReduce jobs to collect output from Hive; used Hive to analyze partitioned and bucketed data and compute various metrics for dashboard reporting.
- Developed MapReduce programs in Java and Python to parse raw data and store the refined data in Hive.
- Worked in AWS cloud and on-premise environments, including infrastructure provisioning and configuration.
- Wrote Perl scripts covering data feed handling, implementing MarkLogic, and communicating with web services through the SOAP::Lite module and WSDL.
- Used UDFs to implement business logic in Hadoop, using Hive to read, write and query Hadoop data in HBase.
- Used the Oozie workflow engine to run multiple Hive and Pig scripts, Kafka for real-time processing of data, and Flume to load log file data directly into HDFS for navigating data sets in HDFS storage.
- Used the Spark API for machine learning to translate a predictive model from SAS code to Spark, and used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
- Developed an end-to-end workflow to build a real-time dashboard using Kibana, Elasticsearch, Hive and Flume.
- Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.
- Developed Python MapReduce programs for log analysis and designed an algorithm for detecting fake reviews using Python.
- The objective of this project was to build a data lake as a cloud-based solution in AWS using Apache Spark and to provide visualization of the ETL orchestration using the CDAP tool.
- Extracted data from MySQL and AWS Redshift into HDFS using Sqoop, and worked with AWS to implement client-side encryption because DynamoDB did not support encryption at rest at the time.
- Implemented a proof of concept deploying this product in Amazon Web Services (AWS), across AWS cloud and on-premise environments with infrastructure provisioning and configuration.
- Involved in developing the MapReduce framework, writing queries and scheduling MapReduce jobs; developed code for importing and exporting data into HDFS and Hive using Sqoop.
- Used Oozie to design workflows and schedule various jobs in the Hadoop ecosystem.
- Developed MapReduce programs in Java to apply business rules to the data and optimized them using various compression formats and combiners.
- Used Spark SQL to create DataFrames by loading and analyzing JSON data, and developed Spark code using Scala and Spark SQL for faster testing and data processing.
- Installed and configured Hadoop; responsible for maintaining the cluster and for managing and reviewing Hadoop log files.
- Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
Environment: Pig, Sqoop, Kafka, Apache Cassandra, Oozie, Impala, Cloudera, AWS, AWS EMR, Redshift, Flume, Apache Hadoop, HDFS, Hive, MapReduce, Zookeeper, MySQL, Eclipse, DynamoDB, PL/SQL and Python.
Sr. Big Data/Hadoop Developer
Confidential - Albany, NY
- Built scalable distributed data solutions using Hadoop and designed the projects using MVC architecture, providing multiple views over the same model for efficient modularity and scalability.
- Designed, deployed, maintained and led the implementation of cloud solutions using Microsoft Azure and underlying technologies.
- Built custom Talend jobs to ingest and distribute data in the Cloudera Hadoop ecosystem, and improved the performance of existing Hadoop algorithms using SparkContext, Spark SQL and Spark on YARN with Scala.
- Worked extensively with Avro and Parquet files and converted data between the two formats; parsed semi-structured JSON data and converted it to Parquet using DataFrames in Spark.
- Implemented Spark Core in Scala to process data in memory and performed job functions using Spark APIs in Scala for real-time analysis and fast querying.
- Worked with the multiple teams responsible for the Azure platform to fix Azure platform bugs, and worked with container-based technologies such as Docker and Kubernetes.
- Enhanced and optimized product Spark code to aggregate, group and run data mining tasks using the Spark framework; used Pig, Hive and MapReduce to analyze data and extract data sets for meaningful information.
- Handled importing of data from various data sources, performed transformations using MapReduce, Spark and loaded data into HDFS.
- Developed Oozie workflows to orchestrate a series of Pig scripts that cleanse data, for example by merging many small files into a handful of very large, compressed files using Pig pipelines in the data preparation stage.
- Implemented OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse and wrote Azure PowerShell scripts to copy or move data from the local file system to HDFS Blob storage.
- Used Pig for three distinct workloads (pipelines, iterative processing and research), wrote Pig UDFs in Python and Java, and used sampling of large data sets.
- Used Pig extensively to communicate with Hive via HCatalog and with HBase via handlers; wrote Pig Latin and Sqoop scripts.
- Transferred data from legacy tables to HDFS and HBase tables using Sqoop, implemented exception tracking logic using Pig scripts, and moved log files generated from various sources to HDFS through Flume for further processing.
- Analyzed large data sets to determine the optimal way to aggregate and report on them; followed a Test-Driven Development (TDD) process, with extensive experience in Agile and Scrum methodologies.
- Implemented a POC to migrate MapReduce jobs to Spark RDD transformations using Scala and scheduled MapReduce jobs in the production environment using the Oozie scheduler.
- Involved in cluster maintenance, monitoring and troubleshooting; managed and reviewed data backups and log files; exported analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team.
- Built, configured and deployed web components on WebLogic application server for an application built on a Java financial platform that integrates several technologies such as Struts and Spring Web Flow.
- Used Spring Framework modules such as the Core Container, Application Context, Spring AOP, Spring ORM and Spring MVC modules.
- Developed the presentation layer using the Model-View-Controller architecture implemented by Spring MVC.
- Performed unit testing using JUnit and used SVN for version control to maintain the code repository.
Environment: Hadoop, MS Azure, MapReduce, Spark, Shark, Kafka, HDFS, Hive, Pig, Oozie, Core Java, Eclipse, HBase, Flume, Cloudera, Oracle 10g, UNIX Shell Scripting, Scala, MongoDB, Cassandra, Python.
Sr. Java/J2EE Developer
- Involved in full life cycle object-oriented application development: object modeling, database mapping and GUI design.
- Developed the J2EE application based on Service-Oriented Architecture and used design patterns such as Singleton, Factory, Session Facade and DAO.
- Developed using new Java features such as annotations, generics, the enhanced for loop and enums.
- Developed use case diagrams, class diagrams and sequence diagrams to express the detailed design.
- Worked with EJB (Session and Entity) to implement the business logic to handle various interactions with the database.
- Skilled in using collections in Python for manipulating and looping through different user defined objects.
- Implemented a high-performance, highly modular, load-balancing broker in C with ZeroMQ and Redis.
- Used Spring and Hibernate to implement IoC, AOP and ORM for back-end tiers; created and injected Spring services, controllers and DAOs to achieve dependency injection and wire business class objects.
- Developed a fully automated continuous integration system using Git, Jenkins, MySQL and custom tools developed in Python and Bash.
- Part of a team implementing REST APIs in Python using a micro-framework like Flask with
- Used Spring bean inheritance to derive beans from existing parent beans and used the DAO pattern to fetch data from the database using Hibernate to carry out various database operations.
- Used the SOAP::Lite module to communicate with different web services based on the given WSDL.
- Evaluated and compared different tools for test data management with Hadoop.
- Helped and directed the testing team to get up to speed on Hadoop application testing; used Hibernate transaction management, Hibernate batch transactions and caching concepts.
- Modified the Spring controller and service classes to support the introduction of the Spring framework.
- Used the Log4j utility to generate run-time logs, wrote SAX and DOM XML parsers, and used SOAP for sending and receiving data from the external interface.
- Deployed business components to WebSphere Application Server and developed the Functional Requirement Document based on users' requirements.
Environment: Core Java, J2EE, JDK 1.6, Python, Spring 3.0, Hibernate 3.2, Tiles, AJAX, JSP 2.1, Eclipse 3.6, IBM WebSphere 7.0, XML, XSLT, SAX, DOM Parser, HTML, UML, Oracle 10g, PL/SQL, JUnit.
- Implemented Spring MVC architecture and Spring BeanFactory using IoC and AOP concepts.
- Gathered the requirements and designed the application flow for the application.
- Developed Struts Action classes for the system and performed both server-side and client-side validations.
- Developed EJB components to implement business logic using session and message-driven beans.
- Used the Spring Framework to integrate with the Struts web framework and Hibernate.
- Worked extensively with Hibernate to connect to the database for data persistence and integrated Activate Catalog to get parts using JMS.
- Used Log4j to log both user interface and domain-level messages.
- Extensively worked with Struts for middle tier development with Hibernate as ORM and Spring IOC for Dependency Injection for the application based on MVC design paradigm.
- Created the struts-config.xml file to manage the page flow and developed HTML views with HTML, CSS and JavaScript.
- Performed unit testing for modules using JUnit and played an active role in preparing documentation for future reference and upgrades.
- Worked in an Agile environment using Scrum as the methodology, and was responsible for delivering potentially shippable product increments at the end of each sprint.
- Involved in Scrum meetings that allow clusters of teams to discuss their work, focusing especially on areas of overlap and integration.
Environment: Java 1.4, JSP, Servlets, JavaScript, HTML 5, AJAX, JDBC, JMS, EJB, Struts 2.0, Spring 2.0, Hibernate 2.0, Eclipse 3.x, WebLogic 9, Oracle 9i, JUnit, Log4j