- 8+ years of IT industry experience encompassing a wide range of skills in Big Data and Java/J2EE technologies.
- 4+ years of experience working with Big Data technologies on systems comprising massive amounts of data running in highly distributed mode on the Cloudera and Hortonworks Hadoop distributions.
- Strong knowledge of the Hadoop ecosystem, including HDFS, Hive, Oozie, HBase, Pig, Sqoop, ZooKeeper, Flume, Kafka, MR2, YARN, and Spark.
- Excellent knowledge of Hadoop 1.0 and 2.0 architecture, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, ResourceManager, NodeManager, and the MapReduce programming paradigm.
- Good understanding of data replication, HDFS Federation, High Availability, and rack awareness concepts.
- Extensive knowledge of developing Spark Streaming jobs built on RDDs (Resilient Distributed Datasets), using PySpark and spark-shell as appropriate.
- Experience developing Java MapReduce jobs for data cleaning and data manipulation as required by the business.
- Executed complex HiveQL queries to extract required data from Hive tables and wrote Hive UDFs as needed.
- Strong ability to prepare and present data in a visually appealing, easy-to-understand manner using Tableau, Excel, etc.
- Strong knowledge of Denodo, which performs many of the same transformation and quality functions as traditional data integration tools (Extract-Transform-Load (ETL), data replication, data federation, Enterprise Service Bus (ESB), etc.).
- Experience working with NoSQL databases including Cassandra, MongoDB, and HBase.
- Managed and scheduled batch jobs on a Hadoop cluster using Oozie.
- Experience in managing and reviewing Hadoop Log files.
- Used ZooKeeper to provide coordination services to the cluster.
- Hands-on experience extracting data from log files and copying it into HDFS using Flume.
- Experience in analyzing data using Hive, Pig Latin, and custom MR programs in Java.
- Good understanding of different file formats such as JSON, Parquet, Avro, ORC, SequenceFile, and XML.
- Extensive experience with Java/J2EE technologies such as Hibernate and Spring MVC.
- Expertise in Core Java, data structures, algorithms, Object-Oriented Design (OOD), and Java concepts such as OOP, the Collections Framework, exception handling, the I/O system, and multithreading.
- Extensive experience working with SOAP and RESTful web services.
- Extensive experience working with Oracle, MS SQL Server, DB2, and MySQL relational databases.
- Experienced working within the SDLC using Agile and Waterfall methodologies.
- Experience working in the healthcare and e-commerce industries.
- Ability to meet deadlines without compromising the quality of deliverables.
- Excellent communication, interpersonal, and problem-solving skills; a strong team player.
- Ability to quickly adapt to new environments and technologies.
- Authorized to work in the United States for any employer.
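As an illustration of the RDD-style map/filter/aggregate work mentioned above, here is a minimal, hypothetical sketch. The logic mirrors a PySpark `map`/`filter`/`reduceByKey` pipeline but is written with plain Python built-ins so it runs standalone; the records and cleaning rules are invented for the example.

```python
from collections import defaultdict

# Hypothetical raw records, as they might arrive from a log source.
raw = [" alice,23 ", "bob,-1", "carol,31", "", "dave,abc"]

def parse(line):
    """Map step: trim whitespace and split into a (name, age) pair."""
    parts = line.strip().split(",")
    if len(parts) != 2:
        return None
    name, age = parts
    return (name, int(age)) if age.lstrip("-").isdigit() else None

def valid(rec):
    """Filter step: drop malformed rows and impossible ages."""
    return rec is not None and rec[1] >= 0

# Equivalent of rdd.map(parse).filter(valid).collect()
cleaned = [r for r in map(parse, raw) if valid(r)]

# Equivalent of a reduceByKey-style aggregation
totals = defaultdict(int)
for name, age in cleaned:
    totals[name] += age
```

On a real cluster the same chain would be expressed against a SparkContext and executed lazily across partitions.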
Senior Hadoop Developer
Confidential, Kansas City, KS
- Added and decommissioned Hadoop cluster nodes, including balancing HDFS block data.
- Implemented the Fair Scheduler on the ResourceManager to share cluster resources among the MRv2 jobs submitted by users.
- Worked with the systems engineering team to propose and deploy new hardware and software environments required for Hadoop and to expand existing environments.
- Investigated and performed the migration from MRv1 to MRv2.
- Worked with Big Data Analysts, Designers and Scientists in troubleshooting MRv1/MRv2 job failures and issues with Hive, Pig, Flume, and Apache Spark.
- Implemented a Kafka producer module in Spring Boot (Gradle build) to produce messages from MongoDB to a defined Kafka topic.
- Utilized Apache Spark for interactive data mining and data processing.
- Buffered incoming load using Apache Kafka's fast, scalable, fault-tolerant messaging system before the data was analyzed.
- Configured Sqoop to import and export data between HDFS and RDBMS.
- Handled data exchange between HDFS, web applications, and databases using Flume and Sqoop.
- Created Hive tables and loaded data into them.
- Created base views, derived views, joins, unions, projections, selections, minus operations, flatten views, interface views, and associations of data service layers in Denodo.
- Created scheduled jobs for data extracts and report reloads using the Denodo Scheduler.
- Extensively involved in querying using Hive and Pig.
- Developed an open-source Impala/Hive Liquibase plug-in for schema migration in CI/CD pipelines.
- Wrote custom UDFs to extend Pig core functionality.
- Wrote custom MapReduce jobs using the Java API.
- Familiarity with the NoSQL database HBase.
- Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
- Setup automated processes to analyze the System and Hadoop log files for predefined errors and send alerts to appropriate groups.
- Set up automated processes to archive/clean unwanted data on the cluster, on the NameNode and standby node.
- Analyzed system failures, identified root causes, and recommended courses of action; documented system processes and procedures for future reference.
- Supported technical team members in management and review of Hadoop log files and data backups.
- Designed target tables as per the requirement from the reporting team and designed Extraction, Transformation and Loading (ETL) using Talend.
- Implemented File Transfer Protocol operations using Talend Studio to transfer files in between network folders.
- Participated in development and execution of system and disaster recovery processes.
- Experience with AWS cloud services such as EC2, ELB, RDS, ElastiCache, Route 53, and EMR.
- Hands-on experience with cloud configuration for Amazon Web Services (AWS).
- Hands-on experience with container technologies such as Docker, embedding containers in existing CI/CD pipelines.
- Set up an independent testing lifecycle for CI/CD scripts with Vagrant and VirtualBox.
Environment: Hadoop, Hortonworks (HDP), MapReduce2, Hive, Pig, HDFS, Sqoop, Oozie, Talend, Kafka, Spark, Python, Spring Boot, HBase, ZooKeeper, Impala, LDAP, NoSQL, MySQL, Infobright, Linux, AWS.
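The automated log-scanning alerts described above can be sketched as follows. This is a simplified, hypothetical Python version: the error patterns, log lines, and alert hook are invented for illustration, not taken from a real cluster.

```python
import re

# Hypothetical predefined error patterns to watch for in Hadoop logs.
ERROR_PATTERNS = [
    re.compile(r"FATAL"),
    re.compile(r"java\.io\.IOException"),
    re.compile(r"Block .* is corrupt"),
]

def scan_log(lines):
    """Return the log lines matching any predefined error pattern."""
    return [ln for ln in lines if any(p.search(ln) for p in ERROR_PATTERNS)]

def alert(matches, notify):
    """Send matched lines to an alerting hook (e.g. email or pager)."""
    if matches:
        notify("\n".join(matches))

# Example run against a few invented log lines.
log = [
    "2017-03-01 10:00:01 INFO  NameNode started",
    "2017-03-01 10:00:05 FATAL DataNode shutting down",
    "2017-03-01 10:00:09 WARN  java.io.IOException: disk failure",
]
hits = scan_log(log)
```

In production this kind of scan would typically run on a schedule (cron or an Oozie coordinator) over the rotated log directory.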
Confidential, Sacramento, CA
- Handled importing of data from various data sources, performed data control checks using Spark, and loaded the data into HDFS.
- Built real time pipeline for streaming data using Kafka and Spark Streaming.
- Spark Streaming collects this data from Kafka in near real time, performs the necessary transformations and aggregations on the fly to build the common learner data model, and persists the data in Cassandra.
- Developed Spark scripts by using Scala shell commands as per the requirement.
- Loaded data into Spark RDDs and performed in-memory computations to generate the output response.
- Implemented Spark RDD transformations and actions to perform business analysis.
- Created Hive queries that helped market analysts spot emerging trends by comparing fresh data with HDFS tables and historical metrics.
- Loaded and transformed large sets of data from the Cassandra source through Kafka, placing them in HDFS for further processing.
- Implemented Cassandra connector for Spark 1.6.1.
- Implemented Cassandra connection with the Resilient Distributed Datasets.
- Wrote customized Hive UDFs in Java where the required functionality was too complex for built-in functions.
- Used Flume to collect, aggregate, and store the log data from different web servers.
- Created business data reports using Spark SQL.
- Implemented Spark using Scala and Spark SQL for faster testing and processing of data.
- Processed the streaming data using Kafka, integrating with Spark Streaming API.
- Worked on the Spark SQL for analyzing the data.
- Used Sqoop to import the data from RDBMS to Hadoop Distributed File System (HDFS) and later analyzed the imported data using Hadoop Components.
- Designed and developed Hive managed/external tables using struct, map, and array types in various storage formats.
- Implemented performance techniques such as partitioning and bucketing in Hive for better query performance.
- Imported Hive tables into Impala for generating reports using Talend.
- Created COBOL layouts and X2CJ files, and transformed data from source to target tables using Informatica.
- Worked on Data Profiling using IDQ-Informatica Data Quality to examine different patterns of source data. Proficient in developing Informatica IDQ transformations like Parser, Classifier, Standardizer and Decision.
- Developed workflows using Oozie to automate the tasks of loading data into HDFS.
- Used Oozie to automate end-to-end data pipelines and Oozie coordinators to schedule the workflows.
Environment: Hadoop, Hortonworks (HDP), Sqoop, Hive, HDFS, YARN, ZooKeeper, Cassandra, Apache Spark, Python, Scala, Kafka, Oracle, Java, Informatica, Informatica IDQ.
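The Hive bucketing mentioned above works by hashing a clustering column into a fixed number of buckets, so rows with the same key always land in the same file. A simplified Python sketch of the idea (the rolling hash mimics Java's `hashCode` style but is not Hive's exact implementation, and the bucket count and rows are invented):

```python
NUM_BUCKETS = 4  # illustrative, as in CLUSTERED BY (user_id) INTO 4 BUCKETS

def bucket_for(key, num_buckets=NUM_BUCKETS):
    """Assign a row to a bucket by hashing its clustering key.

    Hive uses a Java hashCode-based scheme; this simple deterministic
    31x rolling hash keeps the example self-contained."""
    h = 0
    for ch in str(key):
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF  # keep the hash non-negative
    return h % num_buckets

# Rows with the same key always land in the same bucket, which lets
# bucketed map-side joins and sampling skip the other buckets entirely.
rows = ["user42", "user17", "user42", "user99"]
buckets = [bucket_for(r) for r in rows]
```

Partitioning prunes by directory (e.g. by day), while bucketing subdivides each partition by hash; the two combine for the performance gains cited above.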
Big Data Engineer
Confidential, Hoboken, NJ
- Configured a Kafka/Flume ingestion pipeline to transmit logs from the web servers to the Hadoop cluster.
- Used interceptors with regular expressions in the Flume configuration to filter unwanted chunks from the logs and dump the rest into HDFS.
- Used Avro SerDes for serialization and deserialization of log files at the different Flume agents.
- Created Pig Latin scripts to remove duplicated log records arising from Flume agent crashes.
- Partitioned both the raw and the processed data by day using one-level partitioning schemes.
- Created the external tables in Hive based on the processed data obtained from Spark.
- Ingested secondary data from systems such as CRM, CPS, and ODS using Sqoop and correlated it with the log files, providing the platform for data analysis.
- Performed basic aggregations such as count, average, sum, distinct, max, and min on the existing Hive tables using Impala to determine average hit rates, miss rates, bounce rates, etc.
- Persisted the processed data in columnar stores such as HBase, providing a platform for analytics with BI tools, analytical tools such as R, and machine-learning libraries such as Mahout.
- Involved in running and orchestrating the entire flow daily using Oozie jobs.
- Tackled problems and completed the tasks committed to for each sprint.
Environment: Cloudera (CDH), Flume 1.5.2, Sqoop 1.4.6, HDFS 2.6.0, Hadoop 2.6.0, Hive 0.14.0, HBase 0.98.0, Impala 2.1.0, Pig 0.14.0, Oozie 4.1.0
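The regex filtering and de-duplication steps above can be sketched as follows. This is a simplified standalone Python version; the noise pattern and log lines are invented, and real Flume interceptors are configured declaratively in the agent config rather than coded this way.

```python
import re

# Hypothetical noise pattern an interceptor might drop (e.g. health-check hits).
DROP = re.compile(r"GET /healthcheck")

def intercept(events):
    """Mimic a Flume regex interceptor: discard events matching DROP."""
    return [e for e in events if not DROP.search(e)]

def dedupe(events):
    """Mimic the Pig de-duplication step: keep the first copy of each event."""
    seen, out = set(), []
    for e in events:
        if e not in seen:
            seen.add(e)
            out.append(e)
    return out

log = [
    "10.0.0.1 GET /home 200",
    "10.0.0.2 GET /healthcheck 200",
    "10.0.0.1 GET /home 200",   # duplicate from an agent crash/replay
    "10.0.0.3 GET /cart 200",
]
clean = dedupe(intercept(log))
```

At scale the dedupe step would run as a Pig DISTINCT (or a GROUP BY on an event key) over the day's partition rather than an in-memory set.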
- Involved in the development of use case documentation, requirement analysis, and project documentation.
- Developed and maintained Web applications as defined by the Project Lead.
- Used MS Visio for creating business process diagrams.
- Developed Action Servlet, ActionForm, and Java Bean classes implementing business logic for the Struts framework.
- Developed Servlets and JSPs based on the MVC pattern using the Struts Action framework.
- Developed all tiers of the J2EE application: data objects communicating with the database via JDBC in the database tier, business logic implemented with EJBs in the middle tier, and Java Beans and helper classes communicating with the presentation tier of JSPs and Servlets.
- Used AJAX for client-side validations.
- Applied annotations for dependency injection and transforming POJO/POJI to EJBs.
- Developed persistence layer modules using EJB Java Persistence API (JPA) annotations and Entity manager.
- Involved in creating EJBs that handle business logic and persistence of data.
- Developed Action and Form Bean classes to retrieve data and process server-side validations.
- Designed the various tables required for the project in the Oracle database and used stored procedures in the application; used PL/SQL to create, update, and manipulate tables.
- Used IntelliJ as IDE and Tortoise SVN for version control.
- Involved in impact analysis of Change requests and Bug fixes.
Environment: Java 5, Struts, PL/SQL, Oracle, EJB, IntelliJ, Tortoise SVN, MS Visio, Firebug, Apache Tomcat, JSP, JavaScript, CSS.