Big Data Developer Resume

Long Beach, CA


  • Over 8 years of experience in application development and design using Hadoop ecosystem tools and Java/J2EE technologies.
  • Developed and built frameworks that integrate big data and advanced analytics to make business decisions.
  • Extensive experience in installing, configuring, and using ecosystem components such as Hadoop MapReduce, HDFS, Hive, Pig, Flume, Sqoop, and Spark.
  • Preprocessed and cleansed big data for better analysis.
  • Certified Cloudera Spark and Hadoop Developer
  • Experience in Cloudera distributions (CDH) and Hortonworks Data Platform (HDP)
  • Created various use cases using massive public big data sets. Ran performance tests to verify the efficacy of MapReduce, Pig, and Hive.
  • Migrated to Azure cloud and created an end-to-end architecture for running in the cloud.
  • Experience with ADF, ADLS, Blob Storage, HDInsight, Ranger, S3, IR, IoT Hub, Stream Analytics, etc.
  • Good knowledge of Amazon Web Services (AWS) components such as EC2, EMR, S3, and CloudWatch.
  • Proficiency in developing applications using Java, JSP, JavaScript, JDBC, Selenium, Oracle ADF, Python
  • Strong coding and debugging skills on the Java platform
  • Experience in shipping enterprise products, web/mobile UI applications to a large customer base
  • Experienced in Full Life Cycle development of software products
  • Proficient with Servlets, JSPs, and MVC frameworks
  • Excellent analytical and problem-solving skills and the ability to learn new technologies quickly


Learning: Can rapidly adapt to new environments and designs.

Apache Hadoop: HDFS, Hive, Pig, MapReduce, Flume, Sqoop and Spark

Cloud: HDInsight, ADLS, ADF, S3, EMR, EC2, NACL, Security groups

Programming Languages & Scripts: Java, J2EE, UNIX, JavaScript, SQL, UML, XML, CSS, JSON

Enterprise Java: JSP, Servlets, JSF, EJB, JMS, Socket Programming, Java Beans

Software Design: Design Patterns, Data Structures, Object Oriented design

Tools & Framework: TIBCO Composite, JSF, Spring, Web Services, Selenium, JUnit, Maven, Ant

Web Servers: WebSphere, Tomcat, Oracle OC4J, Oracle WebLogic Server

IDE: Eclipse, Visual Studio, Xcode, Git


Confidential, Long Beach, CA

Big Data Developer


  • Worked on live 30-node (Prod) and 6-node (UAT) big data clusters running CDH 5.13.3
  • Developed and maintained the complex Claims Semantic Pipeline for weekly full loads and incremental loads
  • Validated the weekly full load of claims against Netezza and checked it for discrepancies
  • Resolved the state issue (Universal and Medicare state) in the data set for reference for all the pipelines
  • Developed the aggregated datasets and lookup columns from the Claims dataset and all reference tables
  • Integrated SIU pipeline into the existing Claims pipeline and retired the SIU pipeline
  • Used windowing techniques and UDFs in SparkSQL
  • Developed and incorporated enhancements into the existing claims pipeline
  • Monitored and maintained the weekly Talend job and resolved failures to meet SLAs
  • Converted existing SQL logic to SparkSQL for the Pharmacy pipeline and optimized it
  • Improved the performance of Provider datasets and incorporated all the provider data into claims
  • Worked with Parquet files using Snappy compression to speed up network transfer of big data
  • Created Hive tables and views using Impala. Implemented partitioning and bucketing in Hive for better organization of data
  • Built Power BI dashboards to validate the data against Netezza
  • Currently automating a check that validates the L0 data before the pipeline starts
  • Collaborated with Data Management team on the business requirements and retirement of Netezza
  • Followed Agile Scrum methodology in JIRA during the project
  • Gained very good business knowledge of claims processing

Confidential, Houston, TX

Big Data Developer


  • Involved in the complete SDLC of a big data project, including requirement analysis, design, coding, testing, and production.
  • Worked on live 24-node and 4-node (Test) big data clusters running Hadoop 3.6 on Linux.
  • Experience working on both non-domain-joined and domain-joined clusters.
  • Worked with unstructured, structured, and semi-structured data totaling 30 TB (90 TB with a replication factor of 3)
  • Ingested structured data from TIBCO Composite Data Virtualization tool into ADLS using Sqoop
  • Created Shell scripts to automate the Sqoop jobs.
  • Developed Ambari workflows for scheduling and orchestrating the ETL process
  • Worked with ORC files using ZLIB compression to speed up network transfer of big data
  • Ingested structured big data from Teradata, Oracle, Netezza, Postgres, and SQL Server into ADLS using Azure Data Factory (ADF).
  • Created ADF pipelines to create clusters, ingest data, create Hive tables, and enable daily triggers.
  • Involved in converting Hive queries into Spark transformations using Spark Structured API.
  • Used PySpark (Python) and Scala to analyze data in a non-domain-joined Spark 2.3 cluster
  • Wrote Python scripts to transfer data from Hive tables into the Data Science Sandbox using SFTP.
  • Monitored and managed the Hadoop cluster using Ambari.
  • Created Power BI dashboards that generate metrics from incident record data and Hive tables over an ODBC connection.
  • Gained very good business knowledge of the oil and gas industry, including well pads, weather, mud pressure, and exploration analysis.
  • Collaborated with the Digital Security, Data Science, Palantir, and Catalog teams to ensure data quality and availability.
  • Followed Agile Scrum methodology in Visual Studio Team Services during the project.

Confidential, Orlando, FL

Hadoop Developer


  • Worked on a live 80-node Hadoop cluster running CDH 5.10
  • Worked with structured and semi-structured data totaling 150 TB (450 TB with a replication factor of 3)
  • Created and ran Sqoop jobs with incremental load to populate Hive external tables.
  • Extensive experience in writing Pig scripts to transform raw data from several data sources into baseline data.
  • Developed Hive queries and UDFs to analyze/transform the data in HDFS.
  • Applied partitioning and bucketing in Hive and designed both managed and external tables to optimize performance
  • Wrote custom MapReduce programs to analyze data and used Pig Latin to remove unwanted data
  • Used pattern-matching algorithms in Pig to identify fraudulent customers across different sources, built risk profiles for each customer, and stored the results in HDFS
  • Used Oozie to orchestrate the MapReduce jobs and worked with HCatalog to open up access to Hive's Metastore
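
The raw and replicated data sizes quoted in this resume (for example 150 TB raw, 450 TB replicated) follow from multiplying the raw size by the HDFS replication factor. A minimal Python sketch of that arithmetic is shown here; the function name `replicated_size_tb` is hypothetical and chosen only for illustration.

```python
def replicated_size_tb(raw_tb: float, replication_factor: int = 3) -> float:
    """Return the on-disk footprint in TB for a given raw data size.

    Hypothetical helper: every block is stored `replication_factor` times,
    so the stored size is the raw size multiplied by that factor.
    """
    if raw_tb < 0 or replication_factor < 1:
        raise ValueError("raw_tb must be >= 0 and replication_factor >= 1")
    return raw_tb * replication_factor


if __name__ == "__main__":
    # Raw sizes (in TB) taken from the cluster descriptions in this resume.
    for raw_tb in (15, 30, 150):
        print(f"{raw_tb} TB raw -> {replicated_size_tb(raw_tb)} TB stored")
```

This ignores compression, temporary files, and non-HDFS overhead, so it is only a rough capacity estimate.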


Software Development Engineer


  • Worked on a 10-node Hadoop cluster
  • Worked on semi-structured and structured data totaling 15 TB (45 TB with a replication factor of 3)
  • Loaded data from disparate data sets using Sqoop and Flume.
  • Used Sqoop to import/export data between RDBMS and Hive tables.
  • Imported logs from web servers with Flume to ingest the data into HDFS.
  • Created Sqoop jobs with incremental load to populate Hive external tables.
  • Applied partitioning and bucketing in Hive and designed both managed and external tables to optimize performance.
  • Wrote Pig Latin scripts to perform transformations per use-case requirements.
  • Worked with different file formats and compression techniques.

Environment: Cloudera Enterprise, Hadoop, MapReduce, Pig, Hive, Avro, Sqoop, HBase


Member Technical Staff


  • Created functional and design specification documents.
  • Analyzed how to display the data/metrics collected on Enterprise Manager (EM) and developed the relevant pages.
  • Worked on User-Interface using JSPs and Servlets for the Enterprise Manager framework
  • Discovered all the Universal Content Management servers installed on the content server and identified their statuses.
  • Extracted the configuration details of the server.
  • Integrated the targets (SOA, WebLogic, WebCenter) into the EM tree.
  • Created Dynamic Monitoring Services (DMS) messages for Content Management.
  • Added DMS instrumentation to the Content Server code to extract the metrics, then validated and tested them.
  • Identified the cached queries, active databases, documents waiting, and number of service requests in the Content Server
  • Analyzed system performance and monitored system status.
  • Used Oracle Application Development Framework (ADF) for end-to-end Java-based application development.
  • Resolved server issues based on priority.

Environment: Java, J2EE (Servlets), OOP concepts, Oracle DB, JDBC




  • Prepared Requirement, Functional, and Design Specification documents.
  • Worked with Oracle JDeveloper, a free integrated development environment.
  • Implemented peer discovery both statically and dynamically (using SLP and NAPTR)
  • Created Realm and Peer routing tables.
  • Opened TCP connections to send and receive data.
  • Tested each method with JUnit.
  • Used EMMA code coverage to help improve the project's test coverage.
  • Implemented Failover and Failback procedures

Environment: Java, J2EE (Socket Programming), Design Patterns, Seagull Traffic generator
