Big Data Architect Resume

Mt Laurel, NJ


  • Over 10 years of experience in the IT industry, including Big Data environments, the Hadoop ecosystem, Java, and the design, development and maintenance of various applications.
  • Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL).
  • Expertise in core Java and JDBC, and proficient in using Java APIs for application development.
  • Experience includes development of web-based applications using Core Java, JDBC, Java Servlets, JSP, Struts Framework, Hibernate, HTML, JavaScript, XML and Oracle.
  • Expertise in JavaScript, JavaScript MVC patterns, Object Oriented JavaScript Design Patterns and AJAX calls.
  • Good experience in Tableau for Data Visualization and analysis on large data sets, drawing various conclusions.
  • Leveraged and integrated Google Cloud Storage and BigQuery applications, which connected to Tableau for end-user web-based dashboards and reports.
  • Good working experience in Application and web Servers like JBoss and Apache Tomcat.
  • Good knowledge of Amazon Web Services (AWS) concepts like EMR and EC2, which provide fast and efficient processing of Teradata Big Data Analytics.
  • Expertise in Big Data architectures such as Hadoop distributed systems (Azure, Hortonworks, Cloudera), MongoDB and NoSQL.
  • Hands-on experience with Hadoop/Big Data technologies for storage, querying, processing and analysis of data.
  • Experience in using PL/SQL to write Stored Procedures, Functions and Triggers.
  • Experience in development of Big Data projects using Hadoop, Hive, HDP, Pig, Flume, Storm and MapReduce open source tools.
  • Experience in installation, configuration, supporting and managing Hadoop clusters.
  • Experience in writing MapReduce programs using Apache Hadoop for working with Big Data.
  • Experience in installation, configuration, supporting and monitoring Hadoop clusters using Apache, Cloudera distributions and AWS.
  • Strong hands-on experience with AWS services, including but not limited to EMR, S3, EC2, Route 53, RDS, ELB, DynamoDB and CloudFormation.
  • Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
  • Hands-on experience in the Hadoop ecosystem, including Spark, Kafka, HBase, Scala, Pig, Impala, Sqoop, Oozie, Flume, Storm and other big data technologies.
  • Worked with Spark SQL, Spark Streaming and the core Spark API to build data pipelines.
  • Experienced in working with different scripting technologies like Python and Unix shell scripts.
  • Successfully loaded files to HDFS from Oracle, SQL Server, Teradata and Netezza using Sqoop.
  • Excellent knowledge of Big Data infrastructure, distributed file systems (HDFS) and parallel processing (MapReduce framework).
  • Extensive knowledge in working with IDE tools such as MyEclipse, RAD, IntelliJ and NetBeans.
  • Expert in Amazon EMR, Spark, Kinesis, S3, ECS, ElastiCache, DynamoDB and Redshift.
  • Experience in installation, configuration, supporting and managing the Cloudera Hadoop platform, including CDH4 and CDH5 clusters.
  • Strong experience and knowledge of NoSQL databases such as MongoDB and Cassandra.
  • Experience in working with different data sources like Flat files, XML files and Databases.
  • Experience in database design, entity relationships, database analysis, SQL programming, and PL/SQL stored procedures, packages and triggers in Oracle.


Big Data Ecosystem: MapReduce, HDFS, Hive 2.3, HBase 1.2, Pig, Sqoop, Flume 1.8, HDP, Oozie, ZooKeeper, Spark, Kafka, Storm, Hue

Hadoop Distributions: Cloudera (CDH3, CDH4, CDH5), Hortonworks

Cloud Platform: Amazon AWS, EC2, Redshift

Databases: Oracle 12c/11g, MySQL, MS SQL Server 2016/2014

Version Control: GIT, GitLab, SVN

Java/J2EE Technologies: Servlets, JSP, JDBC, JSTL, EJB, JAXB, JAXP, JMS, JAX-RPC, JAX-WS

NoSQL Databases: HBase and MongoDB

Programming Languages: Java, Python, SQL, PL/SQL, AWS, HiveQL, UNIX Shell Scripting, Scala.

Methodologies: Software Development Lifecycle (SDLC), Waterfall Model and Agile, STLC (Software Testing Life Cycle) & UML, Design Patterns (Core Java and J2EE)

Web Technologies: JavaScript, CSS, HTML and JSP.

Operating Systems: Windows, UNIX/Linux and Mac OS.

Build Management Tools: Maven, Ant.

IDE & Command line tools: Eclipse, IntelliJ, Toad and NetBeans.


Confidential, Mt Laurel, NJ

Big Data Architect


  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Loaded and transformed large sets of structured, semi-structured and unstructured data using Hadoop/Big Data concepts.
  • Implemented MapReduce programs to retrieve results from unstructured data sets.
  • Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
  • Designed and worked on a Big Data analytics platform for processing customer interface preferences and comments using Hadoop, Hive and Pig on Cloudera.
  • Imported data from Oracle into HDFS and Hive, and exported it back, using Sqoop.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Worked on reading multiple data formats on HDFS using Scala.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Installed and configured Pig and wrote Pig Latin scripts.
  • Developed multiple POCs using Scala, deployed them on the YARN cluster and compared the performance of Spark with Hive and SQL.
  • Designed both 3NF data models for ODS and OLTP systems and dimensional data models using star and snowflake schemas.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Built data platforms, pipelines and storage systems using Apache Kafka, Apache Storm and search technologies such as Elasticsearch.
  • Implemented POCs to migrate iterative MapReduce programs into Spark transformations using Scala.
  • Monitored and resolved data warehouse ETL production issues: analyzed each issue, traced the problem area, suggested a solution and discussed it with the business.
  • Developed Spark scripts by using Python and Scala shell commands as per the requirement.
  • Experienced with batch processing of data sources using Apache Spark and Elasticsearch.
  • Experienced in the AWS cloud environment, including S3 storage and EC2 instances.
  • Developed Spark jobs using Scala in test environment for faster data processing and used Spark SQL for querying.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
  • Designed and implemented Solr indexes for the metadata that enabled internal applications to reference Scopus content.
  • Used Spark with Scala for parallel data processing and better performance.
  • Used Pig extensively for data cleansing and to extract data from web server output files for loading into HDFS.
  • Developed a data pipeline using Kafka and Storm to store data into HDFS.
  • Implemented Kafka producers with custom partitioning, configured brokers and implemented high-level consumers for the data platform.
  • Created Hive tables, loaded them with data and wrote Hive queries that run internally as MapReduce jobs.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Analyzed large data sets to determine the optimal way to aggregate and report on them using MapReduce programs.
  • Developed simple to complex MapReduce streaming jobs using Python.

Environment: Pig 0.17, Hive 2.3, HBase Admin, PySpark, HBase 1.2, Metadata, Shell Scripts, TypeScript, Sqoop 1.4, Flume 1.8, Cassandra 3.11, AWS DynamoDB, AWS Lambda, AWS EMR, Java, Oracle, Hadoop, MongoDB, Pivotal Cloud Foundry, ETL, ZooKeeper, Bitbucket, AWS, QlikView, MapReduce, HDFS, Governance, Cloudera, Scala, Spark 2.3, SQL, Apache Kafka 1.0.1, Apache Storm, Python, Unix and Solr 7.2

Confidential, Austin, Texas

Lead Big Data Developer


  • Contributed to the development of key data integration and advanced analytics solutions leveraging Apache Hadoop and other big data technologies for leading organizations, using major Hadoop distributions such as Hortonworks.
  • Involved in Agile methodologies, daily Scrum meetings, Sprint planning.
  • Performed data transformations in Hive and used partitions and buckets for performance improvements.
  • Created Hive external tables on the MapReduce output, then applied partitioning and bucketing on top of them.
  • Developed business-specific custom UDFs in Hive and Pig.
  • Developed end-to-end architecture designs for big data solutions based on a variety of business use cases.
  • Worked as a Spark expert and performance optimizer.
  • Member of the Spark COE (Center of Excellence) in the Data Simplification project at Cisco.
  • Experienced with SparkContext, Spark SQL, DataFrames, pair RDDs and Spark on YARN.
  • Handled data skew in Spark SQL.
  • Implemented Spark applications in Scala and Java, using DataFrames and the Spark SQL API for faster data processing.
  • Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
  • Developed a data pipeline using Kafka, HBase, Spark and Hive to ingest, transform and analyze customer behavioral data; also developed Spark and Hive jobs to summarize and transform data.
  • Worked extensively on importing metadata into Hive and migrated existing tables and applications to Hive and Spark.
  • Implemented Sqoop imports from Oracle to Hadoop and loaded the data back in Parquet format.
  • Used Spark for interactive queries, streaming data processing and integration with a popular NoSQL database for huge volumes of data; worked with the MapR distribution and HDFS.
  • Imported data from different data sources into HDFS using Sqoop, performed transformations using Hive and MapReduce, and loaded the results into HDFS.
  • Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
  • Designed and maintained Tez workflows to manage the flow of jobs in the cluster.
  • Worked with the testing teams to fix bugs and ensure smooth, error-free code.
  • Prepared documents such as Functional Specification and Deployment Instruction documents.
  • Fixed defects during the QA phase, supported QA testing, troubleshot defects and identified their sources.
  • Involved in installing Hadoop Ecosystem components (Hadoop, MapReduce, Spark, Pig, Hive, Sqoop, Flume, Zookeeper and HBase).
  • Worked collaboratively with all levels of business stakeholders to architect, implement and test Big Data based analytical solutions drawing on disparate sources.

Environment: AWS S3, RDS, EC2, Redshift, HDP 2.6.x/HDF 3.x, Hadoop 3.0, Angular 5/7, Hive 2.3, Pig, Sqoop 1.4.6, Oozie, HBase 1.2, Flume 1.8, Hortonworks, MapReduce, Kafka, HDFS, Oracle 12c, Microsoft, Java, GIS, Spark 2.2, ZooKeeper

Confidential, Rocky Hill, CT

Sr. Spark/Hadoop Developer


  • Implemented solutions for ingesting data from various sources and processing data at rest using Big Data technologies such as Hadoop, MapReduce frameworks, HBase and Hive.
  • Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.
  • Developed the full SDLC of an AWS Hadoop cluster based on the client's business needs.
  • Involved in loading and transforming large sets of structured, semi-structured and unstructured data from relational databases into HDFS using Sqoop imports.
  • Implemented an enterprise-grade platform (MarkLogic) for ETL from mainframe to NoSQL (Cassandra).
  • Responsible for importing log files from various sources into HDFS using Flume.
  • Analyzed data using HiveQL to generate per-payer reports, built from payment summaries, for transmission to payers.
  • Imported millions of structured records from relational databases using Sqoop, processed them with Spark and stored the data in HDFS in CSV format.
  • Used the DataFrame API in Scala to convert distributed collections of data into named columns.
  • Performed data profiling and transformation on the raw data using Pig, Python, and Java.
  • Developed predictive analytics using Apache Spark Scala APIs.
  • Performed big data analysis using Pig and user-defined functions (UDFs).
  • Created Hive external tables, loaded data into them and queried the data using HQL.
  • Implemented Spark GraphX application to analyze guest behavior for data science segments.
  • Enhanced a traditional data warehouse based on a star schema, updated data models and performed data analytics and reporting using Tableau.
  • Involved in migration of data from existing RDBMS (Oracle and SQL Server) to Hadoop using Sqoop for processing data.
  • Developed Shell, Perl and Python scripts to automate and provide control flow to Pig scripts.
  • Designed and developed UI screens using Struts, DOJO, JavaScript, JSP, HTML, DOM, CSS, and AJAX.
  • Developed a prototype for Big Data analysis using Spark, RDDs, DataFrames and the Hadoop ecosystem with CSV, JSON, Parquet and HDFS files.
  • Developed HiveQL scripts to perform transformation logic and load data from the staging zone to the landing zone and semantic zone.
  • Involved in creating Oozie workflow and Coordinator jobs for Hive jobs to kick off the jobs on time for data availability.
  • Used the Oozie scheduler to automate the pipeline workflow and orchestrate the Sqoop, Hive and Pig jobs that extract the data in a timely manner.
  • Exported the generated results to Tableau for testing by connecting to the corresponding Hive tables using Hive ODBC connector.
  • Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization and user report generation.
  • Managed and led the development effort with the help of a diverse internal and overseas group.

Environment: Big Data, Spark, YARN, PySpark, Hive, Pig, JavaScript, JSP, HTML, Ajax, Scala, Python, Hadoop, AWS, DynamoDB, Kibana, Cloudera, EMR, JDBC, Redshift, NoSQL, Sqoop, MySQL.

Confidential, Tampa, FL

Hadoop Developer


  • Developed Sqoop scripts for the extractions of data from various RDBMS databases into HDFS.
  • Developed scripts to automate the workflow of various processes using Python and shell scripting.
  • Collected and aggregated large amounts of log data using Apache Flume and staged the data in HDFS for further analysis.
  • Wrote Hive join queries to fetch information from multiple tables and wrote multiple MapReduce jobs to collect output from Hive.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting on the dashboard.
  • Used AWS cloud and on-premise environments with infrastructure provisioning and configuration.
  • Wrote Perl scripts covering data feed handling, implementing MarkLogic, and communicating with web services through the SOAP::Lite module and WSDL.
  • Developed data pipelines using Pig and Hive from Teradata and DB2 data sources. These pipelines had customized UDFs to extend the ETL functionality.
  • Used UDFs to implement business logic in Hadoop, using Hive to read, write and query the Hadoop data in HBase.
  • Installed and configured Hadoop ecosystem components such as Hive, Oozie and Sqoop on a Cloudera Hadoop cluster, and helped with performance tuning and monitoring.
  • Used the Oozie workflow engine to run multiple Hive and Pig scripts, with Kafka for real-time data processing, to navigate data sets in HDFS; loaded log file data directly into HDFS using Flume.
  • Developed an end-to-end workflow to build a real-time dashboard using Kibana, Elasticsearch, Hive and Flume.
  • Used Hive to analyze data ingested into HBase by using Hive-HBase integration and computed various metrics for reporting on the dashboard.
  • Involved in developing MapReduce programs, writing queries and scheduling MapReduce jobs.
  • Developed code for importing and exporting data into HDFS and Hive using Sqoop.
  • Used Oozie for designing workflows and scheduling various jobs in the Hadoop ecosystem.
  • Developed MapReduce programs in Java for applying business rules on the data and optimized them using various compression formats and combiners.
  • Used Spark SQL to create DataFrames by loading JSON data and analyzed it.
  • Developed Spark code using Scala and Spark SQL for faster testing and data processing.
  • Installed and configured Hadoop; responsible for maintaining the cluster and for managing and reviewing Hadoop log files.
  • Developed Shell, Perl and Python scripts to automate and provide control flow to Pig scripts.

Environment: Hadoop, Hive, ZooKeeper, MapReduce, Sqoop, Pig 0.10 and 0.11, JDK 1.6, HDFS, Flume, Oozie, DB2, HBase, Mahout, Scala.

Confidential, Paso Robles, CA

Sr. Java/J2EE Developer


  • Involved in a full life cycle Object Oriented application development - Object Modeling, Database Mapping, GUI Design.
  • Developed the J2EE application based on the Service Oriented Architecture.
  • Used Design Patterns like Singleton, Factory, Session Facade and DAO.
  • Handled importing of data from various data sources, performed transformations using Hive and loaded data into HDFS.
  • Developed applications in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Orchestrated hundreds of Sqoop scripts, Python scripts and Hive queries using Oozie workflows and sub-workflows.
  • Moved all crawl data flat files generated from various retailers to HDFS for further processing.
  • Wrote script files for processing data and loading it into HDFS.
  • Worked on requirement gathering and analysis, and translated business requirements into a technical design based on the Hadoop ecosystem.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Created Java classes and interfaces to implement the system.
  • Created Hive tables to store the processed results in a tabular format.
  • Involved in requirements gathering, design, development and testing.
  • Fully involved in the requirement analysis phase.
  • Created External Hive Table on top of parsed data.
  • Developed various complex Hive queries as per business logic.
  • Developed complex Hive queries using joins and partitions for huge data sets as per business requirements, loaded the filtered data from source into edge-node Hive tables and validated the data.
  • Performed bucketing and partitioning of data using Apache Hive, which saved processing time and produced proper sample insights.
  • Optimized the Hive tables using optimization techniques like partitions and bucketing to provide better performance with Hive queries.
  • Used Log4j utility to generate run-time logs.
  • Wrote SAX and DOM XML parsers and used SOAP for sending and getting data from the external interface.
  • Deployed business components into WebSphere Application Server.
  • Developed Functional Requirement Document based on users' requirement.

Environment: Core Java, J2EE, JDK 1.6, Spring 3.0, Hibernate 3.2, Tiles, AJAX, JSP 2.1, Eclipse 3.6, IBM WebSphere 7.0, XML, XSLT, SAX, DOM Parser, HTML, UML, Oracle 10g, PL/SQL, JUnit.


ETL Developer


  • Analyzed business requirements and worked closely with various application teams and business teams to develop ETL procedures to convert and load data from various legacy source systems to the target Business Data Warehouse.
  • Created a BTEQ script for pre-population of the work tables prior to the main load process.
  • Developed ETL code, control files, metadata, and lineage diagrams for ETL programs.
  • Developed Design documents and ETL mapping documents.
  • Developed and documented Informatica mappings and Informatica sessions as per the business requirement.
  • Generated PL/SQL scripts and UNIX Shell scripts for automated daily load processes.
  • Created Informatica Mappings to load data using transformations like Source Qualifier, Sorter, Aggregator, Expression, Joiner, Filter, Sequence, Router, Update Strategy, Lookup transformations.
  • Created reusable sessions and executed them to load the data from the source system using Informatica Workflow Manager.
  • Extracted client information and history from flat files, Oracle and SQL Server, then transformed and loaded it into the Oracle staging area.
  • Worked on Key Performance Indicators (KPIs) and the design of star and snowflake schemas in Analysis Services (SSAS).
  • Developed MLOAD scripts to load data from Load Ready Files to Teradata Warehouse.
  • Wrote shell scripts to work with flat files, to define parameter files and to create pre and post session commands.
  • Designed, developed, maintained and supported Data Warehouse or OLTP processes via Extract, Transform and Load (ETL) software using Informatica.
  • Performed unit testing and system testing of Informatica mappings and workflows.
  • Extensively involved in performance tuning at source, target, mapping, session and system levels by analyzing the reject data.
  • Worked extensively on ETL performance tuning to tune data loads; worked with DBAs on SQL query tuning.
  • Involved in code review and code migration and in code deployment in different environments.
  • Involved in defect fixes and defect triage meetings with stakeholders and managers. Worked with business stakeholders, application developers and production teams across functional units to identify business needs and discuss solution options.
  • Performed relational data modeling, dimensional modeling, and OLAP multidimensional cube design and analysis; defined slowly changing dimensions and managed surrogate keys.

Environment: ETL, Oracle, SQL, PL/SQL, UNIX, BTEQ, Teradata, OLAP, OLTP, SSAS.
