
Sr. Hadoop/ Big Data Developer Resume

Boyertown, PA


  • Over 9 years of experience in the IT industry, including Big Data environments, the Hadoop ecosystem, Java, and the design, development, and maintenance of various applications.
  • Experience in developing custom UDFs for Pig and Hive to incorporate Python/Java methods and functionality into Pig Latin and HQL (HiveQL).
  • Expertise in core Java and JDBC, and proficient in using Java APIs for application development.
  • Expertise in JavaScript, JavaScript MVC patterns, object-oriented JavaScript design patterns, and AJAX calls.
  • Good experience in Tableau for Data Visualization and analysis on large data sets, drawing various conclusions.
  • Leveraged and integrated Google Cloud Storage and BigQuery applications, which connected to Tableau for end-user web-based dashboards and reports.
  • Good working experience with application and web servers such as JBoss and Apache Tomcat.
  • Good knowledge of Amazon Web Services (AWS) offerings such as Athena, EMR, and EC2, which provide fast and efficient processing of Teradata Big Data analytics.
  • Expertise in Big Data architectures such as Hadoop distributed systems (Azure, Hortonworks, Cloudera), MongoDB, NoSQL, HDFS, and parallel processing with the MapReduce framework.
  • Developed Spark-based applications to load streaming data with low latency using Kafka and PySpark.
  • Hands-on experience with Hadoop/Big Data technologies for storage, querying, processing, and analysis of data.
  • Experience in development of Big Data projects using Hadoop, Hive, HDP, Pig, Flume, Storm and MapReduce open-source tools.
  • Experience working with GitHub/Git source and version control systems.
  • Experience in installation, configuration, supporting and managing Hadoop clusters.
  • Experience in working with MapReduce programs using Apache Hadoop for working with Big Data.
  • Experience in developing, support and maintenance for the ETL (Extract, Transform and Load) processes using Talend Integration Suite.
  • Experience in installation, configuration, supporting and monitoring Hadoop clusters using Apache, Cloudera distributions and AWS.
  • Strong hands-on experience with AWS services, including but not limited to EMR, S3, EC2, Route 53, RDS, ELB, DynamoDB, and CloudFormation.
  • Excellent working experience in Scrum / Agile framework and Waterfall project execution methodologies.
  • Hands-on experience in the Hadoop ecosystem, including Spark, Kafka, HBase, Scala, Pig, Impala, Sqoop, Oozie, Flume, Storm, and other big data technologies.
  • Worked with Spark and Spark Streaming, using the core Spark API to explore Spark features and build data pipelines.
  • Experienced in working with scripting technologies such as Python and UNIX shell scripts.
  • Good knowledge of Amazon Web Services (AWS) such as EMR and EC2; successfully loaded files to HDFS from Oracle, SQL Server, Teradata, and Netezza using Sqoop.
  • Extensive knowledge of IDE tools such as MyEclipse, RAD, IntelliJ, and NetBeans.
  • Expert in Amazon EMR, Spark, Kinesis, S3, ECS, ElastiCache, DynamoDB, and Redshift.
  • Experience in installing, configuring, supporting, and managing the Cloudera Hadoop platform, including CDH4 and CDH5 clusters.
  • Proficient in multiple databases, including NoSQL databases (MongoDB, Cassandra), MySQL, Oracle, and MS SQL Server.
  • Experience in database design, entity relationships, database analysis, SQL programming, and PL/SQL stored procedures, packages, and triggers in Oracle.
  • Experience in working with different data sources like Flat files, XML files and Databases.
  • Ability to tune Big Data solutions to improve performance and end-user experience.
  • Working experience building RESTful web services and RESTful APIs.
  • Managed multiple tasks and worked under tight deadlines in fast-paced environments.
  • Excellent analytical and communication skills, which help in understanding business logic and building good relationships between stakeholders and team members.
  • Strong communication skills, analytic skills, good team player and quick learner, organized and self-motivated.


Big Data Ecosystems: Hadoop, MapReduce, HDFS, HBase, Zookeeper, Hive, Pig, Sqoop, Cloudera, Hortonworks, Yarn, Cassandra, Oozie, Storm, and Flume.

Streaming Technologies: Spark, Kafka, Storm

Scripting/Programming Languages: Cassandra, Python, Scala, Regular Expressions, Shell scripting, PL/SQL, R, PySpark, Bash, Java, SQL, JavaScript, HTML, CSS.

Databases: Data warehouse, RDBMS, NoSQL (Certified MongoDB), Oracle, HBase, Snowflake, MySQL.

Java/J2EE Technologies: Servlets, JSP (EL, JSTL, Custom Tags), JSF, Apache Struts, JUnit, Hibernate 3.x, Log4J, JavaBeans, EJB 2.0/3.0, JDBC, RMI, JMS, JNDI.

Tools: Eclipse, JDeveloper, MS Visual Studio, Microsoft Azure HDInsight, Microsoft Hadoop cluster, JIRA, NetBeans.

Methodologies: Agile, Scrum, Waterfall

Operating Systems: Unix/Linux

Machine Learning Skills (MLlib): Feature Extraction, Dimensionality Reduction, Model Evaluation, Clustering


Sr. Hadoop/ Big Data Developer

Confidential, Boyertown, PA


  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Responsible for Big data initiatives and engagement including analysis, brainstorming, POC, and architecture.
  • Loaded and transformed large sets of structured, semi structured and unstructured data using Hadoop/Big Data concepts.
  • Installed and Configured Apache Hadoop clusters for application development and Hadoop tools.
  • Installed and configured Hive, wrote Hive UDFs, and used a repository of UDFs for Pig Latin.
  • Developed data pipeline using Pig, Sqoop to ingest cargo data and customer histories into HDFS for analysis.
  • Migrated the existing on-prem code to an AWS EMR cluster.
  • Installed and configured Hadoop Ecosystem components and Cloudera manager using CDH distribution.
  • Coordinated with Hortonworks support team through support portal to sort out the critical issues during upgrades.
  • Involved in all phases of SDLC using Agile and participated in daily scrum meetings with cross teams.
  • Worked on modeling of Dialog process, Business Processes and coding Business Objects, Query Mapper and JUnit files.
  • Created automated pipelines in AWS CodePipeline to deploy Docker containers in AWS ECS using S3.
  • Used HBase NoSQL Database for real time and read/write access to huge volumes of data in the use case.
  • Extracted real-time feeds using Spark Streaming, converted them to RDDs, processed the data as DataFrames, and loaded the data into HBase.
  • Developed an AWS Lambda function to invoke a Glue job as soon as a new file is available in the inbound S3 bucket.
  • Created Spark jobs to apply data cleansing and validation rules to new source files in the inbound bucket and route rejected records to a reject-data S3 bucket.
  • Developed AWS CloudFormation templates, set up Auto Scaling for EC2 instances, and was involved in the automated provisioning of the AWS cloud environment using Jenkins.
  • Developed frontend and backend modules using Python and Django, with Git for version control.
  • Created HBase tables to load large sets of semi-structured data coming from various sources.
  • Responsible for loading the customer's data and event logs from Kafka into HBase using REST API.
  • Created tables along with sort and distribution keys in AWS Redshift.
  • Created shell and Python scripts to automate daily tasks, including production tasks.
  • Created, altered and deleted topics using Kafka Queues when required with varying.
  • Used cloud computing on the multi-node cluster, deployed Hadoop applications using S3, and used Elastic MapReduce (EMR) to run MapReduce jobs.
  • Developed analytics enablement layer using ingested data that facilitates faster reporting and dashboards.
  • Worked with production support team to provide necessary support for issues with CDH cluster and the data ingestion platform.
  • Created Hive external tables to stage data and then moved the data from staging to the main tables.
  • Implemented the Big Data solution using Hadoop, Hive, and Informatica to pull and load data into HDFS.
  • Pulled data from the Hadoop data lake ecosystem and reshaped it with various RDD transformations.
  • Used Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark data frames, Scala and Python.
  • Developed and maintained batch data flow using HiveQL and Unix scripting.
  • Experienced in writing and deploying UNIX Korn Shell Scripts as part of the standard ETL processes and for job automation purposes.
  • Designed and Developed Real time processing Application using Spark, Kafka, Scala and Hive to perform streaming ETL and apply Machine Learning.
  • Developed and executed data pipeline testing processes and validated business rules and policies.
  • Built code for real time data ingestion using MapR-Streams.
  • Implemented Spark using Python and Spark SQL for faster processing of data.
  • Automated unit testing using Python and applied different testing methodologies such as unit testing and integration testing.
  • Used Hive join queries to join multiple tables of a source system and loaded the results into Elasticsearch.
  • Implemented different data formatter capabilities and publishing to multiple Kafka Topics.
  • Extensively worked on Jenkins to implement Continuous Integration (CI) and Continuous Deployment (CD) processes.
  • Wrote automated HBase test cases for data quality checks using HBase command-line tools.
  • Involved in development of Hadoop System and improving multi-node Hadoop Cluster performance.
  • Developed and implemented Apache NiFi across various environments and wrote QA scripts in Python for tracking files.

Environment: Hadoop 3.0, MapReduce, Hive 3.0, Agile, HBase 1.2, NoSQL, AWS, Kafka, Pig 0.17, HDFS, Java 8, Hortonworks, Spark, PL/SQL, Python, Jenkins.

Big Data Engineer

Confidential, Athens, AL


  • Evaluated business requirements and prepared detailed specifications that follow project guidelines required to develop written programs.
  • Loaded and transformed large sets of structured, semi structured and unstructured data using Hadoop/Big Data concepts.
  • Created the automated build and deployment process for application, re-engineering setup for better user experience, and leading up to building a continuous integration system.
  • Implemented MapReduce programs to retrieve results from unstructured data set.
  • Optimized MapReduce Jobs to use HDFS efficiently by using various compression mechanisms.
  • Worked on and designed Big Data analytics platform for processing customer interface preferences and comments using Hadoop, Hive and Pig, Cloudera.
  • Imported and exported data between HDFS/Hive and Oracle and MongoDB using Sqoop.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Worked on reading multiple data formats on HDFS using Scala.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Installed and configured Pig and wrote Pig Latin scripts.
  • Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
  • Analyzed the SQL scripts and designed the solution to implement using Scala.
  • Built data platforms, pipelines, and storage systems using Apache Kafka, Apache Storm, and search technologies such as Elasticsearch.
  • Worked on querying data using Spark SQL on top of PySpark engine.
  • Experienced in implementing POCs to migrate iterative MapReduce programs into Spark transformations using Scala.
  • Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, Azure SQL Data Warehouse, and a write-back tool, and back again.
  • Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake, Azure Data Factory, HDInsight, Azure SQL Server, Azure ML and Power BI.
  • Designed end to end scalable architecture to solve business problems using various Azure Components like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
  • Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data using the SQL Activity.
  • Developed Spark scripts by using Python and Scala shell commands as per the requirement.
  • Experienced with batch processing of data sources using Apache Spark and Elasticsearch.
  • Designed dimensional data models using Star and Snowflake Schemas.
  • Developed Spark jobs using Scala in test environment for faster data processing and used Spark SQL for querying.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS.
  • Worked on Snowflake environment to remove redundancy and load real time data from various data sources into HDFS using Kafka.
  • Designed and implemented SOLR indexes for the metadata that enabled internal applications to reference Scopus content.
  • Used Spark with Scala for parallel data processing and better performance.
  • Extensively used Pig for data cleansing and to extract data from web server output files for loading into HDFS.
  • Implemented a fully operational production grade large scale data solution on Snowflake Data Warehouse.
  • Implemented Kafka producers with custom partitioning, configured brokers, and implemented high-level consumers for the data platform.
  • Created Hive tables, loaded them with data, and wrote Hive queries that run internally as MapReduce jobs.
  • Used Kubernetes to orchestrate the deployment, scaling and management of Docker Containers.
  • Used Azure Kubernetes Service (AKS) to deploy a managed Kubernetes cluster in Azure, built AKS clusters with the Azure portal and Azure CLI, and used template-driven deployment options such as Resource Manager templates and Terraform.
  • Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
  • Analyzed large data sets to determine the optimal way to aggregate and report on them using MapReduce programs.
  • Developed simple to complex MapReduce streaming jobs using Python.

Environment: Pig 0.17, Hive 2.3, HBase 1.2, Sqoop 1.4, Flume 1.8, ZooKeeper, Azure, ADF, Blob, Cosmos DB, MapReduce, HDFS, Cloudera, Scala, Spark 2.3, SQL, Apache Kafka 1.0.1, Apache Storm, Python, Unix.

Sr. Hadoop/ Big Data Developer

Confidential, Foster City, CA


  • Contributed to the development of key data integration and advanced analytics solutions leveraging Apache Hadoop and other big data technologies for leading organizations using major Hadoop distributions such as Hortonworks.
  • Involved in Agile methodologies, daily Scrum meetings, Sprint planning.
  • Performed Data transformations in HIVE and used partitions, buckets for performance improvements.
  • Created Hive external tables on the MapReduce output, then applied partitioning and bucketing on top of them.
  • Developed business-specific custom UDFs in Hive and Pig.
  • Developed end-to-end architecture designs for big data solutions based on a variety of business use cases.
  • Worked as a Spark expert and performance optimizer.
  • Member of the Spark COE (Center of Excellence) in the Data Simplification project at Cisco.
  • Experienced with Spark Context, Spark-SQL, Data Frame, Pair RDD's, Spark
  • Handled data skew in Spark SQL.
  • Implemented Spark using Scala and Java, utilizing DataFrames and the Spark SQL API for faster processing of data.
  • Developed Spark code and Spark SQL/Streaming jobs for faster testing and processing of data.
  • Developed a data pipeline using Kafka, HBase, Spark, and Hive to ingest, transform, and analyze customer behavioral data; also developed Spark and Hive jobs to summarize and transform data.
  • Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and Spark.
  • Implemented Sqoop imports from Oracle and MongoDB into Hadoop and loaded the data back in Parquet format.
  • Used Spark for interactive queries, processing of streaming data, and integration with a popular NoSQL database for huge volumes of data; worked with the MapR distribution and familiar with HDFS.
  • Handled importing data from different data sources into HDFS using Sqoop and performing transformations using Hive, MapReduce and then loading data into HDFS.
  • Collected and aggregated large amounts of log data using Flume and staged the data in HDFS for further analysis.
  • Designed and maintained Test workflows to manage the flow of jobs in the cluster.
  • Worked with the testing teams to fix bugs and ensured smooth and error-free code.
  • Prepared documents such as Functional Specification and Deployment Instruction documents.
  • Built DevOps pipelines using OpenShift and Kubernetes for the microservices architecture.
  • Fixed defects during the QA phase, supported QA testing, troubleshot defects, and identified their sources.
  • Involved in installing Hadoop Ecosystem components (Hadoop, MapReduce, Spark, Pig, Hive, Sqoop, Flume, Zookeeper and HBase).
  • Worked collaboratively with all levels of business stakeholders to architect, implement and test Big Data based analytical solution from disparate sources.

Environment: AWS S3, RDS, EC2, Redshift, Hadoop 3.0, Hive 2.3, Pig, Sqoop 1.4.6, Oozie, HBase 1.2, Flume 1.8, Hortonworks, MapReduce, Kafka, HDFS, Oracle 12c, Microsoft, Java, GIS, Spark 2.2, Zookeeper

Spark/Hadoop Developer

Confidential, St. Louis, MO


  • Implemented solutions for ingesting data from various sources and processing data at rest using Big Data technologies such as Hadoop, MapReduce frameworks, HBase, and Hive.
  • Used Sqoop to efficiently transfer data between databases and HDFS and used Flume to stream the log data from servers.
  • Developed the full SDLC of an AWS Hadoop cluster based on the client's business needs.
  • Involved in loading and transforming large sets of structured, semi structured and unstructured data from relational databases into HDFS using Sqoop imports.
  • Implemented an enterprise-grade platform (MarkLogic) for ETL from mainframe to NoSQL (Cassandra).
  • Responsible for importing log files from various sources into HDFS using Flume.
  • Analyzed data using HiveQL to generate per-payer reports from payment summaries for transmission to payers.
  • Imported millions of structured records from relational databases using Sqoop import, processed them with Spark, and stored the data in HDFS in CSV format.
  • Used the DataFrame API in Scala to convert distributed collections of data into data organized in named columns.
  • Performed data profiling and transformation on the raw data using Pig, Python, and Java.
  • Developed predictive analytics using Apache Spark Scala APIs.
  • Involved in big data analysis using Pig and user-defined functions (UDFs).
  • Created Hive external tables, loaded data into the tables, and queried the data using HQL.
  • Implemented Spark Graph application to analyze guest behavior for data science segments.
  • Enhanced the traditional data warehouse based on a star schema, updated data models, and performed data analytics and reporting using Tableau.
  • Involved in migrating data from existing RDBMSs (Oracle and SQL Server) to Hadoop using Sqoop for processing.
  • Developed Shell, Perl and Python scripts to automate and provide Control flow to Pig scripts.
  • Developed prototype for Big Data analysis using Spark, RDD, Data Frames and Hadoop eco system with CSV, JSON, parquet and HDFS files.
  • Developed Hive SQL scripts for performing transformation logic and loading the data from staging zone to landing zone and Semantic zone.
  • Involved in creating Oozie workflow and coordinator jobs for Hive jobs to kick off the jobs on time for data availability.
  • Worked on the Oozie scheduler to automate the pipeline workflow and orchestrate the Sqoop, Hive, and Pig jobs that extract data in a timely manner.
  • Exported the generated results to Tableau for testing by connecting to the corresponding Hive tables using Hive ODBC connector.
  • Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization and user report generation.
  • Managed and led the development effort with the help of a diverse internal and overseas group.

Environment: Big Data, Spark, YARN, Hive, Pig, JavaScript, JSP, HTML, Ajax, Scala, Python, Hadoop, AWS, DynamoDB, Kibana, Cloudera, EMR, JDBC, Redshift, NoSQL, Sqoop, MySQL.

Hadoop Developer/Admin



  • Involved in the end-to-end process of Hadoop cluster setup, including installation, configuration, and monitoring of the Hadoop cluster.
  • Automated Hadoop cluster setup and implemented Kerberos security for various Hadoop services using Hortonworks.
  • Responsible for cluster maintenance, commissioning and decommissioning DataNodes, cluster monitoring, troubleshooting, managing and reviewing data backups, and managing and reviewing Hadoop log files.
  • Monitoring systems and services, architecture design and implementation of Hadoop deployment, configuration management, backup, and disaster recovery systems and procedures.
  • Installed various Hadoop ecosystem components and Hadoop daemons.
  • Responsible for Installation and configuration of Hive, Pig, HBase and Sqoop on the Hadoop cluster.
  • Configured various property files such as core-site.xml, hdfs-site.xml, and mapred-site.xml based on job requirements.
  • Involved in loading data from the UNIX file system to HDFS and in importing and exporting data into HDFS using Sqoop; experienced in managing and reviewing Hadoop log files.
  • Responsible for data extraction and data ingestion from different data sources into the Hadoop data lake ecosystem by creating ETL pipelines using Pig and Hive.
  • Managed and reviewed Hadoop log files as part of administration for troubleshooting purposes. Communicated and escalated issues appropriately.
  • Extracted meaningful data from dealer CSV files, text files, and mainframe files and generated reports with Python pandas for data analysis.
  • Developed Python code on Vagrant machines using version control tools such as GitHub and SVN.
  • Performed data analysis, feature selection, and feature extraction using Apache Spark machine learning and streaming libraries in Python.
  • Analyzed system failures, identified root causes, and recommended courses of action. Documented system processes and procedures for future reference.
  • Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters. Involved in Installing and configuring Kerberos for the authentication of users and Hadoop daemons.

Environment: Hortonworks, Hadoop, HDFS, Pig, Hive, Sqoop, Flume, Kafka, Storm, UNIX, Cloudera Manager, Zookeeper and HBase, Python, Spark, Apache, SQL, ETL

Big Data Developer



  • Involved in the complete SDLC of a big data project, including requirement analysis, design, coding, testing, and production.
  • Extensively used Sqoop to import and export data between RDBMSs and Hive tables, including incremental imports, and created Sqoop jobs that track the last saved value.
  • Developed custom MapReduce programs to analyze data and used Pig Latin to clean unwanted data.
  • Installed and configured Hive and wrote Hive UDFs to implement business requirements.
  • Involved in creating Hive tables, loading data into them, and writing Hive queries that run as MapReduce jobs.
  • Used different compression techniques, such as LZO and Snappy, in Hive tables to save storage and optimize data transfer over the network.
  • Implemented custom interceptors for Flume to filter data and defined channel selectors to multiplex the data into different sinks.
  • Experience in working with Spark SQL for processing data in the Hive tables.
  • Developed scripts and Tidal jobs to schedule a bundle (group of coordinators) consisting of various Hadoop programs using Oozie.
  • Involved in writing test cases, implementing unit test cases.
  • Installed Oozie workflow engine to run multiple Hive and Pig jobs which run independently with time and data availability.
  • Hands-on experience accessing and performing CRUD operations against HBase data using the Java API.
  • Analyzed the data by performing Hive queries and running Pig scripts to know user behavior.
  • Implemented a POC to migrate MapReduce jobs into Spark RDD transformations using Scala.
  • Developed Spark applications using Scala for easy Hadoop transitions.
  • Extensively used Hive queries to query data according to the business requirement.
  • Used Pig for analysis of large data sets and loaded the results back into HBase using Pig.

Environment: Hadoop, HDFS, MapReduce, Hive, Flume, Sqoop, Pig, MySQL, Ubuntu, Zookeeper, CDH3/4 Distribution, Java, Eclipse, Oracle, Shell Scripting.
