Big Data/Hadoop Developer Resume

Albany, New York


  • 8 years of Big Data development experience in the analysis, design, development and implementation of large-scale applications, with a focus on Big Data technologies such as Apache Spark, Hadoop, Hive, Pig, Sqoop, Oozie, HBase, Zookeeper, Python and Scala.
  • Overall, 14+ years of experience working with various databases, including Oracle, SQL, and Big Data.
  • Experience in Analysis, Development, Testing, Implementation, Maintenance and Enhancements on various IT Data Warehousing Projects.
  • Strong experience working with HDFS, MapReduce, Spark, AWS, Hive, Impala, Pig, Sqoop, Flume, Kafka, NiFi, Oozie, HBase, MSSQL and Oracle.
  • Excellent knowledge and working experience in Agile & Waterfall methodologies.
  • Extensive experience with the Amazon EMR, Cloudera and Hortonworks Hadoop distributions, and in maintaining and optimizing AWS infrastructure (EMR, EC2, S3, EBS, CloudFormation, Redshift, and DynamoDB).
  • Experienced in writing database objects such as stored procedures, functions, triggers, PL/SQL packages and cursors for Oracle, SQL Server, MySQL and Sybase databases.
  • Strong understanding of NoSQL databases such as HBase, MongoDB and Cassandra.
  • Hands-on experience with Hadoop, HDFS, MapReduce and the Hadoop ecosystem (Pig, Hive, Oozie, Flume and HBase).
  • In-depth knowledge of Hadoop architecture and its components like YARN, HDFS, Name Node, Data Node, Job Tracker, Application Master, Resource Manager, Task Tracker and MapReduce programming paradigm.
  • Expertise in installing, configuring, supporting and managing Hadoop clusters using the Apache, Cloudera (CDH3, CDH4) and Hortonworks distributions, and on Amazon Web Services (AWS).
  • Expertise in developing Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Excellent knowledge of Hadoop architecture and ecosystem components such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, YARN and the MapReduce programming paradigm.
  • Experience with version control tools such as Bitbucket, Git and SVN.
  • Experienced with major components of the Hadoop ecosystem, including Hive, Sqoop and Flume, with knowledge of the MapReduce/HDFS framework.
  • Used various IDEs for development, such as Eclipse, NetBeans and IntelliJ, and Erwin for database schema design.
  • Experienced in applying MapReduce design patterns to complex MapReduce programs.
  • Excellent knowledge of Talend Big Data integration for meeting business demands involving Hadoop and NoSQL.
  • Worked with Yarn Queue Manager to allocate queue capacities for different service accounts.
  • Excellent working knowledge of Sqoop and Flume for data processing.
  • In-depth understanding of Spark Architecture including Spark Core, Spark SQL, Spark Streaming, Spark MLlib.
  • Expertise in loading data from sources such as Teradata and DB2 into HDFS using Sqoop, and then into partitioned Hive tables.
  • Experience with job workflow scheduling and monitoring tools such as Oozie and Zookeeper.
  • Developed Spark applications using Scala and Python.
  • Worked on classic (MRv1) and YARN-based distributions of Hadoop, such as Apache Hadoop and Cloudera CDH4 and CDH5.
  • Exposure to analyzing data using HiveQL, HBase and MapReduce programs in Eclipse.
  • Experienced in Hadoop cluster maintenance, including data and metadata backups, file system checks, commissioning and decommissioning nodes, and upgrades.
  • Extensive experience writing custom MapReduce programs for data processing and UDFs for both Hive and Pig.
  • Strong experience in analyzing large data sets by writing Pig scripts and Hive queries.
  • Installed patches, security fixes and packages on AIX and Linux.
  • Experienced in importing and exporting data between HDFS and relational databases using Sqoop.
  • Expertise in job workflow scheduling and monitoring tools like Oozie.
  • Experienced in Apache Flume for collecting, aggregating and moving huge chunks of data from various sources such as web server, telnet sources etc.
  • Extensively designed and executed SQL queries in order to ensure data integrity and consistency at the backend.
  • Strong experience in architecting batch style large scale distributed computing applications using tools like Flume, MapReduce, Hive etc.
  • Experience using various Hadoop distributions (Cloudera, Hortonworks, MapR) to fully implement and leverage new Hadoop features.
  • Strong experience working in Unix/Linux environments and writing Python scripts.
  • Great team player and quick learner with effective communication, motivation, and organizational skills combined with attention to detail and business improvements.


Programming Languages: Python, Scala, Shell Scripting, SQL, PL/SQL and Java.

Big Data Ecosystem: HDFS, HBase, Hadoop, MapReduce, Hive, Pig, Sqoop, Impala, Cassandra, Oozie, Zookeeper, Flume, Storm, Spark and Kafka.

Databases: Oracle, SQL Server, MySQL, MongoDB, Cassandra, HBase

Web Services: WebLogic, WebSphere, Apache Tomcat

IDEs: Eclipse, NetBeans, IntelliJ

Cloud Technologies: AWS (EC2, S3, AMI, Redshift, DynamoDB, CloudFormation, EBS, CloudTrail, Route 53, Lambda), Azure

Methodologies: Software Development Life Cycle (SDLC), Waterfall, Agile, Software Testing Life Cycle (STLC), UML, Design Patterns.

Other Tools: Maven, ANT, WSDL, REST, Git, Bitbucket.


Confidential, Albany, New York

Big Data/Hadoop Developer


  • Involved in complete Big Data flow of the application starting from data ingestion from upstream to HDFS, processing and analyzing the data in HDFS.
  • Developed Spark code using the Spark API to import data into HDFS from Teradata, and created Hive tables.
  • Used Amazon EMR to process data from S3 directly into the Hadoop Distributed File System (HDFS) on the EMR cluster, setting up Spark Core for the analysis work.
  • Worked on Lambda Architecture for both batch processing and real-time streaming.
  • Implemented various Hadoop Distribution environments such as Cloudera.
  • Worked on Hadoop ecosystem components including Hive, MongoDB, Zookeeper and Spark Streaming with the MapR distribution.
  • Worked on S3 buckets on AWS to store CloudFormation templates and worked on AWS to create EC2 instances.
  • Designed the AWS cloud migration, covering AWS EMR, DynamoDB, Redshift and event processing using Lambda functions.
  • Analyzed the existing data flow to the warehouses and took a similar approach to migrate the data into HDFS.
  • Developed and implemented Apache NiFi across various environments; wrote QA scripts in Python for tracking files.
  • Monitored workload, job performance and capacity planning using Cloudera Manager.
  • Followed agile software development with Scrum methodology.
  • Applied partitioning, bucketing, map-side joins and parallel execution to optimize Hive queries, cutting execution time from hours to minutes.
  • Managed and reviewed data backups and Hadoop log files on the Cloudera cluster.
  • Analyzed the SQL scripts and designed the solution to implement using PySpark
  • Involved in gathering requirements from client and estimating timeline for developing complex queries using Hive for logistics application.
  • Worked with cloud provisioning team on a capacity planning and sizing of the nodes (Master and Slave) for an AWS EMR Cluster.
  • Installed application on AWS EC2 instances and configured the storage on S3 buckets.
  • Involved in converting HiveQL into Spark transformations using Spark RDD and through Scala programming.
  • Used Zookeeper for various types of centralized configurations.
  • Built Hadoop solutions for big data problems using MR1 and MR2 on YARN.
  • Ran multiple MapReduce jobs through Sqoop and Hive for data cleaning and pre-processing.
  • Involved in building a data pipeline using Sqoop to ingest cargo data and customer histories into HDFS for analysis.
  • Moved data between HDFS and relational database systems in both directions using Sqoop, including maintenance and troubleshooting.
  • Involved in identifying job dependencies to design workflow for Oozie & YARN resource management.
  • Created Hive Tables, loaded claims data from Oracle using Sqoop and loaded the processed data into target database.
  • Exported data from HDFS to RDBMS via Sqoop for Business Intelligence, visualization and user report generation.
  • Responsible for loading the customer's data and event logs from Kafka into HBase using REST API.
  • Created build and deploy plans for the S3 bucket using Bamboo.
  • Used Hue to verify the data using the Amazon EMR Query cluster.
  • Implemented MapReduce jobs in Hive by querying the available data.
  • Extensively used Zookeeper as job scheduler for Spark jobs.
  • Configured the Hive metastore with MySQL to store the metadata for Hive tables.
  • Performed data analytics in Hive and then exported those metrics back to Oracle Database using Sqoop.
  • Tuned the performance of Hive queries and MapReduce programs for different applications.
  • Created custom UDFs in Scala for Spark and Kafka procedures to replace non-working functionality in the production environment.
  • Configured default and region-specific parameters for creating the S3 buckets.
  • Developed workflows in Oozie and scheduled jobs on mainframes; prepared the data refresh strategy and capacity planning documents required for project development and support.
  • Designed Oozie workflows using different action types, such as Sqoop, Pig, Hive and Python actions.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Explored new technologies such as AWS, Apache Flink and Apache NiFi that could increase business value.
  • Worked extensively with the Cloudera distribution and numerous open source projects, and prototyped various applications that use modern Big Data tools.
  • Implemented the Fair Scheduler on the Job Tracker to share cluster resources among the MapReduce jobs submitted by users.
  • Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
  • Involved in collecting and aggregating large amounts of log data using Flume and staging data in HDFS for further analysis.
  • Implemented Flume custom interceptors to perform cleansing operations before moving data into HDFS.

Environment: HDFS, Cloudera, EMR, Hive, AWS (EC2, EMR, S3, Lambda, Redshift, DynamoDB, CloudFormation), RDBMS, Pig, Sqoop, Linux, MySQL, Kafka, Spark, Scala, Oozie, Hadoop, MapReduce, HBase, Zookeeper, Flume, Python.

Confidential, Raleigh, North Carolina

Hadoop Developer


  • Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
  • Developed MapReduce programs to parse the raw data, populate tables and store the refined data in partitioned tables.
  • Involved in creating Hive Tables, loading with data and writing Hive queries which will invoke and run MapReduce jobs in the backend.
  • Worked with AWS services including EC2, RDS, S3, Redshift, CloudTrail and Route 53, and with Hadoop tools such as Hive, Pig, Sqoop, Oozie, HBase, Flume and PySpark.
  • Responsible for building and configuring distributed data solution using MapR distribution of Hadoop.
  • Automated the generation of HQL, the creation of Hive tables and the loading of data into Hive tables using Apache NiFi and Oozie.
  • Wrote scripts to distribute queries for performance test jobs in the Amazon data lake.
  • Created Hive Tables, loaded transactional data from Teradata using Sqoop.
  • Developed MapReduce (Yarn) jobs for cleaning, accessing and validating the data.
  • Created and worked Sqoop jobs with incremental load to populate Hive External tables.
  • Set up end-to-end ETL orchestration of this framework in AWS using Spark GraphFrames, Spark SQL, AWS hd, S3, EMR, Data Pipeline, SNS, EC2, Redshift, IAM and VPC.
  • Wrote efficient serverless AWS Lambda functions in Python using the Boto3 API to dynamically activate AWS Data
  • Developed strategies for distributing web log data over the cluster, importing and exporting the stored web log data into HDFS and Hive using Sqoop.
  • Installed and configured Apache Hadoop on multiple nodes on AWS EC2.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala; gained good experience with Spark Streaming.
  • Developed Pig Latin scripts to move the existing legacy process to Hadoop, with the output data fed to AWS S3.
  • Responsible for building scalable distributed data solutions using Cloudera Hadoop.
  • Designed and developed automation test scripts using Python
  • Integrated Apache Storm with Kafka to perform web analytics and to move clickstream data from Kafka to HDFS.
  • Involved in Installing, Configuring Hadoop Eco-System, Cloudera Manager using CDH3, CDH4 Distributions.
  • Implemented a log producer in Scala that watches application logs, transforms incremental log entries and sends them to a Kafka- and Zookeeper-based log collection platform.
  • Used Zookeeper for various types of centralized configurations.
  • Wrote Pig scripts to transform raw data from several data sources into baseline data.
  • Analyzed the SQL scripts and designed the solution to implement using PySpark
  • Implemented Hive generic UDFs to incorporate business logic into Hive queries.
  • Responsible for developing a data pipeline on AWS to extract data from web logs and store it in HDFS.
  • Worked on Apache NiFi as an ETL tool for batch processing and real-time processing.
  • Uploaded streaming data from Kafka to HDFS, HBase and Hive by integrating with Storm.
  • Performed various transformations and storage in Hadoop Architecture using HDFS, Map Reduce, PySpark.
  • Gained extensive hands-on experience with Amazon Web Services (AWS) cloud services such as EC2, S3 and EMR.
  • Used Apache Spark and Scala language to find patients with similar symptoms in the past and medications used for them to achieve results.
  • Worked on various MapReduce framework architectures (MRv1 and YARN).
  • Analyzed web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration and the most visited page on the website.
  • Supported data analysis projects using Elastic MapReduce on the Amazon Web Services (AWS) cloud; performed export and import of data to and from S3.
  • Implemented KBB's Big data ETL processes in AWS, using Hive, Spark, AWS Lambda, S3, EMR, Data pipeline, EC2, Redshift, Athena, SNS, IAM and VPC.
  • Reduced AWS costs by writing Lambda functions that automatically spin up and shut down the Redshift clusters.
  • Used Git for Source Code Management.
  • Integrated Kafka with PySpark Streaming for real time data processing.
  • Worked on MongoDB by using CRUD (Create, Read, Update and Delete), Indexing, Replication and Sharding features.
  • Worked with cloud provisioning team on a capacity planning and sizing of the nodes (Master and Slave) for an AWS EMR Cluster.
  • Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as MapReduce, Pig, Hive, and Sqoop) as well as system specific jobs (such as Java programs and Python scripts).
  • Worked on custom Talend jobs to ingest, enrich and distribute data in the Cloudera Hadoop ecosystem.
  • Developed syllabus/Curriculum data pipelines from Syllabus/Curriculum Web Services to HBase and Hive tables.
  • Worked on Cluster co-ordination services through Zookeeper.
  • Created feature, develop and release branches in Git for different applications to support releases and CI builds.
  • Built applications using Maven and integrated them with CI servers such as Jenkins to run build jobs.
  • Exported the analyzed data to the RDBMS using Sqoop to generate reports for the BI team.
  • Worked collaboratively with all levels of business stakeholders to implement and test Big Data based analytical solution from disparate sources.

Environment: Hadoop, HDFS, Cloudera, Teradata r15, Sqoop, Linux, YARN, MapReduce, AWS (EC2, RDS, S3, Redshift, Lambda, CloudTrail, Route 53), Python, Kafka, Pig, SQL, Hive, HBase, Oozie, RDBMS, Spark, Spark Streaming, Scala, Zookeeper, Java.

Confidential, Chicago, Illinois

Hadoop Developer


  • Experienced in designing and deployment of Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, Oozie, Sqoop, Kafka, Spark, and Impala
  • Deployment and administration of Splunk and Hortonworks Distribution
  • Used the Spark API over Cloudera Hadoop YARN to perform analytics on data
  • Implemented Data Ingestion in real time processing using Kafka
  • Built a 300-node cluster while designing the data lake on the Hortonworks distribution
  • Expertise in integrating Kafka with Spark streaming for high speed data processing
  • Supported NoSQL in enterprise production and loaded data into HBase using Impala and Sqoop.
  • Developed multiple Kafka Producers and Consumers as per the software requirement specifications
  • Involved in scheduling Oozie workflow engine to run multiple Hive and pig jobs
  • Data profiling in SQL, Shell scripts for Database Solution with more than 40 sources, 30 million records and hundreds of attributes.
  • Developed simple to complex MapReduce jobs using Hive and Pig.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames and saved it in Parquet format in HDFS
  • Analyzed large data sets by running Hive queries and Pig scripts.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala
  • Experience in using Solr and Zookeeper technologies
  • Created files and tuned SQL queries in Hive using Hue (Hadoop User Experience).
  • Involved in unit testing MapReduce jobs using MRUnit.
  • Uploaded streaming data from Kafka to HDFS, HBase and Hive by integrating with Storm.
  • Worked with Apache Spark, a fast, general-purpose engine for large-scale data processing, together with the functional programming language Scala
  • Used Impala to read, write and query the Hadoop data in Hive.
  • Created Linux shell Scripts to automate the daily ingestion of IVR data
  • Developed multiple POCs using Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL/Teradata.
  • Experience in performance tuning a Cassandra cluster to optimize it for writes and reads
  • Involved in loading data from LINUX file system to HDFS
  • Wrote Hive queries and ran the scripts in Tez mode to improve performance on the Hortonworks Data Platform.
  • Created UNIX Shell Scripts for automating deployments and other routine tasks.
  • Gathering User requirements and designing technical and functional specifications
  • Deployed an Apache Solr search engine server to help speed up searches of government cultural assets.
  • Designed and developed a data ingestion pipeline to load data into Apache Solr on a daily basis.
  • Involved in collecting, aggregating and moving data from servers to HDFS using Apache Flume.
  • Worked on analyzing the cluster through Hue and with different big data analytic tools, including Pig, HBase and Sqoop
  • Responsible for building scalable distributed data solutions using Hadoop
  • Implemented a nine-node CDH3 Hadoop cluster on Red Hat Linux
  • Developed simple to complex MapReduce jobs in Java for processing and validating the data.
  • Developed data pipeline using Sqoop, Spark, MapReduce, and Hive to ingest, transform and analyze operational data.
  • Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries and Pig scripts.
  • Worked on installing cluster, commissioning & decommissioning of data node, name node recovery, capacity planning, and slots configuration
  • Created HBase tables to store data in variable formats coming from different portfolios
  • Implemented a script to transmit sysprin information from Oracle to HBase using Sqoop
  • Implemented Data Integrity and Data Quality checks in Hadoop using Hive and Linux scripts.
  • Implemented test scripts to support test driven development and Continuous Integration
  • Worked on tuning the performance of Pig queries
  • Created ODBC connection through Sqoop between Hortonworks and SQL Server
  • Worked with application teams to install operating system, Hadoop updates, patches, version upgrades

Environment: Hadoop, Hortonworks, HDFS, MapReduce, Pig, Hive, YARN, Sqoop, Oozie, Shell Scripting, Impala, Spark, Spark-SQL, HBase, Cassandra, Solr, Zookeeper, Scala, Kafka, Cloudera, Git.

Confidential, Wisconsin

Sr. Database Administrator


  • Responsible for maintenance of both OLTP and DSS (data warehouse, data mart) databases.
  • Migrated databases from HP-UX 11.11 PA-RISC to HP Itanium 11.23, moved Oracle 10g OEM Grid Control packages to Itanium and re-installed OMS and agents.
  • Performed backup and restore using RMAN, cloning and schema object refreshes; installed and configured Oracle 10g RAC on a two-node cluster; maintained a Data Guard standby database using Data Guard broker.
  • Participated in data modeling using Erwin (reverse and forward engineering), compared database changes, applied new builds to the database and maintained version control for builds.
  • Participated in performance monitoring and tuning, diagnosed and resolved issues accordingly.

Environment: Oracle 9i, 10g, 10g Grid Control, HP-UX, AIX, Solaris, Linux, WebSphere, Toad, SQL Server, DB2
