
Sr. Spark/Hadoop Developer Resume

Grapevine, TX


  • 8+ years of professional IT experience in analyzing requirements and in designing and building highly distributed, mission-critical products and applications.
  • Highly dedicated, results-oriented Hadoop Developer with 4+ years of strong end-to-end experience in Hadoop development, with varying levels of expertise across different Big Data environment projects.
  • Expertise in core Hadoop and Hadoop technology stack which includes HDFS, MapReduce, Oozie, Hive, Sqoop, Pig, Flume, HBase, Spark, Kafka, and Zookeeper.
  • Experience with the RDD architecture, implementing Spark operations on RDDs, and optimizing transformations and actions in Spark.
  • Reviewed and managed Hadoop log files by consolidating logs from multiple machines using Flume.
  • Collected log data from web servers and integrated it into HDFS using Flume.
  • Wrote Flume configuration files for importing streaming log data into HBase.
  • Experience in importing and exporting data using Sqoop from HDFS to relational database systems and vice versa.
  • Experience in installation and setup of various Kafka producers and consumers along with the Kafka brokers and topics.
  • Hands-on experience in application development using Java, RDBMS, and Linux shell scripting.
  • Experienced in managing Hadoop clusters using Cloudera Manager.
  • Knowledge of ETL methods for data extraction, transformation and loading in corporate-wide ETL Solutions and Data warehouse tools for reporting and data analysis.
  • Experience with Oozie Workflow Engine in running workflow jobs with actions that run Java MapReduce and Pig jobs.
  • Strong hands-on experience with PySpark, using Spark libraries through Python scripting for data analysis.
  • Implemented data science algorithms like shift detection in critical data points using Spark, doubling the performance.
  • Experience in importing and exporting the data using Sqoop from HDFS to Relational Database systems/mainframe and vice-versa.
  • Extended Hive and Pig core functionality by writing custom UDFs.
  • Experience in analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Java.
  • Experience in Apache Flume for efficiently collecting, aggregating, and moving large amounts of log data.
  • Involved in developing web services using REST and the HBase native API client to query data from HBase.
  • Experienced in working with structured data using HiveQL, join operations, Hive UDFs, partitions, bucketing, and internal/external tables.
  • Created custom UDFs for Pig and Hive to bring Python/Java methods and functionality into Pig Latin and HQL (HiveQL).
  • Involved in converting Cassandra/Hive/SQL queries into Spark transformations using Spark RDDs in Scala and Python.
  • Used a highly available AWS environment to launch applications in different regions and implemented CloudFront with AWS Lambda to reduce latency.
  • Implemented CRUD operations using CQL on top of Cassandra file system.
  • Used Cassandra CQL with Java APIs to retrieve data from Cassandra tables.
  • Set up Solr for distributed indexing and search.
  • Used Solr to index non-primary-key columns from Cassandra keyspaces so that they could be searched.
  • Excellent working knowledge of Spark Core, Spark SQL, and Spark Streaming.
  • Real-time exposure to Amazon Web Services, the AWS command line interface, and AWS Data Pipeline.
  • Work experience with cloud infrastructure like Amazon Web Services (AWS).
  • Extensive experience in working with various distributions of Hadoop, including enterprise versions of Cloudera (CDH4/CDH5) and Hortonworks, and good knowledge of the MapR distribution, IBM BigInsights, and Amazon's EMR (Elastic MapReduce).
  • Experience in designing and developing a POC in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Developed automated processes for flattening upstream data from Cassandra, which is in JSON format, using Hive UDFs to flatten the JSON data.
  • Expertise in developing responsive front-end components with JavaScript, JSP, HTML, XHTML, Servlets, Ajax, and AngularJS.
  • Experience as a Java Developer in Web/intranet, client/server technologies using Java, J2EE, Servlets, JSP, JSF, EJB, JDBC and SQL.
  • Experience in setting up automated monitoring and escalation infrastructure for Hadoop Cluster using Ganglia and Nagios.
  • Good understanding of Apache Hue.
  • Techno-functional responsibilities include interfacing with users, identifying functional and technical gaps, estimates, designing custom solutions, development, leading developers, producing documentation, and production support.
  • Proficient with version control tools such as GitHub and SVN.


Hadoop Distribution: Hortonworks, Cloudera (CDH3, CDH4, CDH5), Apache, Amazon AWS (EMR), MapR, and Azure.

Hadoop Data Services: Hadoop HDFS, MapReduce, YARN, Hive, Pig, Pentaho, HBase, ZooKeeper, Sqoop, Oozie, Cassandra, Spark, Scala, Storm, Flume, Kafka, Avro, Parquet, Snappy, NiFi.

Hadoop Operational Services: Zookeeper, Oozie

NoSQL Databases: HBase, Cassandra, MongoDB, Neo4j, Redis

Cloud Services: Amazon AWS

Languages: SQL, PL/SQL, Pig Latin, HiveQL, Unix Shell Scripting, HTML, XML (XSD, XSLT, DTD), C, C++, Java, JavaScript, Python, Scala

ETL Tools: Informatica, IBM DataStage, Talend

Java & J2EE Technologies: Core Java, Servlets, Hibernate, Spring, Struts, JSP, JDBC, EJB

Application Servers: WebLogic, WebSphere, Tomcat.

Databases: Oracle, MySQL, DB2, Teradata, MS SQL Server, SQL/NoSQL, HBase, Cassandra, Neo4j

Operating Systems: UNIX, Windows, iOS, LINUX

Methodologies: Agile(Scrum), Waterfall

Other Tools: Putty, WinSCP, Stream Weaver.


Confidential, Grapevine, TX

Sr. Spark/Hadoop Developer


  • Migrated the existing architecture to Spark Streaming to process live streaming data.
  • Responsible for Spark Core configuration based on the type of input source.
  • Executed Spark code using Scala for Spark Streaming/SQL for faster processing of data.
  • Performed SQL joins among Hive tables to get input for the Spark batch process.
  • Gathered the business requirements from the Business Partners and Subject Matter Experts.
  • Developed Python code to gather data from HBase and designed the solution for implementation using PySpark.
  • Developed PySpark code to mimic the transformations performed in the on-premise environment.
  • Analyzed the SQL scripts and designed solutions to implement using PySpark. Created custom new columns, depending on the use case, while ingesting data into the Hadoop lake using PySpark.
  • Analyzed the Cassandra database and compared it with other open-source NoSQL databases to find which one best suits the current requirement.
  • Integrated Cassandra as a distributed persistent metadata store to provide metadata resolution for network entities.
  • Implemented Spark using Scala and also used PySpark with Python for faster testing and processing of data.
  • Designed multiple Python packages used within a large ETL process that loaded 2 TB of data from an existing Oracle database into a new PostgreSQL cluster.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
  • Loaded data from the Linux file system to HDFS and vice versa.
  • Developed UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation queries, writing the results back into OLTP through Sqoop.
  • Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • Installed and monitored Hadoop ecosystem tools on multiple operating systems such as Ubuntu and CentOS.
  • Exported the analyzed patterns back into Teradata using Sqoop.
  • Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
  • Participated in development/implementation of the Cloudera Impala Hadoop environment.
  • Utilized the Apache Hadoop environment by Cloudera.
  • Collected data using Spark Streaming and dumped it into the Cassandra cluster.
  • Developed Scala scripts using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark for data aggregation and queries, writing data back into the OLTP system through Sqoop.
  • Extensively used Zookeeper as job scheduler for Spark jobs.
  • Extended Hive and Pig core functionality by writing custom UDFs.
  • Wrote Java code to format XML documents and upload them to the Solr server for indexing.
  • Used AWS to export MapReduce jobs into Spark RDD transformations.
  • Wrote AWS Terraform templates for automation requirements in AWS services.
  • Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive.
  • Deployed and configured AWS EC2 instances for client websites moving from self-hosted services, for scalability purposes.
  • Worked with multiple teams to provision AWS infrastructure for development and production environments.
  • Experience in designing Kafka for a multi-data-center cluster and monitoring it.
  • Designed the number of partitions and the replication factor for Kafka topics based on business requirements.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark).
  • Implemented a distributed messaging queue to integrate with Cassandra using Apache Kafka and Zookeeper.
  • Experience with Kafka and Spark integration for real-time data processing.
  • Developed Kafka producer and consumer components for real-time data processing.
  • Hands-on experience setting up Kafka MirrorMaker for data replication across clusters.
  • Experience in configuring, designing, implementing, and monitoring Kafka clusters and connectors.
  • Performed Oracle SQL tuning using explain plan.
  • Manipulated, serialized, and modeled data in multiple formats such as JSON and XML.
  • Involved in setting up MapReduce 1 and MapReduce 2.
  • Prepared Avro schema files for generating Hive tables.
  • Used Impala connectivity from the user interface (UI) and queried the results using ImpalaQL.
  • Worked on physical transformations of the data model, which involved creating tables, indexes, joins, views, and partitions.
  • Involved in analysis, design, system architectural design, process interface design, and documentation.
  • Used Jira for bug tracking and Bitbucket to check in and check out code changes.
  • Involved in Cassandra data modelling to create keyspaces and tables in a multi-data-center DSE Cassandra DB.
  • Utilized Agile and Scrum Methodology to help manage and organize a team of developers with regular code review sessions.

Environment: Cloudera, Spark, Impala, Sqoop, Flume, Cassandra, Kafka, Hive, Zookeeper, Oozie, RDBMS, AWS.




  • Analyzed large and critical datasets using Cloudera, HDFS, HBase, MapReduce, Hive, Hive UDF, Pig, Sqoop, Zookeeper and Spark.
  • Responsible for managing data coming from different sources.
  • Developed Batch Processing jobs using Pig and Hive.
  • Involved in gathering the business requirements from the Business Partners and Subject Matter Experts.
  • Worked with different File Formats like TEXTFILE, AVROFILE, ORC, and PARQUET for HIVE querying and processing.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Implemented Elasticsearch on the Hive data warehouse platform.
  • Good experience in analyzing Hadoop clusters and different analytic tools such as Pig and Impala.
  • Experienced in managing and reviewing Hadoop log files.
  • Extracted files from CouchDB through Sqoop, placed them in HDFS, and processed them.
  • Experienced in running Hadoop streaming jobs to process terabytes of XML-format data.
  • Experienced in working with the Spark ecosystem using Spark SQL and Scala queries on different formats such as text and CSV files.
  • Enabled concurrent access to Hive tables with shared and exclusive locking, which Hive supports through the Zookeeper implementation in the cluster.
  • Stored and loaded data from HDFS to Amazon S3 and backed up the namespace data to NFS.
  • Implemented NameNode backup using NFS for high availability.
  • Designed workflows and coordinators in Oozie to automate and parallelize Hive jobs on Apache Hadoop environment by Hortonworks (HDP 2.2).
  • Responsible for building scalable distributed data solutions using a Hadoop cluster environment with the Hortonworks distribution.
  • Integrated HiveServer2 with Tableau using the Hortonworks Hive ODBC driver for auto-generation of Hive queries for non-technical business users.
  • Troubleshot issues, managed and reviewed data backups, and managed and reviewed Hadoop log files on the Hortonworks cluster.
  • Used Pig to perform data validation on the data ingested using Sqoop and Flume, and pushed the cleansed data set into MongoDB.
  • Ingested streaming data with Apache NiFi into Kafka.
  • Worked with NiFi to manage the flow of data from sources through automated data flows.
  • Designed and implemented the MongoDB schema.
  • Wrote services to store and retrieve user data from MongoDB for the application on devices.
  • Used the Mongoose API to access MongoDB from Node.js.
  • Created and implemented business validation and coverage Price Gap rules in Talend on Hive.
  • Wrote shell scripts to automate rolling day-to-day processes.
  • Wrote shell scripts to monitor Hadoop daemon services and respond to any warning or failure conditions.

Environment: Apache Flume, Hive, Pig, HDFS, Zookeeper, Sqoop, RDBMS, AWS, MongoDB, Talend, Shell Scripts, Eclipse, WinSCP, Hortonworks.
