Spark/Hadoop Developer Resume

Chicago, IL


  • 7 years of professional IT experience in all phases of the Software Development Life Cycle, including hands-on experience in Java/J2EE technologies and Big Data analytics.
  • 4+ years of work experience in ingestion, storage, querying, processing and analysis of Big Data, with hands-on experience in Hadoop ecosystem development including MapReduce, HDFS, Hive, Pig, Spark, Cloudera Navigator, Mahout, HBase, ZooKeeper, Sqoop, Flume, Oozie and AWS.
  • Extensive experience working with Teradata, Oracle, Netezza, SQL Server and MySQL databases.
  • Excellent understanding and knowledge of NoSQL databases like MongoDB, HBase and Cassandra.
  • Strong experience working with different Hadoop distributions like Cloudera, Hortonworks, MapR and Apache distributions.
  • Experienced in installing, configuring, supporting and managing Hadoop clusters using Apache and Cloudera (CDH 5.x) distributions and on Amazon Web Services (AWS).
  • Experience with AWS services such as EMR, EC2, S3, CloudFormation and Redshift, which provide fast and efficient processing of Big Data.
  • In-depth understanding of Hadoop architecture and its components such as HDFS, MapReduce, Hadoop Gen2 Federation, High Availability and YARN architecture.
  • Good hands-on experience developing Hadoop applications on Spark using Scala as a functional and object-oriented programming language.
  • Understanding of workload management, scalability and distributed platform architectures.
  • Good understanding of R Programming, Data Mining and Machine Learning techniques.
  • Strong experience and knowledge of real time data analytics using Storm, Kafka, Flume and Spark.
  • Experienced in troubleshooting errors in HBase Shell, Pig, Hive and MapReduce.
  • Experienced in installing and maintaining Cassandra by configuring the cassandra.yaml file as per the requirement.
  • Involved in upgrading existing MongoDB instances from version 2.4 to version 2.6 by upgrading the security roles and implementing newer features.
  • Responsible for performing reads and writes in Cassandra from a web application using Java JDBC connectivity.
  • Experience setting up databases in AWS using RDS, storage using S3 buckets, and configuring instance backups to S3 buckets.
  • Implemented and maintained the monitoring and alerting of production and corporate servers/costs using CloudWatch.
  • Experience designing and configuring secure Virtual Private Clouds (VPC) with private and public networks in AWS, and creating various subnets, routing tables and internet gateways for servers.
  • Experience bootstrapping and maintaining AWS using Chef on complex hybrid IT infrastructure nodes through the VPN and jump servers.
  • Experience in creating and performance tuning Vertica and Hive scripts.
  • Strong command of Informatica PowerCenter, Oracle, Vertica, Hive, SQL Server, shell scripting and QlikView.
  • Very good understanding and working knowledge of Object Oriented Programming (OOP), Python and Scala.
  • Experienced in extending Hive and Pig core functionality using custom UDFs and UDAFs.
  • Debugged MapReduce jobs using counters and MRUnit testing. Implemented Spark scripts using Scala and Spark SQL to access Hive tables in Spark for faster data processing.
  • Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Expertise in writing real-time processing applications using spouts and bolts in Storm.
  • Experienced in configuring various topologies in Storm to ingest and process data on the fly from multiple sources and aggregate it into a central Hadoop repository.
  • Good understanding of Spark Algorithms such as Classification, Clustering, and Regression.
  • Good understanding of Spark Streaming with Kafka for real-time processing.
  • Extensive experience working with Spark tools like RDD transformations, Spark MLlib and Spark SQL.
  • Experienced in moving data from different sources using Kafka producers and consumers, and preprocessing data using Storm topologies.
  • Experienced in migrating ETL transformations using Pig Latin Scripts, transformations, join operations.
  • Good understanding of MPP databases such as HP Vertica, Greenplum and Impala.
  • Good knowledge of streaming data from different data sources like log files, JMS and application sources into HDFS using Flume sources.
  • Experienced in importing and exporting data using Sqoop from HDFS to Relational Database Systems and vice-versa.
  • Worked on Docker-based containerized applications.
  • Knowledge of data warehousing and ETL tools like Informatica, Talend and Pentaho.
  • Experienced in working with monitoring tools to check cluster status using Cloudera Manager, Ambari and Ganglia.
  • Developed Spark scripts using Scala shell commands as per the requirement.
  • Experienced in testing MapReduce programs using MRUnit and JUnit.
  • Extensive experience in middle-tier development using J2EE technologies like JDBC, JNDI, JSP, Servlets, JSF, Struts, Spring, Hibernate, EJB.
  • Expert in Microsoft Power BI and Tableau reports and dashboards, and in publishing them to end users for executive-level business decisions.
  • Expert in maintaining and managing Tableau and Power BI driven reports and dashboards.
  • Expertise in developing responsive front-end components with JSP, HTML, XHTML, JavaScript, DOM, Servlets, JSF, NodeJS, Ajax, jQuery and AngularJS.
  • Extensive experience working with SOA-based architectures using REST-based web services with JAX-RS and SOAP-based web services with JAX-WS.
  • Building, publishing customized interactive reports and dashboards, report scheduling using


Big Data Ecosystem: Hadoop, MapReduce, Pig, Hive, YARN, Kafka, Flume, Sqoop, Impala, Oozie, ZooKeeper, Spark, Solr, Storm, Drill, Ambari, Mahout, MongoDB, Cassandra, Avro, Parquet and Snappy.

Hadoop Distributions: Cloudera, MapR, Hortonworks, IBM BigInsights

Languages: Java, Scala, Python, Ruby, SQL, HTML, DHTML, JavaScript, XML and C/C++

NoSQL Databases: Cassandra, MongoDB and HBase

Java Technologies: Servlets, JavaBeans, JSP, JDBC, JNDI, EJB and Struts

XML Technologies: XML, XSD, DTD, JAXP (SAX, DOM), JAXB

Web Design Tools: HTML, DHTML, AJAX, JavaScript, jQuery, CSS, AngularJS, ExtJS and JSON

Development / Build Tools: Eclipse, Ant, Maven, Gradle, IntelliJ, JUnit and Log4j.

Frameworks: Struts, Spring and Hibernate

App/Web servers: WebSphere, WebLogic, JBoss and Tomcat

DB Languages: MySQL, PL/SQL, PostgreSQL and Oracle

RDBMS: Teradata, Oracle PL/SQL, MS SQL Server, MySQL and DB2

Operating systems: UNIX, LINUX, Mac OS and Windows Variants

Data analytical tools: R, SAS and MATLAB

ETL Tools: Ab Initio, Informatica PowerCenter and Pentaho

Reporting tools: Tableau


Confidential, Chicago IL

Spark/Hadoop Developer


  • Responsible for building scalable distributed data solutions using Hadoop.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on the fly to build the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra.
  • Configured, deployed and maintained multi-node Dev and Test Kafka clusters.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation, queries and writing data back into the OLTP system through Sqoop.
  • Worked on optimizing existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
  • Implemented the ELK (Elasticsearch, Logstash, Kibana) stack to collect and analyze the logs produced by the Spark cluster.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark.
  • Designed, developed and maintained data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems, as well as RDBMS and NoSQL data stores for data access and analysis.
  • Worked on a POC to compare the processing time of Impala with Apache Hive for batch applications, in order to implement the former in the project.
  • Worked on a 130-node cluster.
  • Worked extensively with Sqoop for importing metadata from Oracle.
  • Analyzed the SQL scripts and designed the solution to implement using PySpark.
  • Responsible for developing a data pipeline with AWS to extract data from weblogs and store it in HDFS.
  • Used Amazon CloudWatch with Amazon EC2 instances to monitor log files, store them and track metrics.
  • Created AWS S3 buckets, performed folder management in each bucket, and managed CloudTrail logs and objects within each bucket.
  • Created highly available environments using Auto Scaling, load balancers and SQS.
  • Hands-on experience in AWS Cloud with various AWS services such as Redshift, Cluster and Route 53 domain configuration.
  • Responsible for development, support and maintenance of the ETL (Extract, Transform and Load) processes using Informatica PowerCenter.
  • Created Chef automation tools and builds, and improved manual processes overall.
  • Wrote Chef cookbooks for various DB configurations to modularize and optimize product configuration, converted production support scripts to Chef recipes, and provisioned AWS servers using Chef recipes.
  • Experienced with Terraform on AWS.
  • Wrote scripts in Python.
  • Developed Oozie workflows for scheduling and orchestrating the ETL process.
  • Involved in creating Hive tables, and loading and analyzing data using Hive queries.
  • Developed Hive queries to process the data and generate data cubes for visualization.
  • Implemented schema extraction for Parquet and Avro file formats in Hive.
  • Implemented partitioning, dynamic partitions and buckets in Hive.
  • Used Reporting tools like Tableau to connect with Hive for generating daily reports of data.
  • Collaborated with the infrastructure, network, database, application and BI teams to ensure data quality and availability.

Environment: Hadoop YARN, Spark Core, Spark Streaming, Spark SQL, Scala, Python, Kafka, Hive, Sqoop, Amazon AWS, Elasticsearch, Impala, Cassandra, Tableau, Talend, Oozie, Jenkins, Cloudera, Oracle 12c, Linux

Confidential, Richmond VA

Big data/Spark/Hadoop Developer


  • Worked on analyzing the Hadoop cluster using different big data analytic tools including Pig, Hive and MapReduce.
  • Loaded data from Vertica to Hive using Sqoop.
  • Worked on installation and configuration of ZooKeeper to coordinate and monitor the cluster resources.
  • Implemented test scripts to support test driven development and continuous integration.
  • Worked on POCs with Apache Spark using Scala to implement Spark in the project.
  • Consumed data from Kafka using Apache Spark.
  • Loaded and transformed large sets of structured, semi-structured and unstructured data.
  • Involved in loading data from the Linux file system to HDFS.
  • Imported and exported data into HDFS and Hive using Sqoop.
  • Implemented partitioning, dynamic partitions and buckets in Hive.
  • Worked in creating HBase tables to load large sets of semi structured data coming from various sources.
  • Extended Hive and Pig core functionality with custom User Defined Functions (UDF), User Defined Table-Generating Functions (UDTF) and User Defined Aggregating Functions (UDAF) for Hive and Pig using Python.
  • Worked on Tableau for exposing data for further analysis and for transforming files from different analytical formats to text files.
  • Responsible for development, support and maintenance of the ETL (Extract, Transform and Load) processes using Informatica PowerCenter.
  • Experienced in running Hadoop streaming jobs to process terabytes of XML-format data.
  • Involved in scheduling the Oozie workflow engine to run multiple Hive and Pig jobs.
  • Experienced with performing CRUD operations in HBase.
  • Developed workflow in Oozie to automate the tasks of loading the data into HDFS and pre-processing with Pig.
  • Responsible for loading data files from various external sources like Oracle and MySQL into a staging area in MySQL databases.
  • Executed Hive queries on Parquet tables stored in Hive to perform data analysis to meet the business requirements.
  • Actively involved in code review and bug fixing for improving the performance.
  • Good experience in handling data manipulation using Python scripts.
  • Involved in developing, building, testing and deploying to the Hadoop cluster in distributed mode.
  • Created Linux shell scripts to automate the daily ingestion of IVR data.
  • Experienced in working with the Spark ecosystem using Spark SQL and Scala queries on different formats like text files and CSV files.
  • Launched and configured Amazon EC2 (AWS) cloud servers using AMIs (Linux/Ubuntu) and configured the servers for specified applications.
  • Developed scripts for build, deployment, maintenance and related tasks using Jenkins, Maven, Python and Bash.
  • Good experience in writing Spark applications using Scala and Java; developed Scala projects and executed them using spark-submit.
  • Experience in writing Scala functions, procedures, constructors and traits.
  • Helped the Analytics team with Aster queries using HCatalog.
  • Automated the History and Purge Process.
  • Created HBase tables to store various data formats of incoming data from different portfolios.
  • Created Pig Latin scripts to sort, group, join and filter the enterprise-wide data.
  • Developed the verification and control process for daily load.
  • Provided daily production support to monitor and troubleshoot Hadoop/Hive jobs.

Environment: Hadoop, HDFS, Pig, Apache Hive, Sqoop, Kafka, Apache Spark, Storm, Solr, Shell Scripting, HBase, Python, Kerberos, Agile, ZooKeeper, Maven, Ambari, Hortonworks, MySQL and Tableau.

Confidential, Atlanta GA

Big Data Hadoop Developer/Administrator


  • Gathered User requirements and designed technical and functional specifications.
  • Worked on analyzing the Hadoop cluster and different Big Data analytic tools including Pig, the HBase database and Sqoop.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Implemented a nine-node CDH3 Hadoop cluster on Red Hat Linux.
  • Involved in loading data from the Linux file system to HDFS.
  • Worked on installing cluster, commissioning and decommissioning of DataNode, NameNode recovery, capacity planning, and slots configuration.
  • Created HBase tables to store variable data formats of PII data coming from different portfolios.
  • Implemented a script to transmit Sys Prin information from Oracle to HBase using Sqoop.
  • Implemented best income logic using Pig scripts and UDFs.
  • Implemented test scripts to support test driven development and continuous integration.
  • Worked on tuning the performance of Pig queries.
  • Worked with application teams to install operating system, Hadoop updates, patches, version upgrades as required.
  • Responsible for managing data coming from different sources.
  • Involved in loading data from the UNIX file system to HDFS.
  • Loaded and transformed large sets of structured, semi structured and unstructured data.
  • Managed cluster coordination services through ZooKeeper.
  • Experienced in managing and reviewing Hadoop log files.
  • Job management using Fair Scheduler.
  • Migrated Teradata MLoads/BTEQs/Oracle SQL to HP Vertica SQL (VSQL).
  • Worked extensively on Vertica projection creation and optimization.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Migrated 100+ tables from Teradata to HP Vertica.
  • Responsible for cluster maintenance, added and removed cluster nodes, cluster monitoring and troubleshooting, managed and reviewed data backups, managed and reviewed Hadoop log files.
  • Installed Oozie workflow engine to run multiple Hive and Pig jobs.
  • Analyzed large data sets to determine the optimal way to aggregate and report on them.
  • Supported in setting up QA environment and updated configurations for implementing scripts with Pig and Sqoop.

Environment: Hadoop, HDFS, Pig, Sqoop, HBase, Shell Scripting, Ubuntu, Red Hat Linux.
