
Data Engineer/Hadoop Developer Resume

New York, NY


  • Over 10 years of IT experience as a Developer, Designer & QA Engineer with cross-platform integration experience using the Hadoop ecosystem, Java and functional automation
  • Hands-on experience installing, configuring and architecting Hadoop and Hortonworks clusters and services: HDFS, MapReduce, YARN, Pig, Hive, Oozie, Flume, HBase, Spark and Sqoop
  • Responsible for writing MapReduce programs
  • Experienced in loading data into Hive partitions, creating buckets in Hive and developing MapReduce jobs to automate the transfer of data from HBase
  • Experienced in developing Java UDFs for Hive and Pig
  • Experienced in NoSQL databases like HBase, MongoDB and Cassandra, including writing advanced queries and sub-queries
  • Scheduled all Hadoop/Hive/Sqoop/HBase jobs using Oozie
  • Set up clusters on Amazon EC2 and S3, including automating the setup and extension of clusters in AWS
  • Practiced Agile Scrum methodology, contributed to TDD, CI-CD and all aspects of SDLC
  • Experienced in defining detailed application software test plans, including organization, participant, schedule, test and application coverage scope
  • Gathered and defined functional and UI requirements for software applications
  • Experienced in real-time analytics with Apache Spark RDDs, DataFrames and the Streaming API
  • Used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data
  • Experienced in integrating Hadoop with Kafka and in uploading clickstream data to HDFS.
  • Expert in using Kafka as a publish-subscribe messaging system.
  • Experienced with Docker and Kubernetes on multiple cloud providers, from helping developers build and containerize their application (CI/CD) to deploying either on public or private cloud.
  • Installed and configured OpenShift platform in managing Docker containers and Kubernetes Clusters.
  • Applied DevOps practices for microservices using Kubernetes as the orchestrator.
  • Created templates and wrote Bash, Ruby, Python and PowerShell scripts to automate tasks.
  • Good knowledge of and hands-on experience with monitoring tools like Splunk and Nagios.
  • Knowledge of network protocols and services such as FTP, SSH, HTTP, HTTPS, TCP/IP and DNS, as well as VPNs and firewall groups.
  • Complete application builds for Web Applications, Web Services, Windows Services, Console Applications, and Client GUI applications.
  • Experienced in troubleshooting and automated deployment to web and application servers like WebSphere, WebLogic, JBOSS and Tomcat.
  • Experienced in integrating deployments with multiple build systems and providing an application model that handles multiple projects.
  • Hands-on experience integrating REST APIs with cloud environments to access resources.


Hadoop/Big Data: Hadoop, MapReduce, HDFS, ZooKeeper, Kafka, Hive, Pig, Sqoop, Oozie, Flume, YARN, HBase, Spark with Scala

NoSQL Databases: HBase, Cassandra, MongoDB

Languages: Java, Python, UNIX shell scripts

Java/J2EE Technologies: Applets, Swing, JDBC, JNDI, JSON, JSTL

Frameworks: MVC, Struts, Spring, Hibernate

Operating Systems: Red Hat Linux, Ubuntu Linux and Windows XP/Vista/7/8

Web Technologies: HTML, DHTML, XML

Web/Application servers: Apache Tomcat, WebLogic, JBoss

Databases: SQL Server, MySQL

IDE: Eclipse, IntelliJ IDEA



Confidential, New York, NY


  • Administered, provisioned, patched and maintained Cloudera Hadoop clusters on Linux
  • Designed and developed applications on the data lake to transform data according to business users' requirements for analytics.
  • Developed shell scripts to perform Data Quality validations like Record count, File name consistency, Duplicate File and for creating Tables and views.
  • Created views that mask PHI columns so that unauthorized teams cannot see the data in those columns.
  • Used the Parquet file format to improve storage and performance for publish tables.
  • Worked with NoSQL databases like HBase, creating HBase tables to store the audit data of the RAWZ and APPZ tables.
  • Responsible for managing data coming from different sources; involved in HDFS maintenance and loading of structured and unstructured data.
  • Developed a RESTful API using Scala for tracking open source projects on GitHub and computing in-process metrics for those projects.
  • Developed analytical components using Scala, Spark, Apache Mesos and Spark Streaming.
  • Experience in using the Docker container system with the Kubernetes integration
  • Developed a Web Application using Java with the Google Web Toolkit API with PostgreSQL
  • Used R for prototype on a sample data exploration to identify the best algorithmic approach and then wrote Scala scripts using spark machine learning module.
  • Developed MapReduce/Spark Python modules for machine learning & predictive analytics in Hadoop on AWS. Implemented a Python-based distributed random forest via Python streaming.
  • Utilized Spark, Scala, Hadoop, HBase, Kafka, Spark Streaming, Caffe, TensorFlow, MLlib and Python with a broad variety of machine learning methods, including classification, regression and dimensionality reduction.
  • Excellent understanding of Hadoop architecture and its components, such as HDFS, HBase, Hive, JobTracker, TaskTracker, NameNode, DataNode and the MapReduce programming paradigm
  • Built a Kafka-Spark-Cassandra simulator in Scala for Met stream, a big data consultancy, along with Kafka-Spark-Cassandra prototypes.
  • Designed a data analysis pipeline in Python, using Amazon Web Services such as S3, EC2 and Elastic Map Reduce
  • Implemented applications with Scala along with Akka and Play framework.
  • Expert in implementing advanced procedures like text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
  • Worked on a Python- and Scala-based analytics system with ML libraries.
  • Worked with NoSQL platforms, with an extensive understanding of relational databases versus NoSQL platforms. Created and worked on large data frames with a schema of more than 300 columns.
  • Ingested data into Amazon S3 using Sqoop and applied data transformations using Python scripts.
  • Created Hive tables and loaded and analyzed data using Hive scripts. Implemented partitioning and dynamic partitions in Hive.
  • Deployed and analyzed large chunks of data using Hive as well as HBase.
  • Queried data using Spark SQL on top of the PySpark engine.
  • Used Amazon EMR to run PySpark jobs in the cloud.
  • Created Hive tables to store various data formats of PII data coming from the raw Hive tables.
  • Developed Sqoop jobs to import/export data from RDBMS to S3 data store.
  • Designed and implemented PySpark UDFs for evaluation, filtering, loading and storing of data.
  • Fine-tuned PySpark applications/jobs to improve efficiency and overall processing time for the pipelines.
  • Knowledge of writing Hive queries and running scripts in Tez mode to improve performance on Hortonworks Data Platform.
  • Used Microservices architecture, with Spring Boot based services interacting through a combination of REST and Spring Boot.
  • Built Spring Boot microservices for the delivery of software products across the enterprise
  • Created ALBs, ELBs and EC2 instances to deploy the applications to the cloud environment.
  • Provided service discovery for all microservices using the Spring Cloud Kubernetes project
  • Developed fully functional responsive modules based on business requirements using Scala with Akka.
  • Developed new listeners for producers and consumers for both RabbitMQ and Kafka
  • Used microservices with Spring Boot interacting through a combination of REST and Apache Kafka message brokers.
  • Built servers such as DHCP, PXE with Kickstart, DNS and NFS and used them to build infrastructure in a Linux environment. Automated build, testing and integration with Ant, Maven and JUnit.

Environment: Apache Hive, HBase, PySpark, Python, Agile, StreamSets, Bitbucket, Cloudera, Shell Scripting, Amazon EMR, Amazon S3, PyCharm, Jenkins, Scala, Java.


Confidential, Stamford, CT


  • Administered, provisioned, patched and maintained Cloudera Hadoop clusters on Linux
  • Experienced in Spark Streaming, creating RDDs and applying transformations and actions
  • Developed Spark applications using Scala for easy Hadoop transitions.
  • Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API
  • Developed Spark code using Scala and Spark-SQL for faster processing and testing.
  • Implemented Spark programs using PySpark and analyzed the SQL scripts and designed the solutions
  • Worked directly with the Big Data Architecture Team which created the foundation of this Enterprise Analytics initiative in a Hadoop-based Data Lake.
  • Created multi-node Hadoop and Spark clusters on AWS instances to generate terabytes of data and stored it in HDFS on AWS.
  • Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames and saved it in Parquet format in HDFS.
  • Developed a data pipeline using Flume, Sqoop, Pig and Java MapReduce to ingest customer behavioral data and financial histories into HDFS for analysis.
  • Involved in collecting and aggregating large amounts of log data using Apache Flume and staging data in HDFS for further analysis.
  • Upgraded the Hadoop cluster from CDH4.7 to CDH5.2 and worked on installing cluster, commissioning & decommissioning of Data Nodes, Name Node recovery, capacity planning, and slots configuration.
  • Developed Spark scripts to import large files from Amazon S3 buckets and imported data from different sources like HDFS/HBase into Spark RDDs.
  • Involved in migrating ETL processes from Oracle to Hive to test ease of data manipulation; imported and exported data from Oracle and DB2 into HDFS and Hive using Sqoop.
  • Installed Cloudera Manager and CDH, installed the JCE policy file, created a Kerberos principal for the Cloudera Manager Server and enabled Kerberos using the wizard.
  • Developed Spark jobs using Scala and Python on top of Yarn/MRv2 for interactive and Batch Analysis.
  • Monitored the cluster for performance, networking and data integrity issues; responsible for troubleshooting MapReduce job execution issues by inspecting and reviewing log files.
  • Created 25+ Linux Bash scripts for users, groups, data distribution, capacity planning, and system monitoring.
  • Installed the OS and administered the Hadoop stack with the CDH5 (with YARN) Cloudera distribution, including configuration management, monitoring, debugging and performance tuning.
  • Supported MapReduce programs and distributed applications running on the Hadoop cluster; scripted Hadoop package installation and configuration to support fully automated deployments.
  • Migrated an existing on-premises application to AWS, using services like EC2 and S3 for processing and storing large data sets; worked with Elastic MapReduce and set up a Hadoop environment on AWS EC2 instances.
  • Worked with systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
  • Performed maintenance, monitoring, deployments and upgrades across the infrastructure supporting all Hadoop clusters; worked on Hive for further analysis and for transforming files from different analytical formats to text files.
  • Created Hive external tables, loaded data into them and queried the data using HQL; worked with application teams to install operating system and Hadoop updates, patches and version upgrades as required.
  • Worked and learned a great deal from AWS Cloud services like EC2, S3, EBS, RDS and VPC.
  • Monitored the Hadoop cluster and maintained it by adding and removing nodes using tools like Nagios, Ganglia and Cloudera Manager.
  • Worked on Hive to expose data for further analysis and to transform files from different analytical formats to text files.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala, initially done using Python (PySpark)

Environment: Hadoop, MapReduce, Hive, Pig, Sqoop, Python, Spark, Spark Streaming, Spark SQL, AWS EMR, AWS S3, AWS Redshift, Scala, PySpark, MapR, Java, Oozie, Flume, HBase, Nagios, Ganglia, Hue, Cloudera Manager, ZooKeeper, Cloudera, Oracle, Kerberos and Red Hat 6.5




  • Administered, provisioned, patched and maintained Cloudera Hadoop clusters on Linux
  • Experienced in Spark Streaming, creating RDDs and applying transformations and actions
  • Developed Spark applications using Scala for easy Hadoop transitions.
  • Used Spark and Spark SQL to read Parquet data and create tables in Hive using the Scala API
  • Developed Spark code using Scala and Spark-SQL for faster processing and testing.
  • Implemented Spark programs using PySpark and analyzed the SQL scripts and designed the solutions
  • Loaded data pipelines from web servers and Teradata using Sqoop with Kafka and Spark Streaming API.
  • Developed Kafka pub-sub clients, Cassandra clients and Spark components on HDFS and Hive
  • Populated HDFS and HBase with huge amounts of data using Apache Kafka.
  • Configured, deployed and maintained multi-node Dev and Test Kafka clusters.
  • Managed and scheduled Spark jobs on a Hadoop cluster using Oozie.
  • Experienced with scripting languages such as Python and shell.
  • Developed various Python scripts to find vulnerabilities in SQL queries through SQL injection testing, permission checks and performance analysis.
  • Tested Apache Tez, an extensible framework for building high-performance batch and interactive data processing applications, on Pig and Hive jobs.
  • Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez
  • Experienced in building data pipelines using Kafka and Akka to handle terabytes of data.
  • Wrote shell scripts that run multiple Hive jobs to incrementally update different Hive tables, which are used to generate Tableau reports for business use.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDD, Scala and Python.
  • Worked on Spark SQL: created DataFrames by loading data from Hive tables, prepared the data and stored it in AWS S3.
  • Used Spark Streaming APIs to perform transformations and actions on the fly to build the common learner data model, which gets data from Kafka in near real time and persists it into Cassandra.
  • Created custom UDFs for Pig and Hive to bring Python methods and functionality into Pig Latin and HQL (HiveQL)
  • Extensively worked on Text, ORC, Avro and Parquet file formats and compression techniques like Snappy, Gzip and Zlib.
  • Experience in NoSQL Column-Oriented Databases like Cassandra and its Integration with Hadoop cluster.
  • Created, managed and used policies for S3 buckets and Glacier for storage and backup on AWS.
  • Wrote ETL jobs to read from web APIs using REST and HTTP calls and loaded into HDFS using Java and Talend.

Environment: Hadoop 2.0, YARN Resource Manager, SQL, Python, Kafka, Hive, Sqoop 1.4.6, Qlik Sense, Tableau, Oozie, Jenkins, Linux, Scala 2.12, Spark 2.4.3.


Confidential, New York, NY


  • Worked on installing Kafka on Virtual Machine and created topics for different users
  • Installed ZooKeeper, brokers, Schema Registry and Control Center on multiple machines.
  • Developed SSL security layers, set up ACL/SSL security for users and assigned them multiple topics
  • Worked on a Hadoop cluster and used the data querying tool Hive to store and retrieve data.
  • Involved in the complete Software Development Life Cycle (SDLC) while developing applications.
  • Reviewed and managed Hadoop log files by consolidating logs from multiple machines using Flume.
  • Developed Oozie workflow for scheduling ETL process and Hive Scripts.
  • Used Apache NiFi to copy data from the local file system to HDFS.
  • Worked with teams to analyze anomaly detection and data ratings.
  • Implemented custom input format and record reader to read XML input efficiently using SAX parser.
  • Involved in writing queries in SparkSQL using Scala. Worked with Splunk to analyze and visualize data.
  • Integrated Cassandra to provide metadata resolution for network entities
  • Experienced in Spark RDD operations and optimized transformations and actions
  • Involved in working with Impala for data retrieval process.
  • Exported data from Impala to Tableau reporting tool, created dashboards on live connection.
  • Designed multiple Python packages that were used within a large ETL process used to load 2TB of data from an existing Oracle database into a new PostgreSQL cluster
  • Loaded data from Linux file system to HDFS and vice-versa
  • Developed UDFs using both DataFrames/SQL and RDDs in Spark for data aggregation queries, writing results back into OLTP systems through Sqoop.
  • Applied ETL methods and data warehouse tools for reporting and analysis
  • Used CSVExcelStorage to parse files with different delimiters in Pig.
  • Performed advanced procedures like text analytics and processing, using the in-memory computing capabilities of Spark using Scala.
  • Developed multiple MapReduce jobs in Java to clean datasets.
  • Developed code to write canonical model JSON records from numerous input sources to Kafka queues.
  • Streamed data into Apache Ignite by setting up caches for efficient data analysis.
  • Collected log data from web servers and integrated it into HDFS using Flume.
  • Developed UNIX shell scripts for creating the reports from Hive data.
  • Prepared Avro schema files for generating Hive tables, created the Hive tables, loaded data into them and queried the data using HQL
  • Installed and Configured Hadoop cluster using AWS for POC purposes.

Environment: Hadoop MapReduce 2 (YARN), NiFi, HDFS, PIG, Hive, Flume, Cassandra, Eclipse, Sqoop, Spark, Splunk, Maven, Cloudera, Linux shell scripting


Confidential, New York, NY


  • Developed NiFi workflows to automate the data movement between different Hadoop systems.
  • Configured, deployed and maintained multi-node Dev and Test Kafka clusters.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Used Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
  • Developed Scala scripts and UDFs using both DataFrames/SQL/Datasets and RDD/MapReduce in Spark 1.6 for data aggregation and queries, writing data back into the OLTP system through Sqoop.
  • Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
  • Imported large datasets from DB2 to Hive Table using Sqoop
  • Implemented Apache Pig scripts to load data from and store data into Hive.
  • Partitioned and bucketed Hive tables and compressed data with Snappy to load data into Parquet Hive tables from Avro Hive tables
  • Involved in running all the Hive scripts through Hive, Impala and Hive on Spark, and some through Spark SQL
  • Worked and learned a great deal from AWS Cloud services like EC2, S3, EBS, RDS and VPC.
  • Responsible for implementing ETL process through Kafka-Spark-HBase Integration as per the requirements of customer facing API
  • Worked on Batch processing and real-time data processing on Spark Streaming using Lambda architecture
  • Developed and maintained workflow scheduling jobs in Oozie for importing data from RDBMS to Hive
  • Utilized Spark Core, Spark Streaming and Spark SQL API for faster processing of data instead of using MapReduce in Java
  • Responsible for data extraction and data integration from different data sources into Hadoop Data Lake by creating ETL pipelines Using Spark, MapReduce, Pig, and Hive.
  • Fetched live stream data from DB2 into HBase tables using Spark Streaming and Apache Kafka.
  • Loaded data into Spark RDDs and performed in-memory computation to generate the output response.
  • Used Spark for interactive queries, processing of streaming data and integration with MongoDB
  • Wrote various Pig scripts to clean up the ingested data and created partitions for the daily data.
  • Developed Spark programs with Scala to process the complex unstructured and structured data sets
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Scala and Python.
  • Analyzed the SQL scripts and designed the solution to implement using PySpark.
  • Involved in converting MapReduce programs into Spark transformations using Spark RDD in Scala.
  • Used Oozie workflows to coordinate Pig and Hive scripts

Environment: Hadoop, HDFS, Pig, Sqoop, Shell Scripting, Ubuntu, Linux Red Hat, Spark, Scala, Hortonworks, Cloudera Manager, Apache Yarn.




  • Responsible for implementation, administration and management of Hadoop infrastructures
  • Evaluated Hadoop infrastructure requirements and designed/deployed solutions (high availability, big data clusters); involved in cluster monitoring and troubleshooting Hadoop issues
  • Worked with application teams to install OSs and Hadoop updates, patches, version upgrades as required
  • Helped maintain and troubleshoot UNIX and Linux environment
  • Analyzed and evaluated system security threats and safeguards
  • Developed Pig program for loading and filtering the streaming data into HDFS using Flume.
  • Experienced in handling data from different datasets, joining and preprocessing them using Pig join operations.
  • Developed HBase data model on top of HDFS data to perform real time analytics using Java API.
  • Developed different kinds of custom filters and used predefined filters on HBase data using the API.
  • Imported and exported data from Teradata to HDFS and vice-versa.
  • Strong understanding of Hadoop ecosystem components such as HDFS, MapReduce, HBase, ZooKeeper, Pig, Hadoop Streaming, Sqoop, Oozie and Hive
  • Implemented counters on HBase data to count total records in different tables.
  • Experienced in handling Avro data files by passing schema into HDFS using Avro tools and Map Reduce.
  • Worked on custom Pig Loaders and Storage classes to work with a variety of data formats such as JSON, Compressed CSV, etc.
  • Used Amazon Web Services to perform big data analytics.
  • Implemented secondary sorting to sort reducer output globally in MapReduce.
  • Implemented a data pipeline by chaining multiple mappers using ChainMapper.
  • Created Hive Dynamic partitions to load time series data
  • Handled different types of joins in Hive, such as map joins, bucket map joins and sorted bucket map joins.
  • Created tables, partitions and buckets and performed analytics using Hive ad-hoc queries.
  • Experienced in importing/exporting data between HDFS/Hive and relational databases and Teradata using Sqoop.
  • Handled continuous streaming data from different sources using Flume and set destination as HDFS.
  • Integrated Spring schedulers with the Oozie client as beans to handle cron jobs.
  • Experience with CDH distribution and Cloudera Manager to manage and monitor Hadoop clusters
  • Actively participated in software development lifecycle (scope, design, implement, deploy, test), including design and code reviews.
  • Involved in story-driven agile development methodology and actively participated in daily scrum meetings.
  • Worked on the Spring Framework for multi-threading.

Environment: Hadoop, HDFS, MapReduce, Hive, Pig, Sqoop, RDBMS/DB, Flat files, Teradata, MySQL, CSV, Avro data files, Java, J2EE.
