
Big Data Engineer Resume

Owings Mills, Maryland


  • 7+ years of professional IT experience, including 3+ years of Hadoop administration on Cloudera (CDH), Hortonworks (HDP), vanilla Hadoop, and MapR distributions, and 3+ years of experience with AWS, Kafka, Elasticsearch, DevOps, and Linux administration.
  • Proficient with Shell, Python, Ruby, YAML, Groovy scripting languages & Terraform.
  • Configured Elastic Load Balancing (ELB) for routing traffic between zones, and used Route53 with failover and latency options for high availability and fault tolerance.
  • Site Reliability Engineering responsibilities for a Kafka platform that scales to 2 GB/sec and 20 million messages/sec.
  • Analyzed data with Hive and Pig.
  • Combined views and reports into interactive dashboards in Tableau Desktop that were presented to Business Users, Program Managers, and End Users.
  • Used Bash and Python (including Boto3) to supplement automation provided by Ansible and Terraform, for tasks such as encrypting EBS volumes backing AMIs and scheduling Lambda functions for routine AWS tasks.
  • Experienced in authoring POM.xml files, performing releases with Maven release plugin, modernization of Java projects, and managing Maven repositories.
  • Configured Elasticsearch for log collection and Prometheus and CloudWatch for metric collection.
  • Branching, Tagging, Release Activities on Version Control Tools: SVN, GitHub.
  • Implemented and managed DevOps infrastructure architecture, including Terraform, Jenkins, Puppet, and Ansible; responsible for CI and CD infrastructure, process, and deployment strategy.
  • Experience architecting, designing, and implementing large-scale distributed data processing applications built on distributed key-value stores over Hadoop, HBase, Hive, MapReduce, YARN, and other Hadoop ecosystem components such as Hue, Oozie, Spark, Sqoop, Pig, and ZooKeeper.
  • Commissioned DataNodes as data grew and decommissioned them when hardware degraded.
  • Experience implementing NameNode high availability and Hadoop cluster capacity planning; experience in benchmarking and in backup and disaster recovery of NameNode metadata and of important and sensitive data residing on the cluster.
  • Experience creating S3 buckets and managing bucket policies, and using S3 and Glacier for storage, backup, and archiving in AWS.
  • Experience in setup and maintenance of auto-scaling AWS stacks.
  • Team player and self-starter with effective communication, motivation, and organizational skills, combined with attention to detail and a focus on business process improvement; hard worker able to meet deadlines on or ahead of schedule.


Big Data Tools: HDFS, MapReduce, YARN, Hive, Pig, Sqoop, Flume, Oozie, Kafka, Hortonworks, Ambari, Knox, Phoenix, Impala, Storm.

Hadoop Distribution: Cloudera Distribution of Hadoop (CDH).

Operating Systems: UNIX, Linux, Windows XP, Windows Vista, Windows 2003 Server

Servers: WebLogic Server, WebSphere, and JBoss.

Programming Languages: Java, PL/SQL, Shell Script, Perl, Python.

Tools: Interwoven TeamSite, GMS, BMC Remedy, Eclipse, Toad, SQL Server Management Studio, Jenkins, GitHub, Ranger, TestNG, JUnit.

Database: MySQL, NoSQL, Couchbase, InfluxDB, Teradata, HBase, MongoDB, Cassandra, Oracle.

Processes: Incident Management, Release Management, Change Management.


Big Data Engineer

Confidential, Owings Mills, Maryland


  • Installed the Apache Kafka cluster and Confluent Kafka open source in different environments.
  • Kafka open source or the Confluent version can be installed on Windows and Linux/Unix systems.
  • JDK 1.8 or later must be installed and made accessible across the entire box.
  • Download Apache Kafka open source and Apache ZooKeeper and configure them on the box where the cluster will run. Once both Kafka and ZooKeeper are up and running, topics can be created, and data can then be produced and consumed. To secure the cluster, plug in the security configuration with SSL encryption, SASL authentication, and ACLs.
  • Finally, creating backups, adding clients and configs, patching, and monitoring.
  • The initial design can start with a single-node or three-node cluster, adding nodes wherever required.
  • The required specifications are CPU cores: 24, RAM: 32/64 GB, and main memory: 500 GB (minimum) to 2 TB.
  • Its basic usage is as a distributed streaming platform for the functional flow of data in parallel processing.
  • Kafka readily replaces the traditional pub-sub model, offering fault tolerance, high throughput, and low latency.
  • Installed and developed different POCs for different application/infrastructure teams in both Apache Kafka and Confluent open source for multiple clients.
  • Installed, monitored, and maintained the clusters in all environments.
  • Installed single-node single-broker and multi-node multi-broker clusters, encrypted with SSL/TLS and authenticated with SASL/PLAIN, SASL/SCRAM, and SASL/GSSAPI (Kerberos).
  • Integrated topic-level security; the cluster is fully up and running 24/7.
  • Installed Confluent Enterprise on Docker and Kubernetes in an 18-node cluster.
  • Installed Confluent Kafka, applied security to it, and monitored it with Confluent Control Center.
  • Involved in clustering with Cloudera and Hortonworks without exposing ZooKeeper; provided the cluster to end users through Kafka Connect.
  • Set up cluster redundancy, used monitoring tools such as Yahoo Kafka Manager, and tuned performance to deliver data in real time with minimal latency.
  • Supported the Docker team in installing a multi-node Apache Kafka cluster and enabled security in the DEV environment.
  • Worked on disk space issues in the production environment by monitoring how fast space was filling and reviewing what was being logged, then created a long-term fix (minimizing Info, Debug, Fatal, and Audit logs).
  • Installed Kafka Manager to track consumer lag and monitor Kafka metrics; also used it for adding topics, partitions, etc.
  • Generated consumer group lag from Kafka using its API.
  • Set up an unauthenticated Kafka listener in parallel with a Kerberos (SASL) listener, and tested a non-authenticated (anonymous) user in parallel with a Kerberos user.
  • Installed Ranger in all environments as a second level of security for Kafka brokers.
  • Involved in Data Ingestion Process to Production cluster.
  • Installed Docker to run ELK, InfluxDB, and Kerberos.
  • Installed Confluent Kafka open source and enterprise editions on Kubernetes using Helm charts on a 10-node cluster, applied SASL/PLAIN and SASL/SCRAM security, and exposed the cluster for outside access.
  • Designed and configured topics in the new Kafka cluster in all environments.
  • Successfully secured the Kafka cluster with SASL/PLAIN, SASL/SCRAM, and SASL/GSSAPI (Kerberos).
  • Implemented Kafka security features using SSL without Kerberos. For more fine-grained security, set up Kerberos with users and groups to enable more advanced security features.
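
To illustrate the consumer group lag reporting mentioned in the list above, here is a minimal sketch of the underlying arithmetic: per-partition lag is the log end offset minus the group's committed offset. It does not call any Kafka API; the function name `consumer_group_lag`, the topic name, and the offset values are hypothetical, and treating a missing commit as offset 0 is a simplifying assumption.

```python
from typing import Dict, Optional, Tuple

# (topic, partition) pair used as a dictionary key.
TopicPartition = Tuple[str, int]


def consumer_group_lag(
    end_offsets: Dict[TopicPartition, int],
    committed_offsets: Dict[TopicPartition, Optional[int]],
) -> Dict[TopicPartition, int]:
    """Return per-partition lag: log end offset minus committed offset."""
    lag: Dict[TopicPartition, int] = {}
    for tp, end_offset in end_offsets.items():
        committed = committed_offsets.get(tp)
        if committed is None:
            # Assumption: no commit yet means nothing was consumed
            # and the partition log starts at offset 0.
            committed = 0
        # Clamp at zero so a stale end offset never yields negative lag.
        lag[tp] = max(end_offset - committed, 0)
    return lag


if __name__ == "__main__":
    # Hypothetical offsets for a two-partition topic.
    end = {("orders", 0): 120, ("orders", 1): 80}
    committed = {("orders", 0): 100}
    lag = consumer_group_lag(end, committed)
    print(lag, "total:", sum(lag.values()))
```

In practice the two offset maps would come from whichever Kafka client or admin tool is in use; only the subtraction is shown here.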

Hadoop Administration

Confidential, San Francisco, CA


  • Installed/configured/maintained Apache Hadoop and Cloudera Hadoop clusters for application development, along with Hadoop tools such as Hive, Pig, HBase, ZooKeeper, and Sqoop.
  • Worked on 4 Hadoop clusters for different teams, supporting 50+ users of the Hadoop platform, training users to make Hadoop easier to use, and keeping them updated on best practices.
  • Implemented Hadoop security on a Hortonworks cluster using Kerberos and two-way SSL.
  • Experience with Hortonworks, Cloudera CDH4, and CDH5 distributions.
  • Installed a Kerberos-secured Kafka cluster with no encryption on Dev and Prod, and set up Kafka ACLs on it.
  • Set up an unauthenticated Kafka listener in parallel with a Kerberos (SASL) listener, and tested a non-authenticated (anonymous) user in parallel with a Kerberos user.
  • Implemented security on a Hortonworks Hadoop cluster using Kerberos, working with the operations team to move a non-secured cluster to a secured cluster.
  • Contributed to building hands-on tutorials for the community to learn how to use Hortonworks Data Platform (powered by Hadoop) and Hortonworks Dataflow (powered by NiFi) covering categories such as Hello World, Real-World use cases, Operations.
  • Managed 350+ Nodes CDH cluster with 4 petabytes of data using Cloudera Manager and Linux RedHat 6.5.
  • Experienced with deployments, maintenance and troubleshooting applications on Microsoft Azure Cloud infrastructure.
  • Involved in creating a Spark cluster in HDInsight by creating Azure compute resources with Spark installed and configured.
  • Implemented Azure APIM modules for public-facing, subscription-based authentication, and implemented a circuit breaker for fatal system errors.
  • Experience in creating and configuring Azure Virtual Networks (VNets), subnets, DHCP address blocks, DNS settings, security policies, and routing.
  • Created Web App Services and deployed ASP.NET applications through Microsoft Azure Web App services. Created Linux virtual machines using VMware Virtual Center.
  • Responsible for software installation, configuration, software upgrades, backup and recovery, commissioning and decommissioning DataNodes, cluster setup, and daily cluster performance monitoring, keeping clusters healthy across Hadoop distributions (Hortonworks and Cloudera).
  • Worked with application teams to install operating systems, updates, patches, and version upgrades as required.
  • Responsible for importing and exporting data into HDFS and Hive.
  • Analyzed data using Hadoop components Hive and Pig.
  • Responsible for writing Pig scripts to process the data in the integration environment.
  • Responsible for setting up HBase and storing data in HBase.
  • Responsible for managing and reviewing Hadoop log files.
  • Responsible for running Hadoop streaming jobs to process terabytes of XML data.
  • Wrote MapReduce code to process and parse data from various sources and store the parsed data in HBase and Hive using HBase-Hive integration. Worked on YUM configuration and package installation through YUM.
  • Developed simple and complex MapReduce programs in Java for data analysis.
  • Designed and architected a new Hadoop cluster.
  • Installed Docker to run ELK, InfluxDB, and Kerberos.
  • Created a database on InfluxDB, worked on the interface created for Kafka, and checked the measurements in the databases.
  • Deployed Elasticsearch 5.3.0 and InfluxDB 1.2 on the Prod machine in a Docker container.
  • Tested all services, including Hadoop, ZooKeeper, Spark, HiveServer, and Hive Metastore.
  • Upgraded Elasticsearch from 5.3.0 to 5.3.2 following the rolling upgrade process, using Ansible to deploy new packages in the Prod cluster.

Big Data Engineer - Hadoop Administrator

Confidential, Philadelphia, PA


  • Responsible for implementation and support of the Enterprise Hadoop environment.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Used Scala functional programming concepts to develop business logic.
  • Developed Spark scripts using Scala shell commands as required.
  • Processed schema-oriented and non-schema-oriented data using Scala and Spark.
  • Designed and developed a system to collect data from multiple portals using Kafka and then process it using Spark.
  • Worked on MicroStrategy report development and analysis, mentoring and guiding analysis team members and troubleshooting complex reporting and analytical problems.
  • Extensively used filters, facts, consolidations, transformations, and custom groups to generate reports for business analysis.
  • Contributed to the design and development of MicroStrategy dashboards and interactive documents using MicroStrategy Web and Mobile.
  • Extracted data from SQL Server 2008 into data marts, views, and/or flat files for Tableau workbook consumption using T-SQL. Partitioned and queried the data in Hive for further analysis by the BI team.
  • Managed Tableau extracts on Tableau Server and administered Tableau Server.
  • Worked extensively on data extraction, transformation, and loading from Oracle to Teradata using BTEQ, FastLoad, and MultiLoad.
  • Extensively used the Teradata FastLoad/MultiLoad utilities to load data into tables.
  • Used Teradata SQL Assistant to build SQL queries.
  • Performed data reconciliation across various source systems and Teradata.
  • Involved in writing complex SQL queries using correlated sub queries, joins, and recursive queries.
  • Worked extensively on date manipulations in Teradata.
  • Extracted data from Oracle using SQL scripts, loaded it into Teradata using FastLoad/MultiLoad, and transformed it according to business transformation rules to insert/update data in data marts.
  • Handled Hadoop cluster installation, configuration, maintenance, monitoring, and troubleshooting, and certified environments for production readiness.
  • Experience in implementing Hadoop cluster capacity planning.
  • Involved in the installation of CDH5 and the upgrade from CDH4 to CDH5.
  • Upgraded Cloudera Manager from version 5.3 to 5.5.
  • Extensive experience in planning, installing, configuring, and administering Hadoop clusters for major Hadoop distributions such as Cloudera and Hortonworks.
  • Installed, upgraded, and managed Hadoop clusters on Hortonworks.
  • Hands-on experience using Cloudera and Hortonworks Hadoop distributions.
  • Created a POC on Hortonworks and suggested best practices for the HDP and HDF platforms and NiFi.
  • Set up Hortonworks infrastructure, from cluster configuration down to individual nodes.
  • Worked with release management technologies such as Jenkins, GitHub, GitLab, and Ansible.
  • Worked in a DevOps model with Continuous Integration and Continuous Deployment (CI/CD), automating deployments using Jenkins and Ansible.
  • Completed end-to-end design and development of an Apache NiFi flow that acts as the agent between the middleware team and the EBI team and executes all the actions mentioned above.
  • Responsible for onboarding new users to the Hadoop cluster (adding a home directory for each user and providing access to the datasets).
  • Helped the users in production deployments throughout the process.
  • Managed and reviewed Hadoop log files as part of administration for troubleshooting purposes. Communicated and escalated issues appropriately.
  • Continuously monitored and managed the Hadoop cluster through Ganglia and Nagios.
  • Installed the Oozie workflow engine to run multiple Hive and Pig jobs, which run independently based on time and data availability.
  • Involved in Storm batch-mode processing over massive data sets, analogous to a Hadoop job that runs as a batch process over a fixed data set.
  • Developed a data pipeline using Flume, Sqoop, Pig, and Java MapReduce to ingest customer behavioral data into HDFS for analysis.
  • Using Storm, created a topology that runs continuously over a stream of incoming data.
  • Integrated Hadoop with Active Directory and enabled Kerberos for Authentication.
  • Upgraded the Cloudera Hadoop ecosystems in the cluster using Cloudera distribution packages.
  • Performed stress and performance testing and benchmarking for the cluster.
  • Commissioned and decommissioned DataNodes in the cluster when problems occurred.
  • Debugged and resolved major issues with Cloudera Manager by working with the Cloudera team.
  • Monitored system activity, performance, and resource utilization.
  • Deep understanding of monitoring and troubleshooting mission-critical Linux machines.
  • Used Kafka to build real-time data pipelines between clusters.
  • Ran log aggregation, website activity tracking, and commit logs for distributed systems using Apache Kafka.
  • Designed and implemented Amazon Web Services. As a passionate advocate of AWS within Grace note, migrated from a physical data center environment to
  • Focused on high-availability, fault tolerance, and auto-scaling.
  • Managed critical bundles and patches on the production servers after successfully navigating through the testing phase in the test environments.
  • Managed disk file systems, server performance, user creation, file access permissions, and RAID configurations.
  • Integrated Apache Kafka for data ingestion.
  • Configured Domain Name System (DNS) for hostname to IP resolution.
  • Involved in data migration from Oracle database to MongoDB.
  • Queried and analyzed data from Cassandra for quick searching, sorting, and grouping.
  • Moved log files generated from various sources to HDFS through Flume for further processing.
  • Prepared operational testing scripts for log checks, backup and recovery, and failover.
  • Troubleshot and fixed issues at the user, system, and network levels using various tools and utilities.
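
As an illustration of the Hadoop cluster capacity planning mentioned in the list above, here is a minimal back-of-the-envelope sketch, not the method used on these clusters. It assumes HDFS's default replication factor of 3; the 25% headroom, the 48 TB of usable disk per DataNode, and the function names are hypothetical values chosen for the example.

```python
import math


def required_raw_storage_tb(
    data_tb: float,
    replication_factor: int = 3,  # HDFS default block replication
    headroom: float = 0.25,       # assumed fraction kept free for temp data and growth
) -> float:
    """Raw disk needed to hold data_tb of logical data on HDFS."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return data_tb * replication_factor / (1.0 - headroom)


def datanodes_needed(raw_tb: float, usable_tb_per_node: float) -> int:
    """Number of DataNodes needed, rounded up to a whole node."""
    return math.ceil(raw_tb / usable_tb_per_node)


if __name__ == "__main__":
    raw = required_raw_storage_tb(100)  # 100 TB of logical data
    # 48 TB usable per node is an assumed figure for illustration.
    print(raw, datanodes_needed(raw, usable_tb_per_node=48))
```

A real plan would also account for data growth rate, compression, and compute requirements, which this sketch leaves out.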

Big Data Operations Engineer - Consultant

Confidential, Indianapolis, IN


  • Cluster administration, releases, and upgrades: managed multiple Hadoop clusters with a highest capacity of 7 PB (400+ nodes) with PAM enabled; worked on the Hortonworks distribution.
  • Responsible for implementation and ongoing administration of Hadoop infrastructure.
  • Used the Hadoop cluster as a staging environment for data from heterogeneous sources in the data import process.
  • Configured high availability on the NameNode for the Hadoop cluster as part of the disaster recovery roadmap.
  • Configured Ganglia and Nagios to monitor the cluster; on call with EOC for support.
  • Worked on cloud architecture.
  • Performed both major and minor upgrades to the existing cluster, as well as rollbacks to the previous version.
  • Commissioned and decommissioned DataNodes, killed unresponsive TaskTrackers, and dealt with blacklisted TaskTrackers.
  • Implemented the Fair Scheduler on the JobTracker to allocate a fair share of resources to small jobs.
  • Maintained, audited, and built new clusters for testing purposes using Ambari and Hortonworks.
  • Created a POC on Hortonworks and suggested best practices for the HDP and HDF platforms and NiFi.
  • Set up Hortonworks infrastructure, from cluster configuration down to individual nodes.
  • Installed Ambari Server in the cloud.
  • Set up security using Kerberos and AD on Hortonworks clusters.
  • Designed and allocated HDFS quotas for multiple groups.
  • Configured Flume to efficiently collect, aggregate, and move large amounts of log data from many different sources to HDFS.
  • Manually upgraded from HDP 2.2 to HDP 2.3 as part of software patches and upgrades.
  • Scripted Hadoop package installation and configuration to support fully automated deployments.
  • Configured rack awareness on HDP.
  • Added new nodes to an existing cluster and recovered from a NameNode failure.
  • Instrumental in building scalable distributed data solutions using Hadoop eco-system.
  • Added new DataNodes when needed and rebalanced the cluster.
  • Handled importing of data from various data sources, performed transformations using Hive and MapReduce, loaded data into HDFS, and extracted data from MySQL into HDFS using Sqoop.
  • Worked on database backup and recovery, database connectivity, and security.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team.
  • Analyzed utilization based on the running statistics of Map and Reduce tasks.
  • Changed cluster configuration properties based on the volume of data being processed and cluster performance.
  • Provided input to development on the efficient use of resources such as memory and CPU.

Hadoop Admin/ Linux Administrator

Confidential, Chicago, IL


  • Installation and configuration of Linux for new build environment.
  • Handled day-to-day user access and permissions, and installed and maintained Linux servers.
  • Created volume groups, logical volumes, and partitions on Linux servers and mounted file systems.
  • Installed and configured Cloudera CDH4 in the testing environment.
  • Resolved user-submitted tickets and P1 issues, troubleshooting and resolving errors.
  • Balanced HDFS manually to decrease network utilization and increase job performance.
  • Responsible for building scalable distributed data solutions using Hadoop.
  • Performed major and minor upgrades to the Hadoop cluster.
  • Upgraded the Cloudera Hadoop ecosystems in the cluster using Cloudera distribution packages.
  • Used Sqoop to import and export data between HDFS and RDBMS.
  • Performed stress and performance testing and benchmarking for the cluster.
  • Commissioned and decommissioned DataNodes in the cluster when problems occurred.
  • Debugged and resolved major issues with Cloudera Manager by working with the Cloudera team.
  • Installed CentOS using Preboot Execution Environment (PXE) boot and the Kickstart method on multiple servers, including remote installation of Linux using PXE boot.
  • Monitored system activity, performance, and resource utilization.
  • Developed and optimized the physical design of MySQL database systems.
  • Deep understanding of monitoring and troubleshooting mission critical Linux machines.
  • Responsible for maintenance of RAID groups and LUN assignments per agreed design documents.
  • Extensive use of LVM, creating volume groups and logical volumes.
  • Performed Red Hat Package Manager (RPM) and YUM package installations, patching, and other server management.
  • Tested and performed enterprise-wide installation, configuration, and support for Hadoop using the MapR distribution.
  • Set up the cluster and installed all ecosystem components through MapR and manually through the command line in the lab cluster.
  • Set up automated processes to archive/clean unwanted data on the cluster, in particular on the NameNode and Secondary NameNode.
  • Involved in estimating and setting up a Hadoop cluster on Linux.
  • Prepared Pig scripts to validate the Time Series Rollup Algorithm.
  • Responsible for supporting and troubleshooting MapReduce and Pig jobs and maintaining incremental loads on a daily, weekly, and monthly basis.
  • Implemented Oozie workflows for MapReduce, Hive, and Sqoop actions.
  • Channeled MapReduce outputs as required using Partitioners.
  • Performed scheduled backup and necessary restoration.
  • Built and maintained scalable data solutions using the Hadoop ecosystem and other open source components such as Hive and HBase.
  • Monitored data streaming between web sources and HDFS.

Linux/Unix Administrator



  • Experience installing, upgrading and configuring RedHat Linux 4.x, 5.x, 6.x using Kickstart Servers and Interactive Installation
  • Responsible for creating and managing user accounts, security, rights, disk space and process monitoring in Solaris, CentOS and Redhat Linux
  • Performed administration and monitored job processes using associated commands
  • Managed routine system backups, job scheduling, and cron jobs.
  • Maintained and troubleshot network connectivity.
  • Managed patch configuration, version control, and service packs, and reviewed connectivity issues related to security problems.
  • Configured DNS, NFS, FTP, remote access, security management, and server hardening.
  • Installed, upgraded, and managed packages via RPM and YUM package management.
  • Logical Volume Management maintenance.
  • Experience administering, installing, configuring, and maintaining Linux.
  • Created Linux virtual machines using VMware Virtual Center; administered VMware Infrastructure Client 3.5 and vSphere 4.1.
  • Installed firmware upgrades and kernel patches, and performed system configuration and performance tuning on Unix/Linux systems.
  • Installed Red Hat Linux 5/6 using Kickstart servers and interactive installation.
  • Supported an infrastructure environment comprising RHEL and Solaris.
  • Installation, configuration, and OS upgrades on RHEL 5.x/6.x/7.x and SUSE 11.x/12.x.
  • Implemented and administered VMware ESX 4.x, 5.x, and 6 for running Windows, CentOS, SUSE, and Red Hat Linux servers on development and test servers.
  • Created, extended, reduced, and administered Logical Volume Manager (LVM) volumes in the RHEL environment.
  • Responsible for large-scale Puppet implementation and maintenance, including creating, testing, and implementing Puppet manifests.
