Hadoop Engineer Resume
Neenah, WI
SUMMARY:
- Over 8 years of professional experience spanning analysis, design, development, integration, deployment and maintenance of quality software applications using Java/J2EE technologies and Big Data Hadoop technologies.
- Over 4 years of working experience in data analysis and data mining using the Big Data stack.
- Proficiency in Java, Hadoop MapReduce, Pig, Hive, Oozie, Sqoop, Flume, HBase, Scala, Spark, Kafka, Storm, Impala and NoSQL databases.
- High exposure to Big Data technologies and the Hadoop ecosystem, with in-depth understanding of MapReduce and the Hadoop infrastructure.
- Excellent knowledge of Hadoop architecture and ecosystem components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode, YARN and the MapReduce programming paradigm.
- Good exposure to NoSQL databases such as column-oriented HBase and Cassandra.
- Extensive experience working with semi-structured and unstructured data by implementing complex MapReduce programs using design patterns.
- Extensive experience writing custom MapReduce programs for data processing and UDFs for both Hive and Pig in Java.
- Strong experience analyzing large data sets by writing Pig scripts and Hive queries.
- Extensive experience working with structured data using HiveQL, including join operations, writing custom UDFs and optimizing Hive queries.
- Experience importing and exporting data between HDFS and relational databases using Sqoop.
- Experienced in job workflow scheduling and monitoring tools like Oozie.
- Experience with Apache Flume for collecting, aggregating and moving large volumes of data from various sources such as web servers and Telnet sources.
- Hands-on experience with major Big Data components such as Apache Kafka, Apache Spark, ZooKeeper and Avro.
- Experienced in implementing unified data platforms using Kafka producers/consumers and pre-processing using Storm topologies.
- Experienced in migrating MapReduce programs to Spark RDD transformations and actions to improve performance (see the sketch at the end of this summary).
- Experience using Big Data with ETL tools such as Talend Open Studio and Informatica.
- Strong experience architecting real-time streaming applications and batch-style, large-scale distributed computing applications using tools such as Spark Streaming, Spark SQL, Kafka, Flume, MapReduce and Hive.
- Experience using various Hadoop distributions (Cloudera, Hortonworks, MapR, etc.) to fully implement and leverage new Hadoop features.
- Worked on custom Pig Loader and Storage classes to handle a variety of data formats such as JSON and compressed CSV.
- Good knowledge of Amazon AWS concepts like EMR and EC2 web services, which provide fast and efficient processing of Big Data.
- Experienced in working with scripting technologies such as Python and Unix shell scripts.
- Experience with source control repositories such as SVN, CVS and Git.
- Strong experience working in UNIX/Linux environments and writing shell scripts.
- Skilled at building and deploying multi-module applications using Maven and Ant with CI servers like Jenkins.
- Adequate knowledge and working experience in Agile & Waterfall methodologies.
- Excellent problem-solving and analytical skills.
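The following is a minimal sketch of the kind of MapReduce-to-Spark migration referenced above: the classic word count rewritten with Spark RDD transformations and actions in Java. The application name and input/output paths are placeholders.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCountSpark {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCountSpark");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // flatMap/mapToPair/reduceByKey replace the mapper and reducer of the
            // original MapReduce job; saveAsTextFile is the action that triggers execution.
            JavaRDD<String> lines = sc.textFile(args[0]);
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
            counts.saveAsTextFile(args[1]);
        }
    }
}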
TECHNICAL SKILLS:
Hadoop/Big Data Technologies: HDFS, MapReduce, Sqoop, Flume, Pig, Hive, Oozie, Impala, Apache NiFi, ZooKeeper, Cloudera Manager, Ambari.
NoSQL Databases: MongoDB, Cassandra
Real-Time/Stream Processing: Apache Storm, Apache Spark
Distributed Message Broker: Apache Kafka
Monitoring and Reporting: Tableau, Custom shell scripts
Hadoop Distributions: Hortonworks, Cloudera, MapR
Build Tools: Maven, SQL Developer
Programming & Scripting: Java, C, SQL, Shell Scripting, Python
Databases: Oracle, MySQL, MS SQL Server
Web Dev. Technologies: HTML, XML, JSON, CSS, jQuery, JavaScript, AngularJS
Tools & Utilities: Eclipse, MQ Explorer, RFHUtil, SSRS, Aqua Data Studio, XMLSpy, ETL (Talend)
Operating Systems: Linux, Unix, Mac OS-X, Windows 10, Windows 8, Windows 7, Windows Server 2008/2003
PROFESSIONAL EXPERIENCE:
Confidential, Neenah, WI
Hadoop Engineer
Responsibilities:
- Worked on cluster installation, commissioning and decommissioning of DataNodes, NameNode recovery and capacity planning in the cloud environment (Microsoft Azure).
- Managed a 29-node Hadoop cluster running the Hortonworks Data Platform (HDP 2.6) distribution using Ambari, leveraging the Microsoft Azure cloud environment.
- Used Cloudbreak for provisioning and managing Apache Hadoop clusters in the cloud (Microsoft Azure); as part of the Hortonworks platform, Cloudbreak makes it easy to provision, configure and elastically grow HDP clusters on cloud infrastructure.
- Monitored the cluster for performance, networking and data integrity issues; responsible for troubleshooting issues in the execution of MapReduce jobs by inspecting and reviewing log files.
- Formulated procedures for installation of Hadoop patches, updates and version upgrades.
- Installed and configured Tableau Desktop to connect to the Hortonworks Hive framework (database), which contains the clickstream data from Mixpanel and Google Analytics.
- Used Apache NiFi to ingest data from the Mixpanel API onto HDFS in raw JSON format.
- Developed optimal strategies for distributing the clickstream data over the cluster by importing it into HDFS through the Mixpanel and Google Analytics APIs.
- Developed custom shell scripts to connect to the Mixpanel and Google Analytics APIs and used crontab for scheduling.
- Designed and implemented Hive queries and functions for evaluation, filtering, loading and storing of data.
- Developed Hive tables on top of the JSON data consumed from the Mixpanel API and stored them in ORC format for optimized querying in Tableau (see the sketch at the end of this section).
- Used custom shell scripts to convert the Google Analytics data format (dic) to JSON and then loaded it onto HDFS for further analytics.
- Worked on the Hive database to provide both historical and live clickstream data from Mixpanel and Google Analytics to Tableau for historical and live reporting.
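As an illustration of the ORC-backed Hive tables mentioned above, the following is a minimal sketch that creates and loads such a table over a HiveServer2 JDBC connection; the host, database, table and column names are placeholders, and the raw JSON staging table is assumed to already exist.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MixpanelOrcTable {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Connect to HiveServer2; host, database and credentials are placeholders.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/clickstream", "hive", "");
             Statement stmt = conn.createStatement()) {
            // ORC-backed reporting table for optimized Tableau queries.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS mixpanel_events_orc ("
              + " event_name STRING, distinct_id STRING, event_time TIMESTAMP)"
              + " STORED AS ORC");
            // Populate it from the raw JSON staging table.
            stmt.execute(
                "INSERT OVERWRITE TABLE mixpanel_events_orc"
              + " SELECT event_name, distinct_id, event_time FROM mixpanel_events_raw");
        }
    }
}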
Environment: Hortonworks Data Platform (HDP), Hortonworks DataFlow (HDF), Hadoop, HDFS, Spark, Hive, MapReduce, Apache NiFi, Tableau Desktop, Linux, Microsoft Azure, Cloudbreak.
Confidential, Jacksonville, FL
Big Data Developer
Responsibilities:
- Collected and aggregated large amounts of data from different sources such as COSMA (Confidential Onboard System Management Agent), BOMR (Back Office Message Router), ITCM (Interoperable Train Control Messaging), and onboard mobile and network devices from the PTC (Positive Train Control) network using Apache NiFi, and stored the data in HDFS for analysis.
- Used Apache NiFi for ingestion of data from IBM MQ (message queues).
- Implemented NiFi flow topologies to perform cleansing operations before moving data into HDFS.
- Developed Java MapReduce programs to transform ITCM log data into a structured format.
- Developed optimal strategies for distributing the ITCM log data over the cluster; imported and exported the stored log data into HDFS and Hive using Apache NiFi.
- Developed custom code to read messages from IBM MQ and push them onto NiFi queues.
- Worked with Apache NiFi flows to convert raw XML data into JSON and Avro.
- Implemented Hive generic UDFs to incorporate business logic into Hive queries (see the sketch at the end of this section).
- Configured Spark Streaming to receive real-time data from IBM MQ and store the streamed data in HDFS.
- Analyzed the bandwidth data from the locomotives using HiveQL to extract the daily bandwidth consumed by each locomotive across different carriers (AT&T, Verizon or Wi-Fi).
- Designed and implemented Hive queries and functions for evaluation, filtering, loading and storing of data.
- Installed and configured Tableau Desktop to connect, through the Hortonworks ODBC connector, to the Hortonworks Hive framework (database) containing the bandwidth data from the locomotives for further analytics.
- Collected and provided locomotive communication usage data by locomotive, channel, protocol and application.
- Analyzed the locomotive communication usage from COSMA to monitor inbound/outbound traffic bandwidth by communication channel.
- Worked on the back-end Hive database to provide both historical and live bandwidth data from the locomotives to Tableau for historical and live reporting.
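The Hive generic UDFs mentioned above followed the standard GenericUDF contract; the following is a minimal, hypothetical sketch (the function and class names are placeholders) that null-safely trims a string column.

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;
import org.apache.hadoop.io.Text;

// Registered in Hive with ADD JAR plus CREATE TEMPORARY FUNCTION trim_udf AS '...'.
public class TrimUDF extends GenericUDF {
    private StringObjectInspector inputOI;

    @Override
    public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
        if (arguments.length != 1) {
            throw new UDFArgumentException("trim_udf expects a single string argument");
        }
        inputOI = (StringObjectInspector) arguments[0];
        return PrimitiveObjectInspectorFactory.writableStringObjectInspector;
    }

    @Override
    public Object evaluate(DeferredObject[] arguments) throws HiveException {
        Object value = arguments[0].get();
        if (value == null) {
            return null; // Preserve NULL semantics expected by Hive queries.
        }
        return new Text(inputOI.getPrimitiveJavaObject(value).trim());
    }

    @Override
    public String getDisplayString(String[] children) {
        return "trim_udf(" + children[0] + ")";
    }
}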
Environment: Hortonworks Data Platform (HDP), Hortonworks DataFlow (HDF), Hadoop, HDFS, Spark, Hive, MapReduce, Apache NiFi, Tableau Desktop, Linux.
Confidential, Houston, TX
Big Data Systems Engineer
Responsibilities:
- Installed and configured a three-node cluster with the Hortonworks Data Platform (HDP 2.3) on HP infrastructure and management tooling.
- Worked with HP Intelligent Provisioning and the Smart Storage array to set up the disks for the installation.
- Used the Big Data benchmarking tool BigBench to benchmark the three-node cluster.
- Configured BigBench and ran it on one of the nodes in the cluster.
- Ran the benchmark for datasets of 5 GB, 10 GB, 50 GB, 100 GB and 1 TB.
- Worked with the structured, semi-structured and unstructured data generated by BigBench, running workloads that use Spark's machine learning libraries.
- Configured PAT (Performance Analysis Tool) to dump the benchmark results into automated MS Excel charts.
- Used Ambari Server to monitor the cluster while the benchmark was running.
- Worked with different teams to install operating system updates, Hadoop patches and Hortonworks version upgrades as required.
- Collected performance metrics from the Hadoop nodes; PAT (Performance Analysis Tool) was used to analyze resource utilization and draw automated charts in MS Excel (see the sketch at the end of this section).
- Worked with various performance monitoring tools such as top, dstat, atop and Ambari Metrics.
- Collected the results from the tests on the different datasets (5 GB, 10 GB, 50 GB, 100 GB and 1 TB) and fed them into PAT for further analysis of resource utilization.
- Worked with HPE Insight CMU (Cluster Management Utility) for managing the cluster and with HPE Vertica for SQL on Hadoop.
- Worked on configuring the performance tuning parameters used during the benchmark.
- Used Tableau Desktop to create visual dashboards of CPU utilization, disk I/O, memory, network I/O and query times obtained from the PAT automated MS Excel charts.
- Fed the benchmark results, in the form of automated charts, into Tableau Desktop for further data analytics.
- Installed and configured Tableau Desktop on one of the three nodes to connect to the Hortonworks Hive framework (database) through the Hortonworks ODBC connector for further analytics of the cluster.
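As a rough illustration of the kind of post-processing done on the collected metrics, the following hypothetical sketch averages CPU utilization from a CSV export; the file layout (header row, utilization percentage in the second column) is an assumption, not the actual PAT format.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class CpuUtilizationSummary {
    public static void main(String[] args) throws IOException {
        // Assumed layout: header row, then one sample per line with CPU % in column 2.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        double total = 0;
        int samples = 0;
        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split(",");
            total += Double.parseDouble(fields[1].trim());
            samples++;
        }
        System.out.printf("Average CPU utilization over %d samples: %.2f%%%n",
                samples, total / samples);
    }
}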
Environment: Hortonworks Data Platform (HDP), Hadoop, HDFS, Spark, Hive, MapReduce, BigBench, Tableau Desktop, Linux.
Confidential, Monroeville, PA
Sr. Big Data/Hadoop Developer
Responsibilities:
- Collected and aggregated large amounts of web log data from different sources such as web servers, mobile and network devices using Apache Kafka, and stored the data in HDFS for analysis.
- Implemented Storm builder topologies to perform cleansing operations before moving data into Cassandra.
- Developed Java MapReduce programs to transform log data into a structured format.
- Developed optimal strategies for distributing the web log data over the cluster; imported and exported the stored web log data into HDFS and Hive using Sqoop.
- Implemented Hive generic UDFs to incorporate business logic into Hive queries.
- Configured Spark Streaming to receive real-time data from Kafka and store the streamed data in HDFS.
- Worked with Spark to create structured data from the pool of unstructured data received.
- Converted Hive queries to Spark SQL, using Parquet files as the storage format (see the sketch at the end of this section).
- Implemented Spark RDD transformations and actions to migrate MapReduce algorithms.
- Analyzed the web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration and the most visited pages on the website.
- Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Map-Reduce, Pig, Hive, and Sqoop) as well as system specific jobs (such as Java programs and shell scripts).
- Designed and implemented Hive queries and functions for evaluation, filtering, loading and storing of data.
- Familiar with ETL (Talend) and data integration tooling designed for IT and BI analysts to schedule.
- Created Hive tables and worked on them using HiveQL.
- Monitored workload, job performance and capacity planning using Cloudera Manager.
- Involved in building applications using Maven and integrating with CI servers like Jenkins to build jobs.
- Worked collaboratively with all levels of business stakeholders to architect, implement and test Big Data based analytical solutions from disparate sources.
- Involved in complete SDLC of project including requirements gathering, design documents, development, testing and production environments.
- Involved in Agile methodologies, daily scrum meetings, sprint planning.
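The Hive-to-Spark SQL conversion mentioned above generally takes the shape below; this is a minimal sketch in Java in which the paths, view name, columns and query are illustrative placeholders.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WebLogSparkSql {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("WebLogSparkSQL")
                .enableHiveSupport()
                .getOrCreate();

        // Read the Parquet-backed web log data and register it as a temporary view
        // so the former Hive query can run unchanged through Spark SQL.
        Dataset<Row> logs = spark.read().parquet("/data/weblogs/parquet");
        logs.createOrReplaceTempView("web_logs");

        Dataset<Row> dailyVisitors = spark.sql(
            "SELECT log_date, COUNT(DISTINCT visitor_id) AS unique_visitors "
          + "FROM web_logs GROUP BY log_date");

        // Persist the result back to HDFS in Parquet format.
        dailyVisitors.write().mode("overwrite").parquet("/data/weblogs/daily_visitors");
        spark.stop();
    }
}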
Environment: Hadoop, HDFS, Map Reduce, Hive, Sqoop, Spark, Scala, Kafka, Oozie, Storm, Cassandra, Maven, Shell Scripting, CDH.
Confidential, Springfield, IL
Big Data/Hadoop Developer
Responsibilities:
- Involved in creating Hive tables and in loading and analyzing data using Hive queries.
- Developed simple to complex MapReduce jobs using Hive and Pig.
- Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig.
- Migrated complex MapReduce programs to in-memory Spark processing using transformations and actions.
- Mentored analysts and the test team in writing Hive queries.
- Developed multiple MapReduce jobs in Java for data cleaning and pre-processing.
- Worked with Linux systems and RDBMS databases on a regular basis to ingest data using Sqoop.
- Used ETL (Talend) for extraction, transformation and loading of data from multiple sources.
- Implemented Kafka consumers to move data from Kafka partitions into Cassandra for near-real-time analysis (see the sketch at the end of this section).
- Used Cassandra Query Language (CQL) to implement CRUD operations on Cassandra tables.
- Developed and maintained complex outbound notification applications that run on custom architectures, using diverse technologies including Core Java, J2EE, SOAP, XML, JMS, JBoss and web services.
- Experienced in running Hadoop Streaming jobs to process terabytes of XML-format data.
- Loaded and transformed large sets of structured, semi-structured and unstructured data.
- Generated the datasets and loaded them into the Hadoop ecosystem.
- Assisted in exporting analyzed data to relational databases using Sqoop.
- Created and maintained technical documentation for launching Hadoop clusters and for executing Hive queries, Pig scripts and Sqoop jobs.
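A minimal sketch of the Kafka-to-Cassandra path referenced above; the broker address, topic, keyspace, table and column names are placeholders, and the Cassandra table is assumed to already exist.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaToCassandra {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "events-to-cassandra");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Cluster cluster = Cluster.builder().addContactPoint("localhost").build();
             Session session = cluster.connect("analytics")) {

            consumer.subscribe(Collections.singletonList("events"));
            PreparedStatement insert = session.prepare(
                "INSERT INTO events_by_id (event_id, payload) VALUES (?, ?)");

            // Poll Kafka and write each record into Cassandra for near-real-time queries.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    session.execute(insert.bind(record.key(), record.value()));
                }
            }
        }
    }
}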
Environment: Hortonworks, Hadoop, HDFS, Spark, Oozie, Pig, Hive, MapReduce, Sqoop, Cassandra, Linux.