Hadoop Monitoring Best Practices

How to Monitor the $H!T out of Hadoop: Developing a comprehensive open approach to monitoring Hadoop clusters

From 3 to 3000 Nodes

Hardware/software failures are common

Redundant Components: DataNode, TaskTracker

Non-redundant Components: NameNode, JobTracker, SecondaryNameNode

Fast Evolving Technology (Best Practices?)

Nagios

Red/Yellow/Green Alerts, Escalations

De facto standard, widely deployed

Text-based configuration

Web Interface

Pluggable with shell scripts/external apps

Return 0 - OK
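
The "Return 0 - OK" convention can be sketched as a tiny plugin skeleton. This is a minimal sketch, not any shipped check: the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin codes, while the function names and thresholds are made up for illustration.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a status using upper thresholds (hypothetical helper)."""
    if value is None:
        return UNKNOWN  # could not measure anything
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def report(name, value, warn, crit):
    """Return (exit_code, one-line status text) the way a plugin would print it."""
    status = evaluate(value, warn, crit)
    return status, f"{name} {LABELS[status]} - value={value}"
```

A real plugin would print the status line and then call sys.exit() with the returned code; Nagios reads the exit code, not the text.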

Cacti

Performance Graphing System

RRD/RRA Front End

Slick Web Interface

Template System for Graph Types

Pluggable

SNMP input

Shell script/external program

hadoop-cacti-jtg

JMX fetching code w/ (kick off) scripts

Cacti templates for Hadoop

Premade Nagios Check Scripts

Helper/Batch/automation scripts

Apache License

NameNode & SecNameNode

Hardware RAID

8 GB RAM

1x QUAD CORE

DerbyDB (hive) on SecNameNode

JobTracker

8GB RAM

1x QUAD CORE

Slave (hadoopdata1-XXXX)

JBOD 8x 1TB SATA Disk

RAM 16GB

2x Quad Core

Nagios (install) DAG RPMs

Cacti (install) Several RPMs

Liberal network access to the cluster

X nodes * Y services = less sleep

Define a policy

Wake Me Ups (SMS)

Don't Wake Me Ups (EMAIL)

Review (Daily, Weekly, Monthly)

NameNode

Disk Full (Big Big Headache)

RAID Array Issues (failed disk)

JobTracker

SecNameNode

You do not realize it is not working until it is too late

Or Wake someone else up

DataNode

Warning: currently a failed disk will take down the DataNode (see Jira)

TaskTracker

Hardware

Bad Disk (Start RMA)

Slaves are expendable (up to a point)

Start With the Basics

Ping, Disk

Add Hadoop Specific Alarms

check_data_node

Add JMX Graphing

NameNodeOperations

Add JMX Based alarms

FilesTotal > 1,000,000 or LiveNodes < 50%
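
The JMX-based alarm above boils down to a couple of comparisons. A minimal sketch, assuming the values have already been fetched over JMX by some other code: FilesTotal, LiveNodes and the two thresholds come from the slide, while the function name, the total_nodes argument and the return shape are hypothetical.

```python
def jmx_alarm(files_total, live_nodes, total_nodes,
              max_files=1_000_000, min_live_ratio=0.5):
    """Return (nagios_exit_code, list_of_problems) for already-fetched JMX values."""
    problems = []
    if files_total > max_files:
        problems.append(f"FilesTotal {files_total} > {max_files}")
    # Guard against division by zero on an empty or misreported cluster.
    if total_nodes > 0 and live_nodes / total_nodes < min_live_ratio:
        problems.append(f"LiveNodes {live_nodes}/{total_nodes} below {min_live_ratio:.0%}")
    return (2 if problems else 0), problems
```

For example, jmx_alarm(1_500_000, 10, 10) reports one problem (too many files) with exit code 2.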

Nagios (All Nodes)

Host up (Ping check)

Disk % Full

SWAP > 85%

* Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
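
The disk and swap checks above are simple enough to sketch. This assumes a Linux node where swap figures come from text in the /proc/meminfo format; the function names are hypothetical and the 85% default is the threshold from the slide.

```python
import shutil


def disk_percent_full(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def swap_percent_used(meminfo_text):
    """Compute swap usage from /proc/meminfo-style text (values in kB)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if key.startswith("Swap") and parts:
            fields[key.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured, nothing to alarm on
    return 100.0 * (total - fields.get("SwapFree", 0)) / total


def swap_alarm(meminfo_text, threshold=85.0):
    """True when swap usage is above the threshold."""
    return swap_percent_used(meminfo_text) > threshold
```

On a live node the text would come from reading /proc/meminfo; keeping the parser separate from the file read makes it easy to test.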

Cacti (All Nodes)

CPU (full CPU)

RAM/SWAP

Network

Disk Usage

hpacucli: not a Street Fighter move

Alerts on RAID events (NameNode)

Disk failed

Rebuilding

JBOD (DataNode)

Failed Drive

Drive Errors

Dell, SUN, Vendor Specific Tools

X Nodes * Y Checks = Lots of work

About 3 Nodes into the process

Wait!!! I need some interns!!!

Solution: S.I.C.C.T. (Semi-Intelligent Configuration Cloning Tools)

(I made that up)

(for this presentation)
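
The cloning idea can be sketched in a few lines: stamp out one host definition and Y service definitions per node from a template. This is an illustration of the approach, not the actual tools. The hadoopdata prefix and check_data_node come from the slides; the generic-host and generic-service template names, the check_ping arguments and the function name are assumptions.

```python
HOST_TEMPLATE = """define host {{
    use         {template}
    host_name   {name}
    address     {name}
}}
"""

SERVICE_TEMPLATE = """define service {{
    use                  {template}
    host_name            {name}
    service_description  {description}
    check_command        {command}
}}
"""


def clone_config(prefix, count, checks,
                 host_template="generic-host",
                 service_template="generic-service"):
    """Generate Nagios object definitions for nodes prefix1..prefixN.

    `checks` is a list of (service_description, check_command) pairs.
    """
    blocks = []
    for i in range(1, count + 1):
        name = f"{prefix}{i}"
        blocks.append(HOST_TEMPLATE.format(template=host_template, name=name))
        for description, command in checks:
            blocks.append(SERVICE_TEMPLATE.format(
                template=service_template, name=name,
                description=description, command=command))
    return "\n".join(blocks)


if __name__ == "__main__":
    print(clone_config("hadoopdata", 3, [
        ("PING", "check_ping!100.0,20%!500.0,60%"),
        ("DataNode", "check_data_node"),  # assumes a matching command definition exists
    ]))
```

Three nodes and two checks produce three host blocks and six service blocks, which is the X * Y arithmetic without the interns.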

Nagios answers: IS IT RUNNING?

Text-based configuration

Cacti answers: HOW WELL IS IT RUNNING?

Web-based configuration

php-cli tools

Ping, Disk !!!!!!Done!!!!!!

check_data_node

Add JMX Graphing

NameNodeOperations

Hadoop Components with a Web Interface

NameNode 50070

JobTracker 50030

TaskTracker 50060

DataNode 50075

check_http + regex = simple + effective

Component Failure

(Future) Newer Hadoop will have XML status
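
The check_http + regex idea, sketched outside of Nagios: fetch the component's web page and look for a pattern. The ports are the ones listed above; the pattern to look for depends on what your Hadoop version actually renders, so the one in the usage note is only a placeholder, and the function names are hypothetical.

```python
import re
import urllib.request


def check_page(html, pattern):
    """Return a Nagios-style (exit_code, message) for a page body and a regex."""
    if re.search(pattern, html):
        return 0, "OK - pattern found"
    return 2, "CRITICAL - pattern not found"


def check_web_ui(host, port, pattern, path="/", timeout=10):
    """Fetch http://host:port/path and apply check_page to the body."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError, timeouts, refused connections
        return 2, f"CRITICAL - {url} unreachable: {exc}"
    return check_page(html, pattern)
```

Something like check_web_ui("namenode.example.com", 50070, r"NameNode") would cover the NameNode; the host name and pattern here are placeholders to adapt.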

Ping, Disk (Done)

check_data_node (Done)

Add JMX Graphing

NameNodeOperations

Enable JMX

Import Templates

Start With the Basics !!!!!!Done!!!!!

Ping, Disk

Add Hadoop Specific Alarms !Done!

check_data_node

Add JMX Graphing !Done!

NameNodeOperations

hadoop-cacti-jtg is flexible

extend fetch classes

Don't call output()

Write your own check logic

url, user, pass, object specified from CLI

wantedVariables, wantedOperations by inheritance

fetch() and output() are provided
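
The extension pattern described above, sketched in Python purely for illustration. hadoop-cacti-jtg's real classes are not reproduced here; every class, attribute and method name below is hypothetical and only mirrors the idea: the base class provides fetch() and output(), a subclass declares the variables it wants, and an alarm subclass skips output() in favor of its own check logic.

```python
class JmxFetch:
    """Hypothetical base class: subclasses list the attributes they want."""

    wanted_variables = ()

    def __init__(self, reader):
        # `reader` stands in for the JMX connection: attribute name -> value.
        self.reader = reader

    def fetch(self):
        """Collect every wanted variable into a dict."""
        return {name: self.reader(name) for name in self.wanted_variables}

    def output(self, values):
        """Space-separated name:value pairs, the shape a graphing script prints."""
        return " ".join(f"{k}:{v}" for k, v in values.items())


class FilesTotalCheck(JmxFetch):
    """Alarm variant: reuse fetch(), skip output(), apply custom check logic."""

    wanted_variables = ("FilesTotal",)

    def check(self, limit=1_000_000):
        values = self.fetch()
        if values["FilesTotal"] > limit:
            return 2, f"CRITICAL - FilesTotal {values['FilesTotal']} > {limit}"
        return 0, "OK - FilesTotal within limit"
```

The same fetch code then serves both the graphing side (call output()) and the alarm side (call check()).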

Start With the Basics !DONE!

Ping, Disk

Add Hadoop Specific Alarms !DONE!

check_data_node

Add JMX Graphing !DONE!

NameNodeOperations

Add JMX Based alarms !DONE!

File System Growth

Number of Files

Number of Blocks

Ratios

Utilization

CPU/Memory

Email (nightly)

dfsadmin

JMX Coming to JobTracker and TaskTracker (0.21)

Collect and Graph Jobs Running

Collect and Graph Map / Reduce per node

Profile Specific Jobs in Cacti?