How to monitor the $H!T out of Hadoop: Developing a comprehensive open approach to monitoring Hadoop clusters

Hadoop Monitoring Best Practices

DESCRIPTION

Monitoring Hadoop with Cacti and Nagios

1. How to monitor the $H!T out of Hadoop

  • Developing a comprehensive open approach to monitoring Hadoop clusters

2. Relevant Hadoop Information

  • From 3 to 3000 Nodes
  • Hardware/Software failures are common
  • Redundant Components: DataNode, TaskTracker
  • Non-redundant Components: NameNode, JobTracker, SecondaryNameNode
  • Fast-Evolving Technology (Best Practices?)

3. Monitoring Software

  • Nagios
    • Red/Yellow/Green Alerts, Escalations
    • De facto Standard, Widely deployed
    • Text-based configuration
    • Web Interface
    • Pluggable with shell scripts/external apps
      • Return 0 - OK
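
Nagios decides a check's state from the plugin's exit code. 0 is OK, as the slide notes, and the standard plugin convention continues with 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN. The following is a minimal sketch of that contract; the function names `classify` and `status_line` and the threshold values are illustrative assumptions, not part of Nagios.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measured value to a Nagios state (a higher value is worse)."""
    if value is None:
        return UNKNOWN  # the measurement could not be taken
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def status_line(name, value, warn, crit):
    """Return (exit_code, message); a real plugin prints the message and exits with the code."""
    code = classify(value, warn, crit)
    return code, f"{name} {LABELS[code]} - value={value}"


# Hypothetical thresholds: warn at 80, critical at 90.
code, message = status_line("DISK", 91, warn=80, crit=90)
print(code, message)
```

A real plugin would finish with `sys.exit(code)` so that Nagios sees the state.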

4. Cacti

  • Performance Graphing System
  • RRD/RRA Front End
  • Slick Web Interface
  • Template System for Graph Types
  • Pluggable
    • SNMP input
    • Shell script / external program
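
When Cacti runs a script as a data input method with several output fields, it reads a single line of space-separated `name:value` pairs from the script's standard output. A small sketch of a formatter for that line is shown here; the metric names in `sample` are made-up placeholders.

```python
def format_for_cacti(metrics):
    """Render a dict as one line of space-separated name:value pairs."""
    return " ".join(f"{name}:{value}" for name, value in metrics.items())


# Placeholder values; a real script would collect these from the system.
sample = {"files_total": 120345, "blocks_total": 98002}
print(format_for_cacti(sample))
```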

6. hadoop-cacti-jtg

  • JMX Fetching Code w/ (kick off) scripts
  • Cacti templates For Hadoop
  • Premade Nagios Check Scripts
  • Helper/Batch/automation scripts
  • Apache License

7. Hadoop JMX

8. Sample Cluster P1

  • NameNode & SecNameNode
    • Hardware RAID
    • 8 GB RAM
    • 1x QUAD CORE
    • DerbyDB (hive) on SecNameNode
  • JobTracker
    • 8GB RAM
    • 1x QUAD CORE

9. A Sample Cluster p2

  • Slave (hadoopdata1-XXXX)
    • JBOD 8x 1TB SATA Disk
    • RAM 16GB
    • 2x Quad Core

10. Prerequisites

  • Nagios (install): DAG RPMs
  • Cacti (install): Several RPMs
  • Liberal network access to the cluster

11. Alerts & Escalations

  • X nodes * Y Services = less Sleep
  • Define a policy
    • Wake Me Ups (SMS)
    • Don't Wake Me Ups (EMAIL)
    • Review (Daily, Weekly, Monthly)

12. Wake Me Ups

  • NameNode
    • Disk Full (Big Big Headache)
    • RAID Array Issues (failed disk)
  • JobTracker
  • SecNameNode
    • You do not realize it is not working until it is too late

13. Don't Wake Me Ups

  • Or Wake someone else up
  • DataNode
    • Warning: currently a failed disk will take down the DataNode (see Jira)
  • TaskTracker
  • Hardware
    • Bad Disk (Start RMA)
  • Slaves are expendable (up to a point)
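
The policy from the last three slides can be written down as a small routing table. This is only a sketch under assumptions: the function name `channel_for`, the `review` fallback for unlisted components, and the 50% escalation point for failed slaves are illustrative choices, not something the slides prescribe.

```python
# Non-redundant components: page someone (SMS).
WAKE_ME_UP = {"NameNode", "JobTracker", "SecondaryNameNode"}
# Redundant or expendable components: email is enough.
DONT_WAKE_ME_UP = {"DataNode", "TaskTracker", "Hardware"}


def channel_for(component, failed_fraction=0.0, escalate_at=0.5):
    """Pick a notification channel for a failing component.

    failed_fraction is the share of slaves currently failing; slaves are
    expendable only up to a point, so past escalate_at (an assumed value)
    even a slave failure becomes a wake-me-up.
    """
    if component in WAKE_ME_UP:
        return "sms"
    if component in DONT_WAKE_ME_UP:
        return "sms" if failed_fraction >= escalate_at else "email"
    return "review"  # assumed default: leave it for the periodic review


print(channel_for("NameNode"))
print(channel_for("DataNode"))
print(channel_for("DataNode", failed_fraction=0.6))
```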

14. Monitoring Battle Plan

  • Start With the Basics
    • Ping, Disk
  • Add Hadoop Specific Alarms
    • check_data_node
  • Add JMX Graphing
    • NameNodeOperations
  • Add JMX Based alarms
    • FilesTotal > 1,000,000 or LiveNodes < 50%

15. The Basics Nagios

  • Nagios (All Nodes)
    • Host up (Ping check)
    • Disk % Full
    • SWAP > 85%
  • * Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
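
The disk and swap checks above come down to a few lines of arithmetic. The sketch below assumes a Linux host, where swap figures can be read from `/proc/meminfo`; the function names are illustrative, and the 85% default simply mirrors the slide.

```python
import shutil


def disk_percent_full(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def swap_percent_used(meminfo_text):
    """Compute swap usage from the text of /proc/meminfo (Linux only)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # first token is the number
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - fields.get("SwapFree", 0)) / total


def swap_alarm(meminfo_text, threshold=85.0):
    """True when swap usage exceeds the threshold."""
    return swap_percent_used(meminfo_text) > threshold


# Example with canned input instead of reading /proc/meminfo.
sample = "SwapTotal:    1000 kB\nSwapFree:      100 kB\n"
print(swap_percent_used(sample), swap_alarm(sample))
```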

16. The Basics Cacti

  • Cacti (All Nodes)
    • CPU (full CPU)
    • RAM/SWAP
    • Network
    • Disk Usage

17. Disk Utilization

18. RAID Tools

  • hpacucli: not a Street Fighter move
    • Alerts on RAID events (NameNode)
      • Disk failed
      • Rebuilding
    • JBOD (DataNode)
      • Failed Drive
      • Drive Errors
  • Dell, SUN, Vendor Specific Tools

19. Before you jump in

  • X Nodes * Y Checks = Lots of work
  • About 3 Nodes into the process
    • Wait!!! I need some interns!!!
  • Solution: S.I.C.C.T. (Semi-Intelligent-Configuration-Cloning-Tools)
    • (I made that up)
    • (for this presentation)
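
One way to picture configuration cloning is a template that is stamped out once per host. The sketch below generates Nagios service definitions this way. The host names follow the `hadoopdata1-XXXX` pattern from the sample cluster, while the command name `check_remote_datanode` and the function `clone_services` are hypothetical, chosen only for illustration.

```python
SERVICE_TEMPLATE = """define service {{
    service_description {description}
    use                 generic-service
    host_name           {host}
    check_command       {command}
}}
"""


def clone_services(hosts, description, command):
    """Return one Nagios service definition per host, joined into one string."""
    return "\n".join(
        SERVICE_TEMPLATE.format(host=host, description=description, command=command)
        for host in hosts
    )


# Three slaves for brevity; a real cluster would loop over every DataNode.
hosts = [f"hadoopdata{i}" for i in range(1, 4)]
print(clone_services(hosts, "check_remote_datanode", "check_remote_datanode!50075"))
```

The output can be redirected into a `.cfg` file that Nagios includes, so adding a node becomes a one-line change to the host list.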

20. Nagios

  • Answers: IS IT RUNNING?
  • Text-based configuration

21. Cacti

  • Answers: HOW WELL IS IT RUNNING?
  • Web-based configuration
    • php-cli tools

22. Monitoring Battle Plan Thus Far

  • Start With the Basics
    • Ping, Disk !!!!!!Done!!!!!!
  • Add Hadoop Specific Alarms
    • check_data_node
  • Add JMX Graphing
    • NameNodeOperations
  • Add JMX Based alarms
    • FilesTotal > 1,000,000 or LiveNodes < 50%

23. Add Hadoop Specific Alarms

  • Hadoop Components with a Web Interface
    • NameNode 50070
    • JobTracker 50030
    • TaskTracker 50060
    • DataNode 50075
  • check_http + regex = simple + effective
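
The idea behind check_http plus a regex is simply: fetch the component's status page and look for a string that only appears when it is healthy. The sketch below shows the same logic in Python rather than the actual check_http plugin; the function names are illustrative, and the host, port, and path in the usage note are taken from this deck.

```python
import re
import urllib.request


def body_matches(body, pattern):
    """True if the regular expression is found anywhere in the page body."""
    return re.search(pattern, body) is not None


def check_web_ui(host, port, path, pattern, timeout=10):
    """Fetch a status page and return a Nagios-style (exit_code, message)."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return 2, f"CRITICAL - {url} unreachable: {exc}"
    if body_matches(body, pattern):
        return 0, f"OK - {url} matched /{pattern}/"
    return 2, f"CRITICAL - {url} did not match /{pattern}/"
```

For the NameNode this might be called as `check_web_ui("hadoopname1", 50070, "/dfshealth.jsp", "NameNode")`.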

24. nagios_check_commands.cfg

  • Component Failure
  • (Future) Newer Hadoop will have XML status

define command {
    command_name check_remote_namenode
    command_line $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}

define service {
    service_description check_remote_namenode
    use generic-service
    host_name hadoopname1
    check_command check_remote_namenode!50070
}

25. Monitoring Battle Plan

  • Start With the Basics
    • Ping, Disk (Done)
  • Add Hadoop Specific Alarms
    • check_data_node (Done)
  • Add JMX Graphing
    • NameNodeOperations
  • Add JMX Based alarms
    • FilesTotal > 1,000,000 or LiveNodes < 50%

26. JMX Graphing

  • Enable JMX
  • Import Templates

27. JMX Graphing

28. JMX Graphing

29. JMX Graphing

31. Standard Java JMX

32. Monitoring Battle Plan Thus Far

  • Start With the Basics !!!!!!Done!!!!!
    • Ping, Disk
  • Add Hadoop Specific Alarms !Done!
    • check_data_node
  • Add JMX Graphing !Done!
    • NameNodeOperations
  • Add JMX Based alarms
    • FilesTotal > 1,000,000 or LiveNodes < 50%

33. Add JMX based Alarms

  • hadoop-cacti-jtg is flexible
    • extend fetch classes
    • Don't call output()
    • Write your own check logic

34. Quick JMX Base Walkthrough

  • url, user, pass, object specified from CLI
  • wantedVariables, wantedOperations by inheritance
  • fetch() and output() provided
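
The shape of that design can be sketched as follows. hadoop-cacti-jtg itself is Java, so this Python version only mirrors the pattern described on the slide; the class names, the `read_attribute` callable that stands in for a real JMX connection, and the object name string are all hypothetical.

```python
class JMXBase:
    """Base class: connection details come from the caller, variables from subclasses."""

    wanted_variables = ()  # subclasses override this

    def __init__(self, url, user, password, object_name, read_attribute):
        self.url = url
        self.user = user
        self.password = password
        self.object_name = object_name
        # read_attribute(object_name, variable) stands in for a JMX query.
        self.read_attribute = read_attribute

    def fetch(self):
        """Collect every wanted variable into a dict."""
        return {
            name: self.read_attribute(self.object_name, name)
            for name in self.wanted_variables
        }

    def output(self):
        """Format fetched values as a Cacti-style line of name:value pairs."""
        return " ".join(f"{k}:{v}" for k, v in self.fetch().items())


class NameNodeStats(JMXBase):
    # Only the list of variables changes; fetch() and output() are inherited.
    wanted_variables = ("FilesTotal", "BlocksTotal")


# Fake reader so the sketch runs without a JMX endpoint.
fake_values = {"FilesTotal": 5, "BlocksTotal": 7}
stats = NameNodeStats("url", "user", "pass", "example:type=NameNode",
                      lambda obj, var: fake_values[var])
print(stats.output())
```

A Nagios-oriented subclass would call `fetch()`, skip `output()`, and apply its own thresholds to the returned values.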

35. Extend for NameNode

36. Extend for Nagios

37. Monitoring Battle Plan

  • Start With the Basics !DONE!
    • Ping, Disk
  • Add Hadoop Specific Alarms !DONE!
    • check_data_node
  • Add JMX Graphing !DONE!
    • NameNodeOperations
  • Add JMX Based alarms !DONE!
    • FilesTotal > 1,000,000 or LiveNodes < 50%

38. Review

  • File System Growth
    • Size
    • Number of Files
    • Number of Blocks
    • Ratios
  • Utilization
    • CPU/Memory
    • Disk
  • Email (nightly)
    • FSCK
    • DFSADMIN
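
A nightly email of this kind could be assembled roughly as below. The sketch assumes the `hadoop` command is on the PATH and that a mail relay listens on `localhost`; the function names, subject line, and addresses are placeholders. Scheduling (for example from cron) is left out.

```python
import smtplib
import subprocess
from email.message import EmailMessage

# Commands whose output goes into the report.
COMMANDS = {
    "FSCK": ["hadoop", "fsck", "/"],
    "DFSADMIN": ["hadoop", "dfsadmin", "-report"],
}


def run(cmd):
    """Run a command and return its combined stdout and stderr as text."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout + result.stderr


def build_report(outputs, sender, recipient):
    """Build an email whose body has one labelled section per command."""
    msg = EmailMessage()
    msg["Subject"] = "Nightly Hadoop report"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        "\n\n".join(f"== {name} ==\n{text}" for name, text in outputs.items())
    )
    return msg


def send_nightly(sender, recipient, smtp_host="localhost"):
    """Run every command, then mail the collected output."""
    outputs = {name: run(cmd) for name, cmd in COMMANDS.items()}
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(build_report(outputs, sender, recipient))
```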

39. The Future

  • JMX Coming to JobTracker and TaskTracker (0.21)
    • Collect and Graph Jobs Running
    • Collect and Graph Map / Reduce per node
    • Profile Specific Jobs in Cacti?