1www.cs.wisc.edu/condor
HawkEye
A Monitoring and Management Tool for Distributed Systems
Todd Tannenbaum
Department of Computer Sciences
University of Wisconsin-Madison
http://www.cs.wisc.edu/condor
What does Condor have?
› …lots of core technology for building a distributed system
› …lots of core technology for monitoring the status of a machine
› …lots of core technology for managing a work load of tasks
› …lots of really, truly, skilled and experienced developers and researchers at building distributed systems. Some of the best. Standout state employees. Honest. Email for Wisconsin Gov Scott McCallum:
One day an avid Condor user asked:
Say, could Condor Technology be used for distributed system administration??
Time to think…
› Gathered up our experiences with our own management tasks, looked at the mature Condor technology available to us, and the HawkEye effort was born.
› Completely separate from Condor from the end user's perspective. You can install HawkEye, or Condor, or both.
First Component: MONITORING
› Sysadmins first need information about what is happening on the machines they are responsible for.
  › Both current and past
  › Information must be consolidated and easily accessible
  › Information must be dynamic
Condor ClassAds
› Technology for an entity to describe itself
› Simple attribute value pairs
[
  load_average = 1.3
  free_Swap_space_mb = 140
  number_of_processes = 92
  keyboard_idle_secs = 6
  ram = 128
  total_swap = 512
  total_memory = ram + total_swap
  busy = load_average > 1.0
]
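A minimal Python sketch of how attributes that hold expressions (total_memory and busy in the ad above) could be resolved against the other attributes of the same ad. This is not the Condor ClassAd library: the `evaluate` helper and the use of Python callables as stand-ins for ClassAd expressions are assumptions made purely for illustration.

```python
# Illustrative only: a plain dict standing in for a ClassAd.
# Literal values are stored directly; expressions are stored as
# callables that receive the ad and are evaluated on demand.


def evaluate(ad, name):
    """Return an attribute's value, evaluating it first if it is an expression."""
    value = ad[name]
    return value(ad) if callable(value) else value


ad = {
    "load_average": 1.3,
    "free_Swap_space_mb": 140,
    "number_of_processes": 92,
    "keyboard_idle_secs": 6,
    "ram": 128,
    "total_swap": 512,
    # Expressions refer to other attributes of the same ad.
    "total_memory": lambda a: evaluate(a, "ram") + evaluate(a, "total_swap"),
    "busy": lambda a: evaluate(a, "load_average") > 1.0,
}

if __name__ == "__main__":
    print(evaluate(ad, "total_memory"))  # 640
    print(evaluate(ad, "busy"))          # True
```

Because expressions are evaluated on demand, changing load_average in the ad changes the result of busy without touching the expression itself.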
Condor ClassAds, cont.
› No fixed schema
› Attributes can contain values or expressions
› Serialize Ads in XML
› Open source libraries in C++ and Java to:
  › Manipulate Ads and Ad attributes
  › Store Ads
  › Query collections of Ads
› Bindings for Perl and others on the way…
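To make "no fixed schema" and "serialize Ads in XML" concrete, here is a hedged Python sketch that round-trips two ads with different attribute sets through XML. The element and attribute names (`ad`, `attr`, `name`, `type`) are hypothetical and are not the actual ClassAd XML format, and only literal values are handled, not expressions.

```python
import xml.etree.ElementTree as ET

# Casts used to restore Python types when reading an ad back in.
CASTS = {"int": int, "float": float, "str": str, "bool": lambda s: s == "True"}


def ad_to_xml(ad):
    """Serialize a dict of literal attributes to an XML string."""
    root = ET.Element("ad")  # hypothetical element name, not the real format
    for name, value in ad.items():
        attr = ET.SubElement(root, "attr", name=name, type=type(value).__name__)
        attr.text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_ad(text):
    """Parse the XML produced by ad_to_xml back into a dict."""
    ad = {}
    for attr in ET.fromstring(text):
        ad[attr.get("name")] = CASTS[attr.get("type")](attr.text or "")
    return ad


# No fixed schema: the two ads carry different attribute sets.
machine_a = {"load_average": 1.3, "ram": 128, "os": "Linux"}
machine_b = {"keyboard_idle_secs": 6, "console_idle": True}

if __name__ == "__main__":
    for ad in (machine_a, machine_b):
        xml_text = ad_to_xml(ad)
        print(xml_text)
        assert xml_to_ad(xml_text) == ad
```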
[Diagram: two HawkEye Monitoring Agents send ClassAd updates via secure UDP to the HawkEye Manager]
[Diagram: five HawkEye Monitoring Agents around a single HawkEye Manager]
HawkEye Monitoring Agent
[Diagram labels: /proc, kstat…; Hawkeye_Startup_Agent; Hawkeye_Monitor; HawkEye Monitoring Agent; HawkEye Manager; ClassAd updates via secure UDP]
Monitor Agent, cont.
› Updates are sent periodically
  › Information does not get stale
› Updates also serve as a heartbeat monitor
  › Know when a machine is down
› Out of the box, the update ClassAd has many attributes about the machine of interest for system administration
  › Current Prototype = 184 attributes
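The heartbeat idea above can be sketched as follows: the receiving side remembers when each machine last sent an update and treats a machine as down once too many update intervals have passed. The class name, the `missed_allowed` parameter, and the 60-second interval are assumptions for illustration; the slides do not say how HawkEye decides that a machine is down.

```python
import time


class HeartbeatTracker:
    """Track the last update time per machine and report silent machines."""

    def __init__(self, update_interval_secs, missed_allowed=3):
        # A machine counts as down after this many seconds of silence.
        self.timeout = update_interval_secs * missed_allowed
        self.last_seen = {}

    def record_update(self, machine, now=None):
        """Call whenever an update arrives from a machine."""
        self.last_seen[machine] = time.time() if now is None else now

    def machines_down(self, now=None):
        """Return the machines whose last update is older than the timeout."""
        now = time.time() if now is None else now
        return sorted(
            machine
            for machine, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    tracker = HeartbeatTracker(update_interval_secs=60)
    tracker.record_update("node1", now=0)
    tracker.record_update("node2", now=170)
    print(tracker.machines_down(now=200))  # ['node1']
```

The `now` parameter exists only so the example is deterministic; in normal use the wall-clock time is taken automatically.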
Custom Attributes
[Diagram labels: /proc, kstat…; Hawkeye_Startup_Agent; Hawkeye_Monitor; HawkEye Monitoring Agent; HawkEye Manager; data from the hawkeye_update_attribute command line tool]
Create your own HawkEye plugins, or share plugins with others
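As a rough idea of what a custom plugin might compute, the Python sketch below gathers two disk attributes and formats them as attribute = value lines in the style of the ClassAd example. The attribute names and both functions are hypothetical. How such values are actually handed to hawkeye_update_attribute is not shown on the slides, so that step is deliberately left out.

```python
import shutil


def collect_custom_attributes(path="/"):
    """Gather a couple of site-specific attributes (names are hypothetical)."""
    usage = shutil.disk_usage(path)
    used_percent = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "custom_disk_free_mb": usage.free // (1024 * 1024),
        "custom_disk_used_percent": used_percent,
    }


def format_attributes(attrs):
    """Render attributes as 'name = value' lines, one per attribute."""
    return "\n".join(f"{name} = {value}" for name, value in sorted(attrs.items()))


if __name__ == "__main__":
    # Passing these lines on to the agent is outside the scope of this sketch.
    print(format_attributes(collect_custom_attributes()))
```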
Role of HawkEye Manager
› Store all incoming ClassAds in an indexed resident data structure
  › Fast response to client tool queries about current state
  › “Show me all machines with a load average > 10”
› Periodically store ClassAd attributes into a Round Robin Database
  › Store information over time
  › “Show me a graph with the load average for this machine over the past week”
› Speak to clients via CEDAR, HTTP
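The two storage roles of the Manager can be sketched in Python: an in-memory table of the latest ad per machine for current-state queries, and a fixed-size history per attribute in which the oldest sample is overwritten. This is an assumption-laden illustration, not the Manager's implementation. `ManagerStore` and its methods are hypothetical, the table is keyed by machine name only (queries scan it), and the history models only the overwrite-oldest behavior of a round robin database, with no consolidation.

```python
from collections import deque


class ManagerStore:
    """Latest ad per machine plus a bounded history of selected attributes."""

    def __init__(self, history_slots=4):
        self.current = {}  # machine name -> latest ad (a dict)
        self.history = {}  # (machine, attribute) -> deque of samples
        self.history_slots = history_slots

    def update(self, machine, ad):
        """Replace the stored ad for a machine with the incoming one."""
        self.current[machine] = ad

    def query(self, predicate):
        """Return the machines whose current ad satisfies the predicate."""
        return sorted(m for m, ad in self.current.items() if predicate(ad))

    def archive(self, attribute):
        """Append the current value of an attribute to each machine's history."""
        for machine, ad in self.current.items():
            if attribute in ad:
                ring = self.history.setdefault(
                    (machine, attribute), deque(maxlen=self.history_slots)
                )
                ring.append(ad[attribute])  # oldest sample drops off when full


if __name__ == "__main__":
    store = ManagerStore(history_slots=3)
    store.update("node1", {"load_average": 12.5})
    store.update("node2", {"load_average": 0.4})
    # "Show me all machines with a load average > 10"
    print(store.query(lambda ad: ad.get("load_average", 0) > 10))  # ['node1']
```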
But sysadmins also sometimes have to do work…
› Task: copy a new library onto the local disk of each machine.
  › Just a script to copy via rcp/scp to every machine… or is it?
Running tasks on behalf of the sysadmin
› Submit your sysadmin tasks to HawkEye
  › Tasks are stored in a persistent queue by the Manager
  › Tasks can leave the queue upon completion, or repeat after specified intervals
  › Tasks can have complex interdependencies via DAGMan
  › Records are kept on which task ran where
› Sounds like Condor, eh?
  › Yes, but simpler…
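The once-only versus repeating behavior of queued tasks, together with the record of which task ran where, can be sketched in Python with a small priority queue. Everything here is hypothetical (class name, method names, task names); persistence and DAGMan-style interdependencies are not modeled, and the queue lives only in memory.

```python
import heapq
import itertools


class TaskQueue:
    """In-memory queue of tasks that run once or repeat at a fixed interval."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal run times
        self.log = []  # records of (time, task name, machine)

    def submit(self, name, run_at, repeat_every=None):
        """Queue a task; repeat_every=None means once-only."""
        if repeat_every is not None and repeat_every <= 0:
            raise ValueError("repeat_every must be positive")
        heapq.heappush(self._heap, (run_at, next(self._counter), name, repeat_every))

    def run_due(self, now, machine):
        """Run every task due at or before 'now' and log where it ran."""
        while self._heap and self._heap[0][0] <= now:
            run_at, _, name, repeat_every = heapq.heappop(self._heap)
            self.log.append((run_at, name, machine))
            if repeat_every is not None:
                # Repeating tasks re-enter the queue; once-only tasks leave it.
                self.submit(name, run_at + repeat_every, repeat_every)


if __name__ == "__main__":
    queue = TaskQueue()
    queue.submit("copy_library", run_at=0)
    queue.submit("check_disk", run_at=0, repeat_every=10)
    queue.run_due(now=25, machine="node1")
    for record in queue.log:
        print(record)
```

In this run the once-only task is logged a single time, while the repeating task is logged at times 0, 10, and 20 and remains queued for time 30.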
Run Tasks in response to monitoring information
› ClassAd “Requirements” Attribute
› Example: Send email if a machine is low on disk space or low on swap space
  › Submit an email task with an attribute:
    Requirements = free_disk < 5 || free_swap < 5
› Example w/ task interdependency: If load average is high and OS=Linux and console is Idle, submit a task which runs “top”, if top sees Netscape, submit a task to kill Netscape
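The email example above amounts to evaluating the task's Requirements expression against each machine's ad and selecting the machines where it is true. A hedged Python sketch, with a plain function standing in for the ClassAd expression; the machine names and values are invented sample data, and the units of free_disk and free_swap are not stated on the slide.

```python
def low_on_space(ad):
    """Python stand-in for: Requirements = free_disk < 5 || free_swap < 5"""
    return ad["free_disk"] < 5 or ad["free_swap"] < 5


# Hypothetical task description; only the requirements entry matters here.
email_task = {"name": "send_email", "requirements": low_on_space}

# Invented sample ads, as the monitoring side might report them.
machine_ads = {
    "node1": {"free_disk": 40, "free_swap": 120},
    "node2": {"free_disk": 3, "free_swap": 200},
    "node3": {"free_disk": 75, "free_swap": 2},
}


def machines_matching(task, ads):
    """Return the machines whose ad satisfies the task's requirements."""
    return sorted(name for name, ad in ads.items() if task["requirements"](ad))


if __name__ == "__main__":
    print(machines_matching(email_task, machine_ads))  # ['node2', 'node3']
```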
HawkEye Design Goals
› Monitoring
  › Reliable presence
  › Get Data off the node in an extensible, consistent manner
› Run Tasks
  › In response to probe information
  › Repeat or once-only semantics
  › Audit Log
› Independent and self-contained
› Cross-Platform
Current Status
› Just Beginning this project
› Initial release early summer
› Prototypes already running – Stop in and see initial HawkEye Work
  › Rm 3385 on Weds 9am – 12pm