The Network Weather Service
A Distributed Resource Performance Forecasting Service
for Metacomputing
Rich Wolski, Neil T. Spring and Jim Hayes
Presented By: Mohammad Al-Saeed
Organization
Introduction
  Motivation: why the NWS?
  The NWS: what is the NWS?
Related work
NWS system architecture
  Design goals
  System components
NWS components
NWS interface
Conclusion and future work
Motivation
Searching for the environment that delivers the most
Dynamic nature of metacomputing environments
Adaptive applications
  Adapt to changing environments
  Knowledge needed for adaptation
Resource discovery and allocation
The Network Weather Service
A distributed system for producing short-term deliverable performance forecasts
Goal: dynamically measure and forecast the performance deliverable at the application level from a set of network resources
Measurements currently supported:
  Available fraction of CPU time
  End-to-end TCP connection time
  End-to-end TCP network latency
  End-to-end TCP network bandwidth
Related Work
TReno: performance at transport layer using TCP
Pathchar: bandwidth over a path
bprobe/cprobe: bottleneck link speed and competing traffic
Topology-d: uses ping and netperf to find bandwidth between hosts in a group, then analyzes this data to find the minimum-cost logical topology
ReMoS: network resource monitoring
NWS System Architecture
Design objectives
  Scalability: scales to any metacomputing infrastructure
  Predictive accuracy: provides accurate measurements and forecasts
  Non-intrusiveness: shouldn't load the resources it monitors
  Execution longevity: available all the time
  Ubiquity: accessible from everywhere, monitors all resources
System Components
Four different component processes
  Persistent State process: handles storage of measurements
  Name Server process: directory server for the system
  Sensor processes: measure current performance of different resources
  Forecaster process: predicts deliverable performance of a resource during a given time
NWS Components
Persistent State Management
Naming Server
Performance Monitoring: NWS Sensors
  CPU Sensor
  Network Sensor
Sensor Control
  Cliques: hierarchy and contention
  Adaptive time-out discovery
Forecasting
  Forecaster and forecasting models
  Sample forecaster results
Persistent State Management
All NWS processes are stateless
The system state (measurements) is managed by the PS process:
  Storage and retrieval of measurements
  Measurements are time-stamped plain-text strings
  Measurements are written to disk immediately and acknowledged
  Measurements are stored in a circular queue of tunable size
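The storage behavior described on this slide can be sketched as follows. This is a minimal illustration, not the NWS implementation: the class name `PersistentState`, the record layout (`"<timestamp> <value>"`) and the whole-file rewrite on every store are all assumptions made for the example.

```python
import os
import time
from collections import deque


class PersistentState:
    """Hypothetical store: a bounded queue of time-stamped text records."""

    def __init__(self, path, capacity):
        self.path = path
        # deque(maxlen=...) drops the oldest record once capacity is reached,
        # which gives the circular-queue behavior.
        self.records = deque(maxlen=capacity)

    def store(self, value, timestamp=None):
        """Append one measurement, force it to disk, then acknowledge."""
        stamp = time.time() if timestamp is None else timestamp
        self.records.append(f"{stamp:.3f} {value}")
        # Assumption: rewrite the file so the on-disk copy stays bounded.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            f.write("\n".join(self.records) + "\n")
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)
        return True  # the acknowledgement is sent only after the write

    def fetch(self, count):
        """Return up to `count` most recent records, oldest first."""
        if count <= 0:
            return []
        return list(self.records)[-count:]
```

With a capacity of 3, storing five measurements leaves only the last three both in memory and on disk.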
Naming Server
Primitive text string directory service for the NWS system
The only component known system-wide
Information stored includes:
  Name to IP binding information
  Group configuration
  Parameters for various processes
Each process must refresh its registration with the name server periodically
Centralized
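The periodic-refresh requirement can be illustrated with a small registry whose entries lapse unless they are renewed. This is a sketch under assumptions: the class `NameServer`, the `ttl` parameter and the injectable `clock` are hypothetical names for this example, and the slides do not say how the real name server expires registrations.

```python
import time


class NameServer:
    """Hypothetical directory: text name -> (address, expiry time)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock      # injectable so expiry can be tested
        self.entries = {}

    def register(self, name, address, ttl):
        """Create or refresh a registration that lapses after `ttl` seconds."""
        self.entries[name] = (address, self.clock() + ttl)

    def lookup(self, name):
        """Return the address bound to `name`, or None if absent or expired."""
        entry = self.entries.get(name)
        if entry is None:
            return None
        address, expires = entry
        if self.clock() >= expires:
            del self.entries[name]  # the process stopped refreshing; drop it
            return None
        return address
```

A process that calls `register` again before its entry expires stays visible; one that dies silently disappears from the directory after at most `ttl` seconds.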
Performance Monitoring
Actual monitoring is performed by a set of sensors
Accuracy vs. intrusiveness
A sensor's life:
{
  Register with the NS;
  Query the NS for parameters;
  Generate conditional test;
  Forever {
    if conditions are met then {
      perform test;
      time-stamp results and send them to the PS;
      refresh registration with the NS;
    }
  }
}
CPU Sensor
Measures available CPU fraction
Testing tools:
  Unix uptime: reports load average over the past 1, 5 and 15 minutes
  Unix vmstat: reports idle, user and system time
  Active probes
Accuracy:
  Results assume a full-priority job
  Doesn't know the priority of jobs in the queue
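As an illustration of how a load average might be turned into an available CPU fraction, the sketch below assumes a new full-priority job shares the processor equally with the jobs already running, giving 1 / (load + 1). That formula is an assumption made for this example; the slides do not state how the sensor computes its estimate. The function names are hypothetical.

```python
import os


def available_cpu_fraction(load_average):
    """Estimate the CPU share one new full-priority job would receive.

    Assumes equal time-slicing between the new job and the
    `load_average` jobs already competing for the processor.
    """
    if load_average < 0:
        raise ValueError("load average cannot be negative")
    return 1.0 / (load_average + 1.0)


def sample_cpu_fraction():
    """Read the 1-minute load average (Unix only) and convert it."""
    one_minute, _, _ = os.getloadavg()
    return available_cpu_fraction(one_minute)
```

The accuracy caveat on the slide applies directly: if the competing jobs run at lower priority, the true share is larger than this estimate.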
Network Sensor
Carries out network-related measurements
Testing: using active network probes
  Establish and release TCP connections
  Moving large (small) data to measure bandwidth (delay)
  Measures connections with all peer sensors
Problems
  Accuracy: depends on socket interface
  Complexity: N^2 - N tests, collisions (contention)
Network Sensor Control
Sensors are organized into sensor sets called cliques
  Each clique is configurable and has one leader
  Clique sets are logical, but can be based on physical topology
  Leaders are elected using a distributed election protocol
  A sensor can participate in many cliques
Advantages
  Scalability by organizing cliques in a hierarchy
  Reduces the N^2 - N tests
  Accuracy by more frequent tests
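The saving from a clique hierarchy can be checked with a little arithmetic. The sketch below assumes a two-level layout in which every leaf clique contributes exactly one leader to a single top-level clique; that layout and both function names are choices made for illustration, not something the slides specify.

```python
def full_mesh_tests(n):
    """Directed pairwise tests among n sensors: N^2 - N."""
    return n * n - n


def hierarchical_tests(clique_sizes):
    """Tests needed with leaf cliques plus one clique of their leaders.

    Assumption: a two-level hierarchy, one leader per leaf clique.
    """
    within = sum(full_mesh_tests(size) for size in clique_sizes)
    between = full_mesh_tests(len(clique_sizes))
    return within + between
```

For 20 sensors, a full mesh needs 380 tests, while four cliques of five need 4 × 20 + 12 = 92. The trade-off is that pairs in different cliques are no longer measured directly.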
Contention
Each leader maintains a clique token (and time between tokens)
The sensor that has the token performs all its tests then passes the token to the next sensor in the list
Adaptive time-out discovery
  Tokens have a time-out field
  Tokens have sequence numbers
  The leader adaptively controls the time-out
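Token passing and time-out adaptation can be sketched as below. The slides do not say how the leader adjusts the time-out, so the exponential smoothing, the `slack` factor and the `weight` value are assumptions for illustration; `Token`, `circulate` and `adapt_timeout` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Token:
    sequence: int    # lets members discard stale or duplicate tokens
    timeout: float   # seconds the leader waits before issuing a new token


def circulate(members, token, run_tests):
    """Pass the token down the member list; only the holder runs its tests."""
    visits = []
    for member in members:
        run_tests(member)  # no other clique member probes at this moment
        visits.append((member, token.sequence))
    return visits


def adapt_timeout(token, observed_circuit_time, slack=2.0, weight=0.25):
    """Issue the next token with a time-out tracking observed circuit times.

    Assumption: smooth toward `slack` times the last full circuit.
    """
    target = slack * observed_circuit_time
    new_timeout = (1 - weight) * token.timeout + weight * target
    return Token(sequence=token.sequence + 1, timeout=new_timeout)
```

Because tests run only while a sensor holds the token, probes inside one clique never collide with each other.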
Forecaster Process
A forecasting driver and a set of compile-time prediction modules
Forecasting process:
  Fetching required measurements from the PS
  Passing the time series to each prediction module
  Choosing the best returned prediction
Incorporate sophisticated prediction techniques?
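The three-step forecasting process can be sketched as a driver that replays each prediction module over the measurement history and keeps the one with the lowest error. The three predictors, the mean-absolute-error criterion and all names below are assumptions chosen for illustration; the slides do not list the actual modules or the selection rule.

```python
from statistics import mean, median


def last_value(history):
    return history[-1]


def running_mean(history):
    return mean(history)


def window_median(history, window=5):
    return median(history[-window:])


# Hypothetical module table, fixed when the program is written.
PREDICTORS = {"last": last_value, "mean": running_mean, "median": window_median}


def forecast(series):
    """Return (predictor name, forecast) from the lowest-error predictor.

    Each predictor is replayed over the series; its error is the mean
    absolute gap between each one-step-ahead guess and the measurement
    that actually followed.
    """
    if len(series) < 2:
        raise ValueError("need at least two measurements")
    errors = {}
    for name, predict in PREDICTORS.items():
        misses = [abs(predict(series[:i]) - series[i])
                  for i in range(1, len(series))]
        errors[name] = mean(misses)
    best = min(errors, key=errors.get)
    return best, PREDICTORS[best](series)
```

On a steadily rising series the last-value predictor wins, while on a series that oscillates around a fixed level the running mean does better, which is why the driver re-evaluates its choice against the data.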
Sample Forecaster Results
[Figure: recorded bandwidth, UC Santa Barbara – Kansas State U.]
[Figure: forecasted bandwidth, UC Santa Barbara – Kansas State U.]
NWS Interface
C API
  Quick short-term forecasts for applications
  InitForecaster()
  RequestForecasts()
CGI interface
  Continuous access to NWS forecasts through the web
  Interactively produces graphs for performance and forecasts
  http://nws.cs.utk.edu
Conclusion and Future Work
NWS is scalable, stable and always available
NWS relies on adaptivity to achieve its design goals
NWS is open (adding sensors and forecasting models)
Current forecasting compares very well with more powerful, sophisticated forecasting techniques
Enhancements
  Basing the NS on LDAP
  Automatic clique configuration
  Forecasting methodologies