Memento: Efficient Monitoring of Sensor Network Health
Stanislav Rost and Hari Balakrishnan
CSAIL, MIT
SECON, September 2006
“Sed quis custodiet ipsos custodes?”
“But who watches the watchmen?”
- Juvenal, Satire VI
Goals and Challenges of Monitoring

• Goals
  – Accuracy: minimize false alarms and maintenance
  – Timeliness: repair quickly, preserve sensor coverage
  – Efficiency: in power and bandwidth, to help longevity
• Challenges
  – Packet loss: inherent to the wireless medium
  – Dynamic routing topology: adapts to link quality
  – Resource constraints: internal monitoring is not the primary application
Memento Monitoring Suite Breakdown

• Failure detection: which nodes have failed?
• Collection protocol: gathering network-wide health status
• Watchdogs
• Logging
• Remote inspection
Typical Sensornet Framework

• Assume a routing protocol, optimized by a path metric to the root
• Example metric: ETX
  – expected transmission count to reliably transfer a packet
• Periodic communication
  – Protocol advertisements
  – Collection sweeps (1 per sweep period)

[Figure: routing tree with data collection server, gateway node, sensor nodes, and root node]
Two Modules of Memento

• Failure detectors
  – Track communication of a subset of neighbors
  – Detect failures
  – Form liveness beliefs
• Collection protocol
  – Send liveness updates to the root
  – Aggregate along the way; vote on status by aggregation
• Fail-stop failure: node permanently stops communicating (until reset or repaired)
• Heartbeats: periodic beacons of other protocols, or Memento's own; period of transmission is known; packets include the source address
• Liveness update: a bitmap s.t. bit k = 1 iff some node in the subtree thinks node k is alive
• Scope of opportunistic monitoring: all neighbors? children? some?
• Failure-set calculation: at the gateway, [roster – live]
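The liveness bitmap and the gateway's [roster – live] step above can be sketched in a few lines. This is an illustrative sketch, not Memento's implementation; the function names and the integer-bitmap encoding are my assumptions.

```python
def merge_liveness(bitmaps):
    """OR together liveness bitmaps (encoded as ints): bit k = 1 iff
    some node in the subtree believes node k is alive."""
    merged = 0
    for bm in bitmaps:
        merged |= bm
    return merged

def failure_set(roster, live):
    """At the gateway: failed = roster - live, as a bitmap difference."""
    return roster & ~live

# Nodes 0..3 on the roster; two subtrees report overlapping liveness.
live = merge_liveness([0b0101, 0b0011])   # nodes 0, 1, 2 believed alive
failed = failure_set(0b1111, live)        # only node 3 is missing
```

Encoding the bitmap as a single integer keeps the aggregation step a one-word OR, which matches the slide's claim that liveness updates aggregate cheaply along the tree.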
Part I: Failure Detection
Problem Statement

• Given a maximum false positive rate parameter, develop a scheme which minimizes detection time
• Using distributed failure detection:
  – every node is a participant
  – each node may monitor a number of other nodes
Adaptive Failure Detectors

• Declare a neighbor failed after an abnormally long gap in its sequence of heartbeat arrivals
• Estimate "normal" vs. "abnormal" loss bursts from each neighbor
• May produce false positives: beliefs that a node has failed when it is alive
Variance-Bound Detector

• Samples and estimates the mean and stdev of loss-burst lengths
• Provides a guarantee on the rate of false positives
• Based on the one-sided Chebyshev inequality:

    P(X ≥ X̄ + t·σ_X) ≤ 1 / (1 + t²)

• Resulting heartbeat timeout:

    HTO_i = Ḡ_i + σ_{G_i} · √((1 − FP_req) / FP_req)

  FP_req: goal for maximum false positive rate
  G_i: number of consecutive missed heartbeats from neighbor i
  HTO_i: Heartbeat "TimeOut" (in heartbeats) indicating failure
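The timeout expression follows by solving the Chebyshev bound for t at the target false positive rate; a reconstruction of the slide's algebra, using its symbols:

```latex
P\!\left(X \ge \bar{X} + t\,\sigma_X\right) \le \frac{1}{1+t^2} = FP_{req}
\;\Longrightarrow\;
t = \sqrt{\frac{1 - FP_{req}}{FP_{req}}},
\qquad
HTO_i = \bar{G}_i + \sigma_{G_i}\,\sqrt{\frac{1 - FP_{req}}{FP_{req}}}
```

So the gap length G_i exceeds HTO_i with probability at most FP_req for a live neighbor, regardless of the loss-burst distribution.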
Loose Bounds Lead to Long Timeouts

• Chebyshev's inequality covers the worst case at the extremes

[Figure: PMF of loss-burst durations from a neighbor]
  target FP rate = 5%, mean = 4.61, stdev = 3.76
  → heartbeat timeout = 22
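Plugging the slide's example numbers into the variance-bound formula is a one-liner. A minimal sketch (the function name is my own); the raw value comes out just under 21 heartbeats, which the slide conservatively reports as 22.

```python
import math

def variance_bound_hto(mean_gap, stdev_gap, fp_req):
    """Heartbeat timeout from the one-sided Chebyshev bound:
    HTO = mean + stdev * sqrt((1 - fp_req) / fp_req)."""
    t = math.sqrt((1.0 - fp_req) / fp_req)
    return mean_gap + stdev_gap * t

# Slide's example: mean = 4.61, stdev = 3.76, target FP rate = 5%
hto = variance_bound_hto(4.61, 3.76, 0.05)  # ~21 heartbeats before rounding
```

The large multiplier t = √19 ≈ 4.36 is exactly the "loose bound" cost the slide title refers to: the distribution-free guarantee forces a timeout of roughly five mean burst lengths.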
Empirical-CDF Detector

• Samples gap lengths, maintains counters
• If we want FP_req = X%, calculate an HTO that has less than X% chance of occurring:

    HTO_i = min c  s.t.  ( Σ_{j=0..c} Count[G_i = j] ) / ( Σ_k Count[G_i = k] ) ≥ (1 − FP_req)

  FP_req: goal for maximum false positive rate
  G_i: number of consecutive missed heartbeats from neighbor i
  HTO_i: Heartbeat "TimeOut" (in heartbeats) indicating failure
  Count: vector of counters of occurrences of each gap length
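The minimization above is a percentile lookup over the gap-length counters. A minimal sketch under that reading (names and the learning-phase behavior are my assumptions, not Memento's code):

```python
def empirical_cdf_hto(count, fp_req):
    """Smallest timeout c such that the empirical CDF of observed
    gap lengths reaches (1 - fp_req) at c.  count[j] holds the number
    of observed loss bursts of length j heartbeats."""
    total = sum(count)
    if total == 0:
        return None  # still learning: no gap samples yet
    cum = 0
    for c, n in enumerate(count):
        cum += n
        if cum / total >= 1.0 - fp_req:
            return c
    return len(count) - 1  # all mass below target: longest observed gap

# 20 observed bursts; the 95th percentile falls at gap length 3.
hto = empirical_cdf_hto([10, 5, 3, 1, 1], 0.05)
```

The `total == 0` branch reflects the learning-phase problem discussed later in the talk: until a node has gap samples for a new child, it cannot produce a bounded timeout.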
Same Example, Better Bound

    HTO_i = min c  s.t.  ( Σ_{j=0..c} Count[G_i = j] ) / ( Σ_k Count[G_i = k] ) ≥ 0.95

[Figure: CDF of loss-burst durations from a neighbor]
  target FP rate = 5%  →  heartbeat timeout = 12

• Bounds the timeout by the outliers within the requisite percentile
Testing the Tradeoffs on the Experimental Testbed

• Deployed 55-node testbed
• 16,076 square feet
• Implemented in TinyOS v1.4
• Runs on mica2 motes, Crickets, EmStar
Failure Detector Comparison

• 45 minutes in duration
• Pick X nodes randomly (X ∈ {2, 4, 6, 8})
• Schedule their failure at a random time
• sweep = 30 seconds, hb = 10 seconds
• Routing = ETX-based tree
  – Routing stability threshold = 1.5
• Run the same failure schedule for all detector algorithms
Contenders

• Direct-Heartbeat
  – Sends descendants' liveness bitmaps to the root, with aggregation à la TinyDB
  – If the root hears no update about X, it assumes X is dead
• Variance-Bound, 1% FP target
  – Each node monitors its children
• Empirical-CDF, 1% FP target
  – Each node monitors its children
• Opportunistic Variance-Bound, 1% FP target
  – Each node monitors any neighbor whose packet loss < 30%
Evaluating Failure Detectors: False Positive Rate

Evaluating Failure Detectors: Detection Time
Explaining the Results

• Empirical-CDF has trouble during the learning phase
• The learning happens whenever a node gets new children
  – After another node has failed
  – After routing reconfiguration
• Opportunistic monitoring inflates the detection time
  – Neighbors with higher loss need more time to achieve confidence in failure
Meeting the False Positive Guarantee
• How far can we push our FP target?
Tradeoffs and Limits of Guarantees
Take-Home Lessons
• 5x patience gets you 1000x confidence
• Neighborhood opportunism is a must to make failure detection practically useful in wireless environments
Part II: Collecting the Network Status

Aggregation

• TinyAggregation [TinyDB]

Our Collection Protocol: Memento Aggregation

• Parent caches the result
• A node sends an update only if its result or its parent changes
Collection Protocol Summary

• Uses caching to suppress unnecessary communication
• Network-associative cache coherence is tricky; we propose mechanisms to maintain it
• Saves 80–90% bandwidth relative to the state of the art
• More sensitive to the rate of change in the updates than to routing reconfigurations
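The cache-based suppression rule ("send only if your result or your parent changes") can be sketched as a small stateful reporter. This is an illustrative sketch, not Memento's implementation; the class and method names are my own.

```python
class SuppressingReporter:
    """Per-node state for cache-based update suppression: remember the
    last aggregate sent and the parent it was sent to."""

    def __init__(self):
        self.last_sent = None
        self.last_parent = None

    def maybe_send(self, aggregate, parent):
        """Return the update to transmit this sweep, or None to suppress."""
        if aggregate == self.last_sent and parent == self.last_parent:
            return None  # parent's cached copy is still valid: stay silent
        self.last_sent, self.last_parent = aggregate, parent
        return aggregate

r = SuppressingReporter()
r.maybe_send(0b0111, "P1")   # first sweep: transmit
r.maybe_send(0b0111, "P1")   # unchanged: suppressed
r.maybe_send(0b0111, "P2")   # new parent: must repopulate its cache
```

Note how a parent switch forces a retransmission even when the aggregate is unchanged; this is the sensitivity to routing reconfiguration that the evaluation slides measure.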
Conclusions

• The Memento collection protocol is very efficient in terms of bandwidth/energy, and well suited for monitoring
• [In paper] Monitoring more neighbors does not lead to better performance
• New failure detectors, based on application needs
• Neighborhood opportunism is needed to get an acceptably low false positive rate
End of Talk
• Questions?
Memento's Approach to Cache Coherence

• Children switch away? → snoop routing packets carrying the parent address
• Node failures? → failure detectors clear the cache
• Parent cache out of sync? → snoop parent updates, check consistency with your own results; parents advertise a vector of result sequence numbers
• Finite cache slots for child results? → augment routing to subscribe to parents
Collection Protocol Modules
Collection Protocol Evaluation
• Sensitivity to the rate of switching parents?
  – Use ETX; vary the stability threshold (the minimum improvement in "goodness" necessary to switch parents)
Collection Protocol Performance vs Routing Stability
Collection Protocol Evaluation
• Sensitivity to the rate of change in node results?
  – Fix the topology
  – Vary the fraction of nodes whose result changes every sweep
Collection Protocol Performance vs Rate of Change of State
Status Collection Byte Overhead
Related Work

• Sympathy for the Sensor Network Debugger [Ramanathan, Kohler, Estrin | SenSys '05]
• Nucleus [Tolle, Culler | EWSN '05]
• TiNA: Temporal Coherency-Aware In-Network Aggregation [Sharaf, Beaver, Labrinidis, Chrysanthis | MobiDE '03]
• On Failure Detection Algorithms in Overlay Networks [Zhuang, Geels, Stoica, Katz | INFOCOM '05]
• Unreliable Failure Detectors [Chandra, Toueg | JACM '96]; [Gupta, Chandra, Goldszmidt | PODC '01]
More Memento

• Symptom alerts: similar to liveness bitmaps
  – Watchdogs: core health metrics crossing danger thresholds trigger alarms
• Logging to stable storage, to neighbors
• Inspection:
  – Cached alert aggregates serve as "breadcrumbs" on the way back to the sources; prune query floods
• Example app: detecting network partitioning
  – Node X dies, becomes a point of fracture
  – Its parent P sends the bitmap of X's children as "partitioned"
Future Work

• State management in ad-hoc networks
  – Dynamic, yet stateful protocols
  – Working on: management of transfers of large samples
• Static statistical properties of non-mobile deployments
  – Leverage models of group sampling to reduce redundancy and provide load balancing
  – Working on: statistical modeling, building local models representative of global behavior
Simple Failure Detectors

• "Direct-Heartbeat"
  – A neighbor is alive if one or more of its heartbeats is received since the last sweep
  – A neighbor has failed if the failure detector has missed the last K consecutive heartbeats

Dilemma: False Failure Alarms vs. Detection Time

• Choose a network-wide K given the CDF of loss bursts:
Memento Performance Summary

• Intended for non-mobile deployments
• When node status fluctuates, approaches the costs of the cache-less scheme
• Results so far for a long, narrow tree
  – 6 hops max depth
  – 2.5 average children
Scope of the Opportunism

• Which neighbors are worth monitoring? All? Children?

Picking Neighbors to Monitor

• Pick neighbors whose heartbeat delivery probability > X

Tradeoffs in Failure Detection