Memento: Efficient Monitoring of Sensor Network Health
Stanislav Rost and Hari Balakrishnan
CSAIL, MIT
SECON, September 2006
“Sed quis custodiet ipsos custodes?”
“But who watches the watchmen?”
- Juvenal, Satire VI
Goals and Challenges of Monitoring

• Goals
  – Accuracy: minimize false alarms and maintenance
  – Timeliness: repair quickly, preserve sensor coverage
  – Efficiency: in power and bandwidth, to help longevity
• Challenges
  – Packet loss: inherent to the wireless medium
  – Dynamic routing topology: adapts to link quality
  – Resource constraints: internal monitoring is not the primary application
Memento Monitoring Suite Breakdown

• Failure detection: which nodes have failed?
• Collection protocol: gathering network-wide health status
• Watchdogs
• Logging
• Remote inspection
Typical Sensornet Framework

• Assume a routing protocol, optimized by a path metric to the root
• Example metric: ETX
  – expected transmission count to reliably transfer a packet
• Periodic communication
  – Protocol advertisements
  – Collection sweeps (1 per sweep period)

[Figure: routing tree with data collection server, gateway node, sensor nodes, and root node]
Two Modules of Memento

• Failure detectors
  – Track communication of a subset of neighbors
  – Detect failures
  – Form liveness beliefs
• Collection protocol
  – Send liveness updates to the root
  – Aggregate along the way; vote on status by aggregation
• Fail-stop failure: node permanently stops communicating (until reset or repaired)
• Heartbeats: periodic beacons of other protocols, or Memento's own; period of transmission is known; packets include the source address
• Liveness update: a bitmap s.t. bit k = 1 iff some node in the subtree thinks node k is alive
• Scope of opportunistic monitoring: all neighbors? children? some?
• Failure-set calculation: at the gateway, [roster – live]
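The liveness bitmap and the gateway's [roster – live] step above can be sketched in a few lines. This is an illustrative sketch, not Memento's implementation; the function names and the integer-bitmap encoding are my assumptions.

```python
def merge_liveness(bitmaps):
    """OR together liveness bitmaps (encoded as ints): bit k = 1 iff
    some node in the subtree believes node k is alive."""
    merged = 0
    for bm in bitmaps:
        merged |= bm
    return merged

def failure_set(roster, live):
    """At the gateway: failed = roster - live, as a bitmap difference."""
    return roster & ~live

# Nodes 0..3 on the roster; two subtrees report overlapping liveness.
live = merge_liveness([0b0101, 0b0011])   # nodes 0, 1, 2 believed alive
failed = failure_set(0b1111, live)        # only node 3 is missing
```

Encoding the bitmap as a single integer keeps the aggregation step a one-word OR, which matches the slide's claim that liveness updates aggregate cheaply along the tree.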
Part I: Failure Detection
Problem Statement

• Given a maximum false positive rate parameter, develop a scheme which minimizes detection time
• Using distributed failure detection:
  – every node is a participant
  – each node may monitor a number of other nodes
Adaptive Failure Detectors

• Declare a neighbor failed after an abnormally long gap in its sequence of heartbeat arrivals
• Estimate "normal" vs. "abnormal" loss bursts from each neighbor
• May produce false positives: beliefs that a node has failed when it is alive
Variance-Bound Detector

• Samples and estimates the mean and stdev of loss-burst lengths
• Provides a guarantee on the rate of false positives
• Based on the one-sided Chebyshev inequality:

    P(X ≥ X̄ + t·σ_X) ≤ 1 / (1 + t²)

• Resulting heartbeat timeout:

    HTO_i = Ḡ_i + σ_{G_i} · √((1 − FP_req) / FP_req)

  FP_req: goal for maximum false positive rate
  G_i: number of consecutive missed heartbeats from neighbor i
  HTO_i: Heartbeat "TimeOut" (in heartbeats) indicating failure
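The timeout expression follows by solving the Chebyshev bound for t at the target false positive rate; a reconstruction of the slide's algebra, using its symbols:

```latex
P\!\left(X \ge \bar{X} + t\,\sigma_X\right) \le \frac{1}{1+t^2} = FP_{req}
\;\Longrightarrow\;
t = \sqrt{\frac{1 - FP_{req}}{FP_{req}}},
\qquad
HTO_i = \bar{G}_i + \sigma_{G_i}\,\sqrt{\frac{1 - FP_{req}}{FP_{req}}}
```

So the gap length G_i exceeds HTO_i with probability at most FP_req for a live neighbor, regardless of the loss-burst distribution.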
Loose Bounds Lead to Long Timeouts

• Chebyshev's inequality covers the worst case at the extremes

[Figure: PMF of loss-burst durations from a neighbor]
  target FP rate = 5%, mean = 4.61, stdev = 3.76
  → heartbeat timeout = 22
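Plugging the slide's example numbers into the variance-bound formula is a one-liner. A minimal sketch (the function name is my own); the raw value comes out just under 21 heartbeats, which the slide conservatively reports as 22.

```python
import math

def variance_bound_hto(mean_gap, stdev_gap, fp_req):
    """Heartbeat timeout from the one-sided Chebyshev bound:
    HTO = mean + stdev * sqrt((1 - fp_req) / fp_req)."""
    t = math.sqrt((1.0 - fp_req) / fp_req)
    return mean_gap + stdev_gap * t

# Slide's example: mean = 4.61, stdev = 3.76, target FP rate = 5%
hto = variance_bound_hto(4.61, 3.76, 0.05)  # ~21 heartbeats before rounding
```

The large multiplier t = √19 ≈ 4.36 is exactly the "loose bound" cost the slide title refers to: the distribution-free guarantee forces a timeout of roughly five mean burst lengths.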
Empirical-CDF Detector

• Samples gap lengths, maintains counters
• If we want FP_req = X%, calculate an HTO that has less than X% chance of occurring:

    HTO_i = min c  s.t.  ( Σ_{j=0..c} Count[G_i = j] ) / ( Σ_k Count[G_i = k] ) ≥ (1 − FP_req)

  FP_req: goal for maximum false positive rate
  G_i: number of consecutive missed heartbeats from neighbor i
  HTO_i: Heartbeat "TimeOut" (in heartbeats) indicating failure
  Count: vector of counters of occurrences of each gap length
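The minimization above is a percentile lookup over the gap-length counters. A minimal sketch under that reading (names and the learning-phase behavior are my assumptions, not Memento's code):

```python
def empirical_cdf_hto(count, fp_req):
    """Smallest timeout c such that the empirical CDF of observed
    gap lengths reaches (1 - fp_req) at c.  count[j] holds the number
    of observed loss bursts of length j heartbeats."""
    total = sum(count)
    if total == 0:
        return None  # still learning: no gap samples yet
    cum = 0
    for c, n in enumerate(count):
        cum += n
        if cum / total >= 1.0 - fp_req:
            return c
    return len(count) - 1  # all mass below target: longest observed gap

# 20 observed bursts; the 95th percentile falls at gap length 3.
hto = empirical_cdf_hto([10, 5, 3, 1, 1], 0.05)
```

The `total == 0` branch reflects the learning-phase problem discussed later in the talk: until a node has gap samples for a new child, it cannot produce a bounded timeout.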
Same Example, Better Bound

    HTO_i = min c  s.t.  ( Σ_{j=0..c} Count[G_i = j] ) / ( Σ_k Count[G_i = k] ) ≥ 0.95

[Figure: CDF of loss-burst durations from a neighbor]
  target FP rate = 5%  →  heartbeat timeout = 12

• Bounds the timeout by the outliers within the requisite percentile
Testing the Tradeoffs on the Experimental Testbed

• Deployed 55-node testbed
• 16,076 square feet
• Implemented in TinyOS v1.4
• Runs on mica2 motes, Crickets, EmStar
Failure Detector Comparison

• 45 minutes in duration
• Pick X nodes randomly (X ∈ {2, 4, 6, 8})
• Schedule their failure at a random time
• sweep = 30 seconds, hb = 10 seconds
• Routing = ETX-based tree
  – Routing stability threshold = 1.5
• Run the same failure schedule for all detector algorithms
Contenders

• Direct-Heartbeat
  – Sends descendants' liveness bitmaps to the root, with aggregation à la TinyDB
  – If the root hears no update about X, it assumes X is dead
• Variance-Bound, 1% FP target
  – Each node monitors its children
• Empirical-CDF, 1% FP target
  – Each node monitors its children
• Opportunistic Variance-Bound, 1% FP target
  – Each node monitors any neighbor whose packet loss < 30%
Evaluating Failure Detectors: False Positive Rate

Evaluating Failure Detectors: Detection Time
Explaining the Results

• Empirical-CDF has trouble during the learning phase
• The learning happens whenever a node gets new children
  – After another node has failed
  – After routing reconfiguration
• Opportunistic monitoring inflates the detection time
  – Neighbors with higher loss need more time to achieve confidence in failure
Meeting the False Positive Guarantee
• How far can we push our FP target?
Tradeoffs and Limits of Guarantees
Take-Home Lessons
• 5x patience gets you 1000x confidence
• Neighborhood opportunism is a must to make failure detection practically useful in wireless environments
Part II: Collecting the Network Status

Aggregation

• TinyAggregation [TinyDB]

Our Collection Protocol: Memento Aggregation

• Parent caches the result
• A node sends an update only if its result or its parent changes
Collection Protocol Summary

• Uses caching to suppress unnecessary communication
• Network-associative cache coherence is tricky; we propose mechanisms to maintain it
• Saves 80–90% bandwidth relative to the state of the art
• More sensitive to the rate of change in the updates than to routing reconfigurations
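The cache-based suppression rule ("send only if your result or your parent changes") can be sketched as a small stateful reporter. This is an illustrative sketch, not Memento's implementation; the class and method names are my own.

```python
class SuppressingReporter:
    """Per-node state for cache-based update suppression: remember the
    last aggregate sent and the parent it was sent to."""

    def __init__(self):
        self.last_sent = None
        self.last_parent = None

    def maybe_send(self, aggregate, parent):
        """Return the update to transmit this sweep, or None to suppress."""
        if aggregate == self.last_sent and parent == self.last_parent:
            return None  # parent's cached copy is still valid: stay silent
        self.last_sent, self.last_parent = aggregate, parent
        return aggregate

r = SuppressingReporter()
r.maybe_send(0b0111, "P1")   # first sweep: transmit
r.maybe_send(0b0111, "P1")   # unchanged: suppressed
r.maybe_send(0b0111, "P2")   # new parent: must repopulate its cache
```

Note how a parent switch forces a retransmission even when the aggregate is unchanged; this is the sensitivity to routing reconfiguration that the evaluation slides measure.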
Conclusions

• The Memento collection protocol is very efficient in terms of bandwidth/energy, and well suited for monitoring
• [In paper] Monitoring more neighbors does not lead to better performance
• New failure detectors, based on application needs
• Neighborhood opportunism is needed to get an acceptably low false positive rate
End of Talk
• Questions?
Memento's Approach to Cache Coherence

• Children switch away? → snoop routing packets carrying the parent address
• Node failures? → failure detectors clear the cache
• Parent cache out of sync? → snoop parent updates, check consistency with your own results; parents advertise a vector of result sequence numbers
• Finite cache slots for child results? → augment routing to subscribe to parents
Collection Protocol Modules
Collection Protocol Evaluation
• Sensitivity to the rate of switching parents?
  – Use ETX; vary the stability threshold (the minimum improvement in "goodness" necessary to switch parents)
Collection Protocol Performance vs Routing Stability
Collection Protocol Evaluation
• Sensitivity to the rate of change in node results?
  – Fix the topology
  – Vary the fraction of nodes whose result changes every sweep
Collection Protocol Performance vs Rate of Change of State
Status Collection Byte Overhead
Related Work

• Sympathy for the Sensor Network Debugger [Ramanathan, Kohler, Estrin | SenSys '05]
• Nucleus [Tolle, Culler | EWSN '05]
• TiNA: Temporal Coherency-Aware In-Network Aggregation [Sharaf, Beaver, Labrinidis, Chrysanthis | MobiDE '03]
• On Failure Detection Algorithms in Overlay Networks [Zhuang, Geels, Stoica, Katz | INFOCOM '05]
• Unreliable Failure Detectors [Chandra, Toueg | JACM '96]; [Gupta, Chandra, Goldszmidt | PODC '01]
More Memento

• Symptom alerts: similar to liveness bitmaps
  – Watchdogs: core health metrics crossing danger thresholds trigger alarms
• Logging to stable storage, to neighbors
• Inspection:
  – Cached alert aggregates serve as "breadcrumbs" on the way back to the sources; prune query floods
• Example app: detecting network partitioning
  – Node X dies, becomes a point of fracture
  – Its parent P sends the bitmap of X's children as "partitioned"
Future Work

• State management in ad-hoc networks
  – Dynamic, yet stateful protocols
  – Working on: management of transfers of large samples
• Static statistical properties of non-mobile deployments
  – Leverage models of group sampling to reduce redundancy and provide load balancing
  – Working on: statistical modeling, building local models representative of global behavior
Simple Failure Detectors

• "Direct-Heartbeat"
  – A neighbor is alive if one or more of its heartbeats is received since the last sweep
  – A neighbor has failed if the failure detector has missed the last K consecutive heartbeats

Dilemma: False Failure Alarms vs. Detection Time

• Choose a network-wide K given the CDF of loss bursts:
Memento Performance Summary

• Intended for non-mobile deployments
• When node status fluctuates, approaches the costs of the cache-less scheme
• Results so far for a long, narrow tree
  – 6 hops max depth
  – 2.5 average children
Scope of the Opportunism

• Which neighbors are worth monitoring? All? Children?

Picking Neighbors to Monitor

• Pick neighbors whose heartbeat delivery probability > X

Tradeoffs in Failure Detection