Memento: Efficient Monitoring of Sensor Network Health Stanislav Rost and Hari Balakrishnan CSAIL, MIT SECON, September 2006

“Sed quis custodiet ipsos custodes?”

“But who watches the watchmen?”

- Juvenal, Satire VI

Goals and Challenges of Monitoring

• Goals
– Accuracy: minimize false alarms, maintenance
– Timeliness: repair quickly, preserve sensor coverage
– Efficiency: in power, bandwidth, to help longevity

• Challenges
– Packet loss: inherent to wireless medium
– Dynamic routing topology: adapts to link quality
– Resource constraints: internal monitoring is not the primary application

Memento Monitoring Suite Breakdown

• Failure detection: which nodes have failed?

• Collection protocol: gathering network-wide health status

• Watchdogs
• Logging
• Remote inspection

Typical Sensornet Framework

• Assume routing protocol, optimized by path metric to root

• Example metric: ETX
– expected transmission count to reliably transfer a packet

• Periodic communication
– Protocol advertisements
– Collection sweeps (1 per sweep period)

[Figure: network diagram; labels: root node, data collection server, gateway node, sensor nodes]
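As a rough illustration of the ETX path metric named on this slide, the sketch below uses the commonly cited definition: a link's ETX is the reciprocal of the product of its forward and reverse delivery ratios, and a path's ETX is the sum over its links. The function names and the sample delivery ratios are hypothetical and are not taken from the talk.

```python
def link_etx(forward_delivery, reverse_delivery):
    """Expected transmissions for one link, given delivery ratios in (0, 1]."""
    if forward_delivery <= 0 or reverse_delivery <= 0:
        return float("inf")  # an unusable link never delivers
    return 1.0 / (forward_delivery * reverse_delivery)


def path_etx(links):
    """Path metric to the root: sum of per-link ETX values."""
    return sum(link_etx(f, r) for f, r in links)


# Hypothetical two-hop path: a perfect link followed by a lossy one.
example_path = [(1.0, 1.0), (0.5, 1.0)]
example_metric = path_etx(example_path)
```

A node would typically prefer the parent that minimizes this sum, which is why the routing tree adapts as link quality changes.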

Two Modules of Memento

• Failure detectors
– Track communication of a subset of neighbors
– Detect failures
– Form liveness beliefs

• Collection protocol
– Send liveness updates to the root
– Aggregate along the way, vote on status by aggregation

Fail-stop failure: node permanently stops communicating (until reset or repaired)

Heartbeats: periodic beacons of other protocols, or Memento’s own
– Known period of transmission
– Packets include the source address

Liveness update: a bitmap such that bit k = 1 if some node in the subtree thinks node k is alive

Scope of opportunistic monitoring: all? children? some?

Failure set calculation: at gateway, [roster – live]
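The liveness bitmap and the gateway's failure-set calculation can be sketched as follows. This is a minimal illustration that assumes bitmaps are plain integers, that aggregation is a bitwise OR of a node's own beliefs with its children's bitmaps, and that the roster is a set of node IDs; the function names are hypothetical.

```python
def liveness_bitmap(alive_ids):
    """Set bit k for every node k this node believes to be alive."""
    bitmap = 0
    for k in alive_ids:
        bitmap |= 1 << k
    return bitmap


def aggregate(own_bitmap, child_bitmaps):
    """Bit k is 1 if any node in the subtree believes node k is alive."""
    result = own_bitmap
    for child in child_bitmaps:
        result |= child
    return result


def failure_set(roster, live_bitmap):
    """At the gateway: roster minus the nodes reported live."""
    return {k for k in roster if ((live_bitmap >> k) & 1) == 0}


# Hypothetical four-node roster; nobody vouches for node 3.
roster = {1, 2, 3, 4}
at_root = aggregate(liveness_bitmap({1}),
                    [liveness_bitmap({1, 2}), liveness_bitmap({2, 4})])
failed = failure_set(roster, at_root)
```

Because OR is associative and commutative, the result at the root does not depend on the shape of the routing tree, only on which beliefs reach it.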

Part I: Failure Detection
Problem Statement

• Given a maximum false positive rate parameter, develop a scheme which minimizes detection time

• Using distributed failure detection:
– every node is a participant
– may monitor a number of other nodes

Adaptive Failure Detectors

• Declare neighbor failed after an abnormally long gap in sequence of heartbeat arrivals

• Estimate “normal” loss burst vs. “abnormal” loss burst from each neighbor

• May produce false positives: beliefs that a node has failed when it is alive

Variance-Bound Detector

• Samples and estimates mean, stdev of loss burst

• Provides a guarantee on rate of false positives

• Based on one-sided Chebyshev’s inequality:

\[ P(X - \mu_X \ge t\,\sigma_X) \le \frac{1}{1 + t^2} \]

\[ HTO_i = \mu_{G_i} + \sigma_{G_i} \sqrt{\frac{1 - FP_{req}}{FP_{req}}} \]

FPreq: goal for maximum false positive rate

Gi: number of consecutive missed heartbeats from neighbor i

HTOi: Heartbeat “TimeOut” (in hb) indicating failure

Loose Bounds Lead to Long Timeouts

• Chebyshev’s inequality provides the worst case for the extremes

Example data set: PMF of loss burst durations from a neighbor

target FP rate = 5%
mean = 4.61
stdev = 3.76

Heartbeat timeout = 22
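A minimal sketch of the variance-bound timeout, assuming the one-sided Chebyshev form P(X − μ ≥ tσ) ≤ 1/(1 + t²), so that setting the bound equal to the target false positive rate gives t = sqrt((1 − FP) / FP). The function names are hypothetical. With the rounded mean and standard deviation shown on this slide the expression evaluates to roughly 21 heartbeats; the slide reports 22, and how the raw value is turned into a whole number of heartbeats is not stated here.

```python
import math
import statistics


def chebyshev_timeout(mean_gap, stdev_gap, fp_req):
    """Heartbeat timeout (in heartbeats) from the one-sided Chebyshev bound."""
    if not 0.0 < fp_req < 1.0:
        raise ValueError("fp_req must be strictly between 0 and 1")
    t = math.sqrt((1.0 - fp_req) / fp_req)
    return mean_gap + t * stdev_gap


def timeout_from_samples(gaps, fp_req):
    """Estimate mean and stdev from observed loss-burst lengths first."""
    mean_gap = statistics.mean(gaps)
    stdev_gap = statistics.pstdev(gaps)  # population stdev of the samples
    return chebyshev_timeout(mean_gap, stdev_gap, fp_req)


# Values from the slide: mean = 4.61, stdev = 3.76, target FP rate = 5%.
slide_timeout = chebyshev_timeout(4.61, 3.76, 0.05)
```

The bound holds for any distribution with the given mean and variance, which is why it tends to be loose for the actual loss-burst distribution.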

Empirical-CDF Detector

• Samples gap lengths, maintains counters

• If we want FPreq = X%, calculate an HTO such that a gap exceeding it has less than X% chance of occurring

\[ HTO_i = \min c \ \text{ s.t. } \ \frac{\sum_{j=0}^{c} Count[G_i = j]}{\sum_{k} Count[G_i = k]} \ge (1 - FP_{req}) \]

FPreq: goal for maximum false positive rate
Gi: number of consecutive missed heartbeats from neighbor i
HTOi: Heartbeat “TimeOut” (in hb) indicating failure
Count: vector of counters of occurrences of each gap length

Same Example, Better Bound

\[ \min c \ \text{ s.t. } \ \frac{\sum_{j=0}^{c} Count[G_i = j]}{\sum_{k} Count[G_i = k]} \ge 0.95 \]

Example data set: CDF of loss burst durations from a neighbor

target FP rate = 5%

Heartbeat timeout = 12

• Bounds the timeout by the outliers within the requisite percentile
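The empirical-CDF rule can be sketched as a scan over per-gap-length counters: pick the smallest gap length c whose cumulative share of observations reaches 1 − FPreq. This is an illustrative sketch only; the function name and the sample counter values are hypothetical, and the choice of "at least" rather than "strictly greater than" for the threshold is an assumption.

```python
def empirical_cdf_timeout(counts, fp_req):
    """Smallest gap length c with CDF(c) >= 1 - fp_req.

    counts[j] is the number of observed loss bursts of exactly j heartbeats.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no gap samples recorded yet")
    threshold = (1.0 - fp_req) * total
    cumulative = 0
    for c, occurrences in enumerate(counts):
        cumulative += occurrences
        if cumulative >= threshold:
            return c
    return len(counts) - 1  # reached only through floating-point rounding


# Hypothetical counters for gap lengths 0..5 (100 samples in total).
sample_counts = [50, 20, 10, 10, 6, 4]
timeout_5pct = empirical_cdf_timeout(sample_counts, 0.05)
timeout_1pct = empirical_cdf_timeout(sample_counts, 0.01)
```

A tighter false positive target pushes the timeout further into the tail of the observed distribution, which lengthens detection time.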

Testing the Tradeoffs on the Experimental Testbed

• Deployed 55-node testbed

• 16,076 square feet

• Implemented in TinyOS v1.4

• Runs on mica2 motes, crickets, EmStar

Failure Detector Comparison

• 45 minutes in duration

• sweep = 30 seconds, hb = 10 seconds

• Pick X nodes randomly (X ∈ {2, 4, 6, 8})

• Schedule their failure at a random time

• Run same failure schedule for all detector algorithms

• Routing = ETX-based tree
– Routing stability threshold = 1.5

Contenders

• Direct-Heartbeat
– Sends descendants’ liveness bitmaps to the root, with aggregation à la TinyDB
– If root hears no update about X, assumes X is dead

• Variance-Bound, 1% FP target
– Each node monitors its children

• Empirical-CDF, 1% FP target
– Each node monitors its children

• Opportunistic Variance-Bound, 1% FP
– Each node monitors any neighbor whose packet loss < 30%

Evaluating Failure Detectors: False Positive Rate

Evaluating Failure Detectors: Detection Time

Explaining the Results

• Empirical-CDF has trouble during the learning phase

• The learning happens whenever a node gets new children
– After another node has failed
– After routing reconfiguration

• Opportunistic monitoring inflates the detection time
– Neighbors with higher loss need more time to achieve confidence in failure

Meeting the False Positive Guarantee

• How far can we push our FP target?

Tradeoffs and Limits of Guarantees

Take-Home Lessons

• 5x patience gets you 1000x confidence

• Neighborhood opportunism is a must to make failure detection practically useful in wireless environments

Part II: Collecting the Network Status

Aggregation

• TinyAggregation [TinyDB]

Our Collection Protocol

Memento

Aggregation

• Parent caches result

• Node sends an update only if its result or parent changes
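The suppression rule on this slide can be sketched as follows: a node recomputes its aggregate each sweep and transmits only when the result or its parent differs from what it last sent, relying on the parent to cache the previous value. This is a simplified illustration, not the protocol's actual implementation; the class and method names are hypothetical, results are modeled as integer bitmaps combined with OR, and cache-coherence handling is omitted.

```python
class Node:
    def __init__(self, node_id, parent):
        self.node_id = node_id
        self.parent = parent
        self.child_cache = {}   # child id -> last result received from it
        self.last_sent = None   # (parent, result) most recently transmitted

    def receive(self, child_id, result):
        """Cache a child's result so it need not be resent every sweep."""
        self.child_cache[child_id] = result

    def result(self, own_result):
        """Aggregate own result with all cached child results."""
        combined = own_result
        for cached in self.child_cache.values():
            combined |= cached
        return combined

    def sweep(self, own_result):
        """Return the update to send this sweep, or None if suppressed."""
        current = (self.parent, self.result(own_result))
        if current == self.last_sent:
            return None
        self.last_sent = current
        return current
```

In a stable network most sweeps return None, which is where the bandwidth saving comes from; every change in results or in the routing tree forces a retransmission.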

Collection Protocol Summary

• Uses caching to suppress unnecessary communication

• Network-associative cache coherence is tricky; we propose mechanisms to maintain it

• Saves 80-90% bandwidth relative to state-of-the-art

• More sensitive to the rate of change in the update than to routing reconfigurations

Conclusions

• Memento collection protocol is very efficient in terms of bandwidth/energy, and well-suited for monitoring

• [In paper] Monitoring more neighbors does not lead to better performance

• New failure detectors, based on application needs

• Need to use neighborhood opportunism to get acceptably low false positive rate

End of Talk

• Questions?

Memento’s Approach to Cache Coherence

• Children switch away?
– snoop routing packets with parent address

• Node failures?
– failure detectors clear the cache

• Parent cache out of sync?
– snoop parent updates, see if consistent with your results
– parents advertise a vector of result sequence #’s

• Finite cache slots for child results?
– augment routing to subscribe to parents

Collection Protocol Modules

Collection Protocol Evaluation

• Sensitivity to rate of switching parents?
– Use ETX, vary the stability threshold (the minimum improvement in “goodness” necessary to switch parents)

Collection Protocol Performance vs Routing Stability

Collection Protocol Evaluation

• Sensitivity to the rate of change in node results?
– Fix the topology
– Vary the fraction of nodes whose result changes every sweep

Collection Protocol Performance vs Rate of Change of State

Status Collection Byte Overhead

Related Work

• Sympathy for the Sensor Network Debugger [Ramanathan, Kohler, Estrin | SenSys ’05]

• Nucleus [Tolle, Culler | EWSN ’05]

• TiNA: Temporal Coherency-Aware In-Network Aggregation [Sharaf, Beaver, Labrinidis, Chrysanthis | MobiDE ’03]

• On Failure Detection Algorithms in Overlay Networks [Zhuang, Geels, Stoica, Katz | INFOCOM ’05]

• Unreliable Failure Detectors [Chandra, Toueg | JACM ’96] [Gupta, Chandra, Goldszmidt | PODC ’01]

More Memento

• Symptom alerts: similar to liveness bitmaps
– Watchdogs: core health metrics crossing danger thresholds trigger alarms

• Logging to stable storage, to neighbors

• Inspection:
– Cached alert aggregates serve as “breadcrumbs” on the way back to the sources, prune query floods

• Example app: detecting network partitioning
– Node X dies, becomes point of fracture
– Its parent P sends bitmap of children as “partitioned”

Future Work

• State management in ad-hoc networks
– Dynamic, yet stateful protocols
– Working on: management of transfers of large samples

• Static statistical properties of non-mobile deployments
– Leverage models of group sampling to reduce redundancy, provide load-balancing
– Working on: statistical modeling, building local models representative of global behavior

Simple Failure Detectors

• “Direct-Heartbeat”
– A neighbor is alive if one or more of its heartbeats is received since last sweep
– A neighbor has failed if failure detector has missed last K consecutive heartbeats
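The fixed-threshold rule above can be sketched as a per-neighbor counter that resets on every received heartbeat and declares failure once K consecutive heartbeats have been missed. The class and method names are hypothetical, and the sketch assumes the detector is called once per expected heartbeat period.

```python
class ConsecutiveMissDetector:
    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.missed = 0  # length of the current run of missed heartbeats

    def observe(self, heartbeat_received):
        """Record one heartbeat period; return True if the neighbor is declared failed."""
        if heartbeat_received:
            self.missed = 0
        else:
            self.missed += 1
        return self.missed >= self.k


# Hypothetical trace: a short loss burst, a recovery, then a long gap.
detector = ConsecutiveMissDetector(k=3)
trace = [True, False, False, True, False, False, False]
verdicts = [detector.observe(received) for received in trace]
```

A single network-wide K ignores per-link loss patterns, which is the limitation the adaptive detectors in Part I try to address by choosing a timeout per neighbor.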

Dilemma: False Failure Alarms vs Detection Time

• Choose network-wide K given CDF of loss bursts:

Memento Performance Summary

• Intended for non-mobile deployments

• When node status fluctuates, approaches the costs of the cache-less scheme

• Results so far for a long, narrow tree
– 6 hops max depth
– 2.5 average children

Scope of the Opportunism

• Which neighbors are worth monitoring?

All Children

Picking Neighbors to Monitor

Pick neighbors whose heartbeat delivery probability > X

Tradeoffs in Failure Detection