Being SMART About Failures: Assessing Repairs in Smart Homes

Being SMART About Failures: Assessing Repairs in Smart Homes

Krasimira Kapitanova, Enamul Hoque, John A. Stankovic, Kamin Whitehouse, Sang H. Son

University of Virginia, DGIST, Dept of Information and Communication Engineering

2013.02.25 Study GroupJunction

Outline Introduction Proposed Solution SMART Approach Detail Experimental Setup Results Discussions Conclusions


Introduction Smart home applications:

Home automation, energy efficiency, home security

Commercial Sensors: Inexpensive, wireless, battery-powered Reducing hardware and installation costs

Problems Do-it-yourself, low-cost sensors

Homes with hundreds of sensors had one sensor failure per day on average

Suffer from many type of faults: Break down, lose power

Non-fail-stop failure Sensor does not completely fail. It continues to

report values, but the meaning of the value changes or becomes invalid.

e.g. be dislodged, fall off and be re-mounted, covered by objects or blocked by an open door or re-arranged furniture.

The maintenance cost of fixing all such failures is prohibitive, and may negate any cost advantage of inexpensive hardware

and installation


Existing solution Detect and report fail-stop hardware failures

Repeatedly querying the nodes or checking for lost data

Þ Not non-fail-stop (still response) Detect non-fail-stop failures

Exploiting correlations between neighboring sensors

Þ bottom up approachÞ Use patterns in the raw sensor dataÞ (O) Homogeneous, periodic and continuous-valued

sensorsÞ (X) Heterogeneous, binary, and event-triggered

sensors

Proposed solution Simultaneous Multi-classifier Activity

Recognition Technique (SMART) Use top down application-level semantics to

detect, assess and adapt to sensor failures Detect non-fail-stop node failures:

Getting stuck at a value, node displacement, or node relocation

Runtime failure detection using multiple classifier instances That are trained to recognize the same set of activities

based on different subsets of sensors One fails -> affect a subset of the classifiers -> change the ratio of activity detection among the classifier

Proposed solution Once failure is detected

Adapts to the failure excluding the failed node creating a new classifier ensemble based on the

remaining subset of nodes Use data replay analysis to assess

whether the failure would have affected activity recognition in the past had the new classifier ensemble been used.

Y: dispatches a maintenance person to repair the failure

N: no maintenance is necessary

Comparison with bottom up method How accurately correlation-based techniques

can detect non-fail-stop failures in event-driven application? Activity recognition 43-day-long dataset from a two-resident home Failure -> movement failure, where one of the

kitchen motion sensor is accidentally moved to point in a different direction

Correlation-based results Cannot achieve failure detection accuracy higher

than 80% The accuracy decreases as the number of

consecutive testing days increases. Reason: Temporal correlation

between nodes => when two nodes fire together

But not looks at which activity is being performed

Application-level feature Top-down approach !


SMART Approach Train classifier instances for all possible combinations of node failures by holding those nodes out of the training set.

Analyze the effect of sensor failures on the classifiers’ performance

Update the classifier ensemble to contain classifiers that are trained for the failure.

If the new classifier ensemble can maintain the detection accuracy of the application above the specified severity threshold THS.

1 The training and use of the classifier ensemble

2The detection of non-fail-stop failures

3

The node failure severity analysis, which allows to decrease the number of maintenance dispatches.

4How SMART maintains high detection accuracy under failures

Using multiple simultaneous classifiers Single-node failures

|S|+1 classifier is used

Detect the occurrence of failure

Identify which nodes have failed by monitoring the relative behavior of the original classifier instance and that of the other |S| instances

Maintain high detection accuracy in the presence of failures

Failure detection By analyzing the relative performance of the

classifiers that had that node in its training set versus those classifiers that did not. e.g. sensor s fails, a change in the relative

behavior of C ( S , S-s ) and C ( S-s , S-s )

Calculate F-score of each of these classifiers with respect to the original classifier C0 to measure the similarity between their outputs. e.g. the F-score for a classifier Ci ( S-si , S )

Definitionactual class

(observation)

predicted class

(expectation)

tpCorrect result

fpUnexpected

result

fnMissing result

tnCorrect

absence of result

𝑡𝑝𝑡𝑝+ 𝑓𝑛

𝑡𝑝𝑡𝑝+ 𝑓𝑝Precision =

Recall =

F-score =

Failure detection Each of the |S| classifiers has an F-score

associated with it, forming F-score vector:

Characterize the behavior of the system when there are no failures

If a failure occurs, based on the severity of the failure, it might affect some or even all of the values in the F-score vector.

Failure detection

Since CA, CB, and C0 have all been trained with node C, their relative ratios will remain similar They are affected in a similar way.

CC was trained by holding node C out and will therefore change in behavior relative to classifier C0.

SMART can thus infer a failure has occurred and identify the cause of that failure.

Failure Recognition Step 1: Is there a failure or not?

FDC (failure detection classifier) Trained to distinguish between non-failure F-score

vectors and failure F-score vectors. Method: use historical data to generate both failure and

non-failure F-score and train the FDC.

Run-time: The system calculates the relative F-scores for the |S|

classifiers and builds a F-score vector. FDC determine if the F-score vector represents a

system that has a failure or not. => determine the failure detection accuracy

Artificially introducing failures in the historical data through

modifying the readings of the “failed” nodes.

Failure Recognition Step 2: Which node has failed?

FIC (failure identification classifier) Determines which of the nodes has failed. Is trained to distinguish between different node failures. By modifying the readings of the “failed” nodes in the

historical data. Evaluate the failure identification accuracy.

Reaction to small fluctuations Small fluctuations between F-score vectors

might occur even without failures. If residents of a home alter their normal activity

patterns Non-sever failures might cause very small changes

to the behavior of the classifiers. FDC -> false positive and false negative

To improve the accuracy of the failure detection: Increase the size of the historical data used for

training Increase the failure detection latency

Node failure severity analysis node failures are going to impact the

application differently based on the available level of redundancy in the system. e.g. cooking activity 2 sensors in the kitchen the star one close to the sink Different degree of impaction More sensors?

Node failure severity analysis Severity measurement when failure detected

Assume sensor s fails

To determine the effect of the failure on the appliction If it decreases the detection accuracy below the

severity threshold THS, and maintenance is dispatched every time a severe failure is detected.

THS, can be specified by per-activity severity

Maintaining detection accuracy under failure SMART uses a variety of classifier types

Naïve Bayesian Hidden Markov Model Switch to the classifier instance among all types

and training sets that performed best on the training data


Experimental Setup Evaluate SMART on 3 houses in 2 publicly

available activity recognition datasets. 1) CASAS datasets:

40 days of labeled data from over 60 sensors, 2-resident home

2) two single-resident homes: House A and House B House A: 25 days from 14 sensors House B: 13 days from 27 sensors

Experimental Setup Complex activities detection (more than 1

sensor)

Severity threshold THS = 0.3 The datasets do not contain failure information

All node failures in the experiments were simulated by modifying the values reported by the “failed” node.

“stuck at” failure: set the value of the failed node to 1 “misplacement” failure: replaced the data of the failed

sensor s with data from a sensor located at s’ new position


Results Detecting sensor node failures

Kitchen -> living room

Results Node failure severity assessment

WSU: 1/8, HA: 3/8, HB: 3/7 significant sensors

Results Evaluate SMART’s impact on the MTTF of the

application MTTF: number of time units after which the

detection accuracy falls below THS

Unlike the baseline, our approach determines that the application has filed not when the first node fails, but when the first severe node failure occurs.

Results Detailed view of the MTTF for prepare breakfast from the

WSU house.

Results Maintaining high activity recognition accuracy

under failures Compare to a classifier trained on all nodes in the

system, our approach achieves higher activity recognition accuracy in the presence of node failures

Results


Discussion Effect of node use on the failure detection

accuracy The failure detection accuracy of a node v.s. How frequently this node is used for a particular

activity node usage ratio per activity

The percentage of instances of that activity where node n is was used

A positive correlation between the importance of that node for the activity and how accurately we can detect the node’s failure

Discussion Effect of node use on the failure detection

accuracy The accuracy of detecting a “stuck at” failure for

the important nodes for the kitchen activities in House A.

When only the important nodes are considered, the failure detection accuracy increases dramatically

Discussion Limitations and future work

SMART cannot accurately detect the failures of sensors that are not frequently used in any of the activities.

SMART can be combined with state of the art health-monitoring systems, which can accurately detect fail-stop failures experienced by the rarely used nodes.

Assumption made: single-node failures not multiple failures

Plan to analyze how SMART’s failure detection accuracy is affected by multiple-node failures.


Conclusions SMART: a general failure detection,

assessment, and adaptation approach for smart home applications Decreases the number of maintenance dispatches

by 55% Triples the MTTF of the application on average Maintains sufficient activity recognition accuracy

in the presence of failures by dynamically updating the classifiers at runtime with over 85% accuracy

Improves the activity recognition accuracy under node failures by more than 15% on average.

Documents

Being SMART About Failures: Assessing Repairs in Smart Homes