Page 1

Quantifying Effectiveness of Failure Prediction and Response in HPC Systems: Methodology and Example

Jackson Mayo, James Brandt, Frank Chen, Vincent De Sapio, Ann Gentile, Philippe Pébay,

Diana Roe, David Thompson, and Matthew Wong

Sandia National Laboratories, Livermore, CA

28 June 2010

Workshop on Fault-Tolerance for HPC at Extreme Scale

SAND2010-4169C

Page 2

Acknowledgments

• This work was supported by the U.S. Department of Energy, Office of Defense Programs

• Sandia is a multiprogram laboratory operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy under contract DE-AC04-94AL85000

Page 3

Overview

• OVIS project goals and techniques

• Considerations for evaluating HPC failure prediction

• Example failure mode and predictor

• Example quantification of predictor effectiveness

Page 4

The OVIS project aims to discover HPC failure predictors

• Probabilistic failure prediction can enable smarter resource management and checkpointing, and extend HPC scaling

• Challenges have limited progress on failure prediction
– Complex interactions among resources and environment

– Scaling of data analysis to millions of observables

– Relative sparsity of data on failures and causes

– Need for actionable, cost-effective predictors

• OVIS (http://ovis.ca.sandia.gov) is open-source software for exploration and monitoring of large-scale data streams, e.g., from HPC sensors

Page 5

The OVIS project aims to discover HPC failure predictors

– Robust, scalable infrastructure for data collection/analysis

Page 6

The OVIS project aims to discover HPC failure predictors

– Analysis engines that learn statistical models from data and monitor for outliers (potential failure predictors)

Analysis engine types shown: correlative, Bayesian, graph clustering (a simplified outlier-flagging sketch follows)
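The engines named above (correlative, Bayesian, graph clustering) are the real OVIS analyses; the snippet below is only a minimal sketch of the underlying pattern (learn a statistical model of nominal behavior, then flag outliers as potential failure predictors), using an assumed univariate Gaussian model and an illustrative 3-sigma cutoff rather than anything from the OVIS codebase.

```python
import statistics

def learn_model(samples):
    """Fit a simple Gaussian model (mean, standard deviation) to nominal readings."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_outlier(value, model, n_sigma=3.0):
    """Flag a reading as an outlier (a potential failure predictor) if it lies
    more than n_sigma standard deviations from the learned mean."""
    mean, stdev = model
    return abs(value - mean) > n_sigma * stdev

# Illustrative use: learn from historical per-node temperatures, monitor new readings.
history = [41.0, 42.5, 40.8, 43.1, 41.9, 42.2, 40.5, 42.8]
model = learn_model(history)
print(is_outlier(55.0, model))  # True: this node is flagged for attention
```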

Page 7

The OVIS project aims to discover HPC failure predictors

– Flexible user interface for data exploration

[OVIS user interface screenshot: numeric data view, analysis pane, model drop onto physical visualization, log file search, physical visualization]

Page 8

Evaluation of HPC failure prediction confronts several challenges

• Lack of plausible failure predictors
– Some previous studies focused on possible responses without reference to a predictor

• Lack of response cost information
– Diverse costs may need to be estimated (downtime, hardware, labor)

• Complex temporal features of prediction and response
– Cost of action or inaction depends on prediction timing

– Response to an alarm (e.g., hardware replacement) can alter subsequent events

– Historical data do not fully reveal what would have happened if alarms had been acted upon

Page 9

Two general approaches offer metrics for prediction effectiveness

• Correlation of predictor with failure
– Consider predictor as a classifier that converts available observations into a statement about future behavior (a minimal sketch of this framing follows this list)

– Simplest case: for a specific component and time frame, classifier predicts failure or non-failure (binary classification)

– Use established metrics for classifier performance

• Cost-benefit of response driven by predictor
– Use historical data to estimate costs of acting on predictions

– More stringent test because even a better-than-chance predictor may not be worth acting on

– Requires choice of response and understanding of its impact on the system; may be relatable to classifier metrics
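As a concrete illustration of the binary-classification framing mentioned above, the sketch below tallies hypothetical alarms against hypothetical job outcomes; none of the data or names come from the study.

```python
def confusion_matrix(predictions, outcomes):
    """Tally binary predictions (True = failure alarmed) against observed
    outcomes (True = failure occurred), one entry per component and time frame."""
    tp = sum(p and o for p, o in zip(predictions, outcomes))          # alarmed, failed
    fp = sum(p and not o for p, o in zip(predictions, outcomes))      # alarmed, did not fail
    fn = sum(not p and o for p, o in zip(predictions, outcomes))      # missed failure
    tn = sum(not p and not o for p, o in zip(predictions, outcomes))  # correctly quiet
    return tp, fp, fn, tn

# Hypothetical alarms and outcomes, one entry per job/time frame.
alarms   = [True, False, True, False, False, True]
failures = [True, False, False, False, True, True]
print(confusion_matrix(alarms, failures))  # (2, 1, 1, 2)
```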

Page 10

Classifier metrics assess ability to predict failure

• Classifiers have been analyzed for signal detection, medical diagnostics, and machine learning

• Basic construct is “receiver operating characteristic” (ROC) curve
– Binary classifiers have an adjustable threshold separating the two possible predictions

– Interpretation in OVIS: How extreme an outlier is alarmed?

– Sweeping this threshold generates a tradeoff curve between false positives and false negatives (see the sketch below)

• Statistical significance of predictor can be measured
• Any definition of failure/non-failure can be used, but one motivated by costs is most relevant
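A minimal sketch of the threshold sweep, assuming the alarm fires when an observed score (for example, idle-time memory usage) meets or exceeds the threshold; the scores and outcomes are hypothetical.

```python
def roc_points(scores, failed):
    """Sweep the alarm threshold over all observed scores and return
    (false positive rate, true positive rate) points for the tradeoff curve."""
    pos = sum(failed)            # jobs that actually failed
    neg = len(failed) - pos      # jobs that completed
    points = []
    for thr in sorted(set(scores)) + [float("inf")]:
        alarms = [s >= thr for s in scores]
        tp = sum(a and f for a, f in zip(alarms, failed))
        fp = sum(a and not f for a, f in zip(alarms, failed))
        points.append((fp / neg, tp / pos))
    return sorted(points)

# Hypothetical per-job scores (e.g., peak idle-time memory usage) and outcomes.
scores = [0.2, 0.9, 0.4, 0.8, 0.3, 0.7]
failed = [False, True, False, True, False, False]
print(roc_points(scores, failed))  # (0, 0) at "never alarm" up to (1, 1) at "always alarm"
```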

Page 11

Cost metrics assess ability to benefit from failure prediction

• Given a predictor and a response, evaluate the net cost of using them versus not (or versus others)
– Historical data alone may not answer this counterfactual

– Alternatives are real-world trials and dynamical models

• Classifier thresholds are subject to cost optimization
• ROC curves allow reading off simple cost functions: constant cost per false positive and per false negative
• Realistic costs may not match such binary labels

– Is the cost-benefit of an alarm really governed by whether a failure occurs in the next N minutes?

• If costs are available for each historical event, they can be used to optimize thresholds directly, as in the sketch below
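A sketch of that direct cost-based threshold selection, assuming a fixed cost per false alarm and that acting on a correct alarm averts the recorded failure cost; all names and numbers are illustrative, not from the study.

```python
def best_threshold(events, false_alarm_cost):
    """Choose the alarm threshold minimizing total historical cost.
    Each event is (score, miss_cost): miss_cost is what an unalarmed failure
    would have cost; it is 0 for jobs that completed normally."""
    best = (float("inf"), None)
    for thr in sorted({score for score, _ in events}) + [float("inf")]:
        cost = 0.0
        for score, miss_cost in events:
            if score >= thr:
                # Alarm raised: assume the response averts the failure, so the
                # only cost incurred is for alarms on jobs that were fine anyway.
                cost += false_alarm_cost if miss_cost == 0 else 0.0
            else:
                # No alarm: a failure that would have occurred goes unmitigated.
                cost += miss_cost
        best = min(best, (cost, thr))
    return best  # (minimum total cost, threshold achieving it)

# Hypothetical (score, cost-if-missed) history; each false alarm costs 1 unit.
history = [(0.9, 50.0), (0.2, 0.0), (0.7, 0.0), (0.8, 20.0), (0.3, 0.0)]
print(best_threshold(history, false_alarm_cost=1.0))  # (0.0, 0.8)
```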

Page 12

Real-world failure predictor illustrates evaluation issues

• Out of memory (OOM) condition has been a cause of job failure on Sandia’s Glory cluster

• Failure predictor: abnormally high memory usage during idle time (detectable > 2 hours before failure)
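A minimal sketch of how such a predictor might be applied, assuming per-node memory-usage samples taken during idle time; the node names, data layout, and threshold are illustrative assumptions, not the production monitoring setup.

```python
def flag_suspect_nodes(idle_memory_usage, threshold=0.5):
    """Return nodes whose memory usage during idle time (as a fraction of
    total RAM) exceeds the threshold: the candidate OOM failure predictor."""
    return [node for node, usage in idle_memory_usage.items() if usage > threshold]

# Hypothetical idle-time memory usage by node, as a fraction of total RAM.
idle_memory_usage = {"node012": 0.08, "node047": 0.61, "node053": 0.07}
print(flag_suspect_nodes(idle_memory_usage))  # ['node047']
```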

Page 13

Real-world failure predictor illustrates evaluation issues

What is failure? Jobs terminate abnormally before system failure event(s)

Attribution? Event is far from the indicator/cause

Cost-benefit and ramifications? How to evaluate cost-benefit for a given action? What are the ramifications of a given action/inaction on a live system where playback is impossible?

Page 14

Definitions and assumptions allow example quantification of OOM predictor

• Classifier predicts whether a job will terminate as COMPLETED (non-failure) or otherwise (failure)

• Failure is predicted if memory usage (MU) on any job node during preceding idle time exceeds threshold

• Response is rebooting of any node with excess MU during idle time, thus clearing memory

• Cost of rebooting is 90 CPU-seconds
– Does not include cycling wear or effect on job scheduling

• If a job failed, rebooting its highest-MU node during the preceding idle time would have saved it
– Credit given for total CPU-hours of failed job
– Unrealistic assumption because not all failures are OOM
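Under these stated assumptions (90 CPU-seconds per reboot, and the full CPU-hours of a saved failed job credited as benefit), the net benefit at a given MU threshold could be tallied as in the sketch below; the job records are hypothetical.

```python
REBOOT_COST_CPU_HOURS = 90.0 / 3600.0  # stated assumption: 90 CPU-seconds per reboot

def net_benefit(jobs, mu_threshold):
    """Net benefit in CPU-hours of rebooting any node whose idle-time memory
    usage exceeded mu_threshold, under the stated simplifying assumptions.
    Each job record is (peak_idle_mu, failed, cpu_hours)."""
    benefit = 0.0
    for peak_idle_mu, failed, cpu_hours in jobs:
        if peak_idle_mu > mu_threshold:
            benefit -= REBOOT_COST_CPU_HOURS  # the reboot we would have performed
            if failed:
                benefit += cpu_hours          # assumed to have saved the failed job
    return benefit

# Hypothetical job history: (peak idle-time MU fraction, failed?, CPU-hours used).
jobs = [(0.7, True, 512.0), (0.1, False, 64.0), (0.6, False, 128.0)]
print(net_benefit(jobs, mu_threshold=0.5))  # about 511.95 CPU-hours
```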

Page 15

Example ROC curve measures prediction accuracy

• Predictions of job failure/non-failure are evaluated for various MU thresholds

• ROC curve shows better-than-chance accuracy
– Area under curve is 0.562 vs. 0.5 for chance
– Statistical significance (p ~ 0.001) via comparison to synthetic data with no MU-failure correlation (see the sketch below)

• Validates ability to predict failure in this system

[ROC curve figure: false positives vs. false negatives, swept from the lowest threshold (always alarm) to the highest threshold (never alarm)]
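The quoted significance (p ~ 0.001) comes from comparing the observed area under the curve against synthetic data with no MU-failure correlation. One common way to construct such a comparison is a label-permutation test, sketched below with hypothetical data; the study's exact synthetic-data construction may differ.

```python
import random

def auc(scores, failed):
    """Area under the ROC curve: the probability that a randomly chosen failed
    job has a higher score than a randomly chosen completed job."""
    pos = [s for s, f in zip(scores, failed) if f]
    neg = [s for s, f in zip(scores, failed) if not f]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_p_value(scores, failed, trials=1000):
    """Fraction of label shufflings (no MU-failure correlation by construction)
    that reach an AUC at least as large as the observed one."""
    observed = auc(scores, failed)
    labels = list(failed)
    hits = 0
    for _ in range(trials):
        random.shuffle(labels)
        if auc(scores, labels) >= observed:
            hits += 1
    return hits / trials

# Hypothetical MU scores and job outcomes (True = job did not complete).
scores = [0.9, 0.2, 0.8, 0.4, 0.7, 0.3, 0.6, 0.1]
failed = [True, False, True, False, False, False, True, False]
print(auc(scores, failed), permutation_p_value(scores, failed))
```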

Page 16

Example net-benefit curve measures response effectiveness

• With stated assumptions, net benefit (saved jobs minus rebooting time) is monotonic with threshold
– Rebooting cost is negligible
– Routine rebooting is optimal

• More realistic treatment would reduce net benefit
– Not all failed jobs were OOM or could be saved
– Additional rebooting costs
• Curve bent 80/20: smart reboot has potential value (see the sketch below)

[Net-benefit curve figure: swept from the lowest threshold (always reboot) to the highest threshold (never reboot); roughly 80% of the benefit comes from 20% of the responses]
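The 80/20 bend can be read off by ordering responses from the highest idle-time MU downward (equivalently, lowering the threshold) and accumulating benefit, as in this sketch with hypothetical numbers.

```python
def fraction_of_responses_needed(jobs, target_benefit_fraction=0.8):
    """Fraction of responses (reboots) needed to capture a target fraction of
    the total benefit, taking responses in order of decreasing idle-time MU,
    i.e., lowering the alarm threshold step by step.
    Each job is (peak_idle_mu, benefit_if_rebooted)."""
    ordered = sorted(jobs, key=lambda job: job[0], reverse=True)
    total = sum(benefit for _, benefit in ordered)
    running = 0.0
    for count, (_, benefit) in enumerate(ordered, start=1):
        running += benefit
        if running >= target_benefit_fraction * total:
            return count / len(ordered)
    return 1.0

# Hypothetical (peak idle-time MU, CPU-hours saved if rebooted) pairs.
jobs = [(0.9, 500.0), (0.8, 300.0), (0.4, 50.0), (0.3, 60.0), (0.2, 40.0), (0.1, 50.0)]
print(fraction_of_responses_needed(jobs))  # ~0.33: a third of reboots give 80% of the benefit
```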

Page 17

Conclusion

• HPC failure prediction is a valuable ingredient to improve resilience and thus scaling of applications
– System complexity makes predictors difficult to discover

• When a potential failure predictor is identified, quantification of effectiveness is challenging in itself
– Classifier metrics evaluate correlation between predictor and failure, but do not account for feasibility/cost of response

– Assessing practical value of predictor involves a response’s cost and impact on the system (often not fully understood)

• At least one predictor is known (idle memory usage)
– Evaluation methodology applied to this example confirms predictivity and suggests benefit from reboot response

http://ovis.ca.sandia.gov [email protected]