
Improving Service Availability of Cloud Systems by Predicting Disk Error

Yong Xu*, Kaixin Sui, Qingwei Lin, Keceng Jiang, Wenchi Zhang, Jian-Guang Lou, Dongmei Zhang (Microsoft Research Asia, Beijing, China)

Randolph Yao, Yingnong Dang, Murali Chintalapati (Microsoft Azure, Redmond, USA)

Hongyu Zhang (The University of Newcastle, Australia)

Peng Li (Nankai University, China)

USENIX ATC, July 12, 2018


Motivation – Towards High Cloud Service Availability

Target: > 99.999% availability, i.e., less than 26 seconds of downtime per month.

High availability remains one of the top priorities of cloud systems.

Motivation – Impact of Disk Error on Cloud Service Availability

• Hardware issues are among the top causes of VM downtime

• Disk errors contribute the most to hardware issues

• Disk errors may result in irreversible data loss

Unplanned VM downtime is highly painful to customers.


Goal

[Figure: disks range from HEALTHY through RISKY to UNHEALTHY; new VMs are allocated to predicted healthier disks, while VMs on risky disks are moved by Live Migration.]

Improve VM availability (> 99.999%) by early prediction of disk errors, guiding Live Migration (moving VMs to a healthy node without disconnecting the client or application) and the allocation of new VMs to predicted healthier disks.


State-of-the-art

Prior work builds models that predict complete disk failure from SMART ("Self-Monitoring, Analysis and Reporting Technology") data:

• Statistical: threshold setting

• Unsupervised: clustering, Markov chain

• Supervised classification: SVM, Neural Network, Decision Tree, Random Forest

Published at venues such as FAST, KDD, and USENIX ATC.

Predicting disk errors in industrial settings is difficult.



No real-production adoption has been reported in existing work.

Why is predicting disk errors in real production difficult?

• VM downtime occurs far before complete disk failure

• The existing, cross-validation-guided prediction flow goes wrong

• Training with extremely imbalanced disk health labels is difficult

• …

The proof of the pudding is in the eating: insights beyond laboratory work.



Problem 1 – Predicting complete failure does not help prevent VM downtime

[Figure: VMs go down due to disk errors (I/O latency, VM not responding, etc.) long before the disk fails completely.]

VM downtime occurs far before the complete failure of disks.


Solution – Incorporate system-level features

[Timeline: system-level signals (earlier signals of disk errors) appear first, followed by disk errors (latency, timeout, sector error, etc.), and finally complete failure. Existing models predict complete failure from disk SMART data alone; our prediction model also incorporates the earlier system-level signals.]

System-level signals manifest earlier symptoms of disk errors.
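As a minimal sketch of the idea, the two signal sources can be merged into one feature row per disk. The dict-of-dicts layout and the field names (`reallocated`, `io_timeout`) are invented for illustration; the talk does not publish its feature schema.

```python
def build_features(smart_by_disk, system_by_disk):
    """Merge per-disk SMART counters with per-disk system-level signals
    into one feature row per disk, prefixing each source's fields."""
    rows = []
    for disk_id, smart in smart_by_disk.items():
        system = system_by_disk.get(disk_id, {})  # signals may be missing
        row = {"disk_id": disk_id}
        row.update({f"smart_{k}": v for k, v in smart.items()})
        row.update({f"sys_{k}": v for k, v in system.items()})
        rows.append(row)
    return rows

# Example with invented fields:
rows = build_features({"d1": {"reallocated": 2}},
                      {"d1": {"io_timeout": 5}})
```

Prefixing the fields keeps the two sources distinguishable, so a later feature-selection step can prune either group independently.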



Problem 2 – Cross-validation-guided prediction goes wrong

[Diagram: a model is trained on training data and then applied to prediction data; cross-validation repeatedly re-splits the training data into training and validation folds over several iterations.]

State-of-the-art work performs prediction in a cross-validation-guided way, which is not applicable in a real production scenario.


Problem 2 – Cross-validation-guided prediction goes wrong

[Chart: TPR at FPR = 0.1% drops from 0.86 in CV training to 0.06 in online prediction.]

Experiments show good results under CV-guided evaluation, but the CV-guided model yields poor results in real online prediction.


Problem 2 – Cross-validation-guided prediction goes wrong

[Diagram: an environment change occurs at time t, e.g., Rack 3 encounters an outage at time t; cross-validation mixes data from before and after t between training and validation folds.]

Cross-validation is prone to highlighting features (e.g., those tied to a one-off outage) that are essentially not predictive, because no change, or a different change, occurs in the future.



Fundamentally, the training phase of cross-validation is not applicable to disk error prediction: errors of different disks do not happen independently in complex cloud systems.

Solution – Online-prediction-guided way

[Diagram: feature selection and model validation use only the training data; the validated model is then applied to the later prediction data.]


Solution – Online-prediction-guided way

[Diagram: the training and validation sets are split strictly by time, with validation data coming after training data; a feature tied only to a past change fails on the later validation set and is pruned.]

Strictly separate the training and validation sets by time.
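The time-based split above can be sketched in a few lines; the function name and the sample layout (dicts carrying a `date` field) are assumptions for illustration, not the authors' code.

```python
from datetime import date

def time_based_split(samples, cutoff):
    """Split samples strictly by time: everything before the cutoff date
    trains the model, everything at or after it validates the model.
    Unlike cross-validation folds, no sample from the future ever leaks
    into training, so one-off past events cannot look predictive."""
    train = [s for s in samples if s["date"] < cutoff]
    valid = [s for s in samples if s["date"] >= cutoff]
    return train, valid
```

For example, training on October data and validating on November data mirrors how the model will actually be used: predicting forward in time.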


Cross-validation-guided vs. online-prediction-guided: the online-prediction-guided approach outperforms.



Problem 3 – Extremely imbalanced dataset

Faulty : healthy disks ≈ 3 : 10,000, so a common classifier is prone to predicting every disk as healthy, which yields low recall.

An extremely small portion of faulty samples leads to low recall with a common classification model.
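A tiny synthetic example (numbers chosen to mirror the ~3 : 10,000 ratio; not the paper's data) shows why accuracy is misleading here: a classifier that labels every disk healthy looks nearly perfect yet catches zero faulty disks.

```python
# 1 = faulty, 0 = healthy; ratio mirrors ~3 faulty per 10,000 disks.
labels = [1] * 3 + [0] * 9997
preds = [0] * len(labels)  # trivial "all healthy" classifier

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
recall = tp / sum(labels)  # recall on the faulty class

print(accuracy)  # 0.9997
print(recall)    # 0.0
```

This is what motivates ranking disks by riskiness instead of classifying them outright.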


Rethinking the problem

Ranking instead of Classification


Solution - Cost-sensitive ranking model

[Figure: the ranking model orders disks from predicted worst to predicted healthiest; disks ranked above the cutting point are treated as risky, their VMs are moved by Live Migration, and new VMs are allocated to the predicted healthier disks.]

Best cutting point: r = argmin(Cost = Cost1 × FP + Cost2 × FN)

False predictions, both false positives (FP) and false negatives (FN), incur cost in a real cloud system.


Evaluation

• Dataset

• Real dataset from Azure

• Training: October 2017

• Testing: 3 parts divided from November 2017

• Healthy : faulty disks ≈ 10,000 : 3

• Setup

• Data storage and processing: Microsoft COSMOS

• Ranking algorithm: FastTree, as implemented in Microsoft Azure ML

• Windows Server 2012, Intel CPU E5-4657L v2 @ 2.40 GHz, 1.0 TB memory

• Evaluation metric

• True Positive Rate (TPR) = TP / (TP + FN), measured under a 0.1% False Positive Rate (FPR) = FP / (FP + TN)
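TPR under a fixed FPR budget can be computed by sweeping a threshold down the ranked scores; this is a plain-Python sketch of the metric, not the paper's evaluation code.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.001):
    """Highest TPR achievable while keeping FPR <= max_fpr.
    scores: riskiness scores (higher = more likely faulty);
    labels: 1 = faulty, 0 = healthy."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    best_tpr = 0.0
    for i in order:  # lower the threshold one sample at a time
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        if fp <= max_fpr * neg:  # still within the FPR budget
            best_tpr = tp / pos if pos else 0.0
    return best_tpr
```

At max_fpr = 0.001, only one healthy disk per thousand may be falsely flagged, which is why the reported TPR figures are so sensitive to ranking quality near the top.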


Result

RQ1: How effective is the proposed approach in predicting disk errors?

With Cost1 = 3 and Cost2 = 1, the proposed approach reduces cost by 42.11% compared with Random Forest and by 11.5% compared with SVM.


Result

RQ2: How effective is the proposed online-prediction-guided approach?


Result

RQ3: How effective is the proposed ranking model?


Conclusion

• Pointed out that cross-validation-guided prediction does not work for real online prediction in industrial settings, and developed an online-prediction-guided approach

• Leveraged system-level signals in addition to SMART data for disk fault prediction

• Proposed a ranking model to address the issue of extreme data imbalance

• Deployed to a large-scale industrial cloud system, Microsoft Azure, and significantly improved Azure service availability

