Predicting and Bypassing End-to-End Internet Service Degradation

Anat Bremler-Barr Edith Cohen Haim Kaplan Yishay Mansour

Tel-Aviv University AT&T Labs Tel-Aviv University

Talk

Omer Ben-Shalom, Tel-Aviv University

Outline:

• Degradation: deviation from “normal” (minimum) RTT.

• Predicting Degradation: different predictors

• Performance Evaluation: precision/recall methodology

• Suggested Application: Gateway selection

Motivating Application

[Figure: Intelligent Routing device with peering links; labeled ASes: AS 12, AS 41, AS 56, AS 123]

• Gateway selection (Intelligent Routing device)

• Choosing peering links

Data and Measurements: Sources

• Base measurements from 4 different locations (ASes), simulating 4 gateways:

– California (CA): AT&T (CA1) + ACIRI (CA2)

– New Jersey (NJ): AT&T (NJ1) + Princeton (NJ2)

Data and Measurements: Destinations

• Obtaining a representative set of web servers + weights (derived from proxy logs)

Data and Measurements: RTT

• Data: weekly RTT (SYN), end to end (path + server)

– Hourly measurements: 35,124 servers

– Once-a-minute measurements: weighted sample of 100 servers

Degradation: Definition

• Deviation from minimum recorded RTT (propagation delay)

• Discrete degradation levels 1-6.

Level   Time (ms)
1       +50
2       +100
3       +200
4       +400
5       +800
6       +1600
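The level definition can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes that a level is reached when the deviation from the minimum RTT is at least the listed threshold (the slides do not say whether the comparison is inclusive).

```python
# Thresholds in ms above the minimum recorded RTT, for levels 1..6 (from the list above).
LEVEL_THRESHOLDS_MS = [50, 100, 200, 400, 800, 1600]


def degradation_level(rtt_ms, min_rtt_ms):
    """Return the highest level whose threshold the deviation reaches (0 = no degradation).

    Assumption: the comparison is inclusive (deviation >= threshold).
    """
    deviation = rtt_ms - min_rtt_ms
    level = 0
    for i, threshold in enumerate(LEVEL_THRESHOLDS_MS, start=1):
        if deviation >= threshold:
            level = i
    return level


def is_degraded(rtt_ms, min_rtt_ms, level):
    """True if the measurement counts as a failure for the given degradation level."""
    return degradation_level(rtt_ms, min_rtt_ms) >= level
```

For example, with a minimum RTT of 80 ms, a 300 ms measurement deviates by 220 ms and therefore counts as a level-3 degradation but not a level-4 one.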

Objective: Avoiding degradation?

• Attempt to reroute through a different gateway

• Two conditions have to hold

– Need to be able to predict the failure from a gateway

– Need to have a substitute gateway (low correlation between gateways)

• Blackout (consecutive degradation) through one gateway

Blackout durations

• The longer the duration, the easier it is to predict.

• The majority of blackouts are short: 1-3 consecutive points.

• However, a considerable fraction occurs in longer durations.

[Figure label: Long duration blackout]
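Reading a blackout as a maximal run of consecutive degraded points through one gateway, the durations could be extracted roughly as follows. The function name and the input format (one failure flag per measurement point) are assumptions made for this sketch.

```python
from itertools import groupby


def blackout_durations(failures):
    """Lengths of maximal runs of consecutive failures.

    `failures` is a sequence of truthy/falsy flags, one per measurement point
    (hypothetical input format).
    """
    return [
        sum(1 for _ in run)
        for failed, run in groupby(failures, key=bool)
        if failed
    ]
```

A histogram of the returned lengths gives the distribution of blackout durations discussed above.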

Gateways Correlation

• Gateways are correlated but often the correlation is not too strong

Gateways Correlation

• Longer blackouts are more likely to be shared

– failure closer to the server

• The majority of 2-gateway blackouts involved same-coast pairs

Building predictors

• For a given degradation level l.

• Prediction per IP.

• Input: Previous RTT Measurements for the IP-address.

• Output: probability for a failure

• Predict “failure” if probability > Ф

Precision / Recall Methodology

[Figure: overlap of the "Predicted degraded" and "Actual degraded" sets]

\[ \text{Precision} = \frac{|\text{Actual degraded} \cap \text{Predicted degraded}|}{|\text{Predicted degraded}|} \]

\[ \text{Recall} = \frac{|\text{Actual degraded} \cap \text{Predicted degraded}|}{|\text{Actual degraded}|} \]
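As a small illustration, precision and recall can be computed from two aligned sequences of flags. The function name and the choice to return `None` when a ratio is undefined are assumptions of this sketch.

```python
def precision_recall(predicted, actual):
    """Precision and recall for aligned sequences of predicted/actual degradation flags.

    Returns (precision, recall); a value is None when its denominator is zero.
    """
    both = sum(1 for p, a in zip(predicted, actual) if p and a)
    n_predicted = sum(1 for p in predicted if p)
    n_actual = sum(1 for a in actual if a)
    precision = both / n_predicted if n_predicted else None
    recall = both / n_actual if n_actual else None
    return precision, recall
```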

Precision-recall curve

• Sweep the threshold Ф in [0,1] to obtain a precision-recall curve.

• In other words, let P(t) be the predicted failure probability at time t:

\[ \text{precision}(\Phi) = \Pr[\,\text{failure at time } t \mid P(t) > \Phi\,] \]

\[ \text{recall}(\Phi) = \Pr[\,P(t) > \Phi \mid \text{failure at time } t\,] \]
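The threshold sweep can be sketched as below, estimating both probabilities empirically from a sequence of predicted probabilities and observed failures. The function name, the input format, and the decision to skip thresholds where a ratio is undefined are assumptions of this sketch.

```python
def pr_curve(probabilities, failures, thresholds):
    """Empirical precision-recall points for a sweep of the threshold Phi.

    probabilities[t] is the predicted failure probability P(t);
    failures[t] is truthy if a failure was actually observed at time t.
    Returns a list of (phi, precision, recall) tuples.
    """
    n_failures = sum(1 for f in failures if f)
    points = []
    for phi in thresholds:
        predicted = [p > phi for p in probabilities]
        n_predicted = sum(predicted)
        hits = sum(1 for pred, f in zip(predicted, failures) if pred and f)
        if n_predicted == 0 or n_failures == 0:
            continue  # precision or recall is undefined at this threshold
        points.append((phi, hits / n_predicted, hits / n_failures))
    return points
```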

What is important for prediction?

• Recency principle: the more recent RTTs are more important.

• Quantity principle: the more measurements, the higher the accuracy.

Recency Principle: Importance

• Test case: single-measurement predictor

– predict according to the measurement from x minutes ago.

– observe the change in the quality of the prediction.

• 15% difference between using the measurement from the last minute and the one from 15 minutes ago.

Minutes ago   NJ-2, failure level 6   NJ-1, failure level 3
              recall (= precision)    recall (= precision)
1             0.33                    0.52
2             0.31                    0.49
4             0.29                    0.48
7             0.28                    0.46
10            0.27                    0.45
15            0.26                    0.44

Quantity Principle: Importance

• Test case: Fixed-Window-Count (FWC)

– the prediction is the fraction of failures in the W most recent measurements

• By using more measurements we can achieve better precision at high recall.

[Figure legend: FWC 1, FWC 5, FWC 10, FWC 50]
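A minimal sketch of an FWC predictor follows. The class name and the handling of the start-up phase (dividing by the number of measurements seen so far, and returning 0.0 before any measurement) are assumptions; the slides only define the steady-state behavior.

```python
from collections import deque


class FixedWindowCount:
    """Predicts the failure probability as the fraction of failures
    among the W most recent measurements."""

    def __init__(self, window):
        self.recent = deque(maxlen=window)  # old entries drop out automatically

    def update(self, failed):
        self.recent.append(1 if failed else 0)

    def predict(self):
        if not self.recent:
            return 0.0  # assumption: no history means no predicted failure
        return sum(self.recent) / len(self.recent)
```

FWC 1 corresponds to `FixedWindowCount(1)`, which simply repeats the last observation.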

Our predictors

– Exponential Decay

– Polynomial Decay

– Model-based predictors:

• VW-Cover: Variable Window Cover algorithm

• HMM: Hidden Markov Model

Exponential-decay predictors

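• The weight of each measurement decreases exponentially with its age, by a factor λ.

• Binary variable f_t represents a failure at time t.

• For consecutive measurements:

\[ \mathrm{ExpDecay}(t) = (1-\lambda)\, f_t + \lambda\, \mathrm{ExpDecay}(t-1) \]

• In general, over the measurement history H_t:

\[ \mathrm{ExpDecay}(t) = \frac{\sum_{t' \in H_t} \lambda^{\,t-t'} f_{t'}}{\sum_{t' \in H_t} \lambda^{\,t-t'}} \]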
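For consecutive measurements the predictor needs only one number of state. The sketch below assumes the recurrence ExpDecay(t) = (1 − λ) f_t + λ ExpDecay(t − 1), which is how the slide's formula is read here; the class name and the initial value of 0 are assumptions.

```python
class ExpDecay:
    """Exponentially decayed failure estimate for consecutive measurements."""

    def __init__(self, lam, initial=0.0):
        self.lam = lam        # decay factor, e.g. 0.95 or 0.99
        self.value = initial  # assumption: start with no predicted failure

    def update(self, failed):
        """Fold in the newest measurement and return the updated estimate."""
        f_t = 1.0 if failed else 0.0
        self.value = (1.0 - self.lam) * f_t + self.lam * self.value
        return self.value
```

A larger λ gives a longer memory: with λ = 0.99 a single failure moves the estimate by only 0.01, while with λ = 0.95 it moves it by 0.05.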

Polynomial-decay predictors

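• Each measurement in the history H_t is weighted by a power of its age (exponent α):

\[ \mathrm{PolyDecay}(t) = \frac{\sum_{t' \in H_t} f_{t'}\,(t-t')^{-\alpha}}{\sum_{t' \in H_t} (t-t')^{-\alpha}} \]

• Exact computation requires maintaining the complete history.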

• We approximated it.
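For comparison, here is a sketch of the exact computation over a stored history, not the approximation used in the talk (which the slides do not describe). It assumes weights of the form (t − t')^(−α), uses only measurements strictly before the prediction time so that every age is positive, and the function name and default α are hypothetical.

```python
def poly_decay(history, now, alpha=1.0):
    """Exact polynomially decayed failure estimate.

    history: iterable of (time, failed) pairs (hypothetical format).
    Only measurements with time < now contribute, so each age is positive.
    """
    numerator = 0.0
    denominator = 0.0
    for t_prime, failed in history:
        age = now - t_prime
        if age <= 0:
            continue  # skip anything not strictly in the past
        weight = age ** (-alpha)
        numerator += weight * (1.0 if failed else 0.0)
        denominator += weight
    return numerator / denominator if denominator else 0.0
```

The loop touches every stored measurement on each prediction, which illustrates why an exact computation needs the complete history.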

The VW-Cover predictor

• Consists of a list of pairs

(a1, b1), (a2, b2), …, (an, bn)

• Predict a failure if there exists an i such that there are at least bi failures among the previous ai measurements
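The prediction rule translates almost directly into code. The function name and the input format (a list of past failure flags, most recent last) are assumptions of this sketch, and every window length is assumed to be at least 1.

```python
def vw_cover_predict(pairs, history):
    """Predict a failure if, for some pair (a, b), at least b of the
    previous a measurements were failures.

    pairs: list of (a, b) with a >= 1; history: failure flags, most recent last.
    """
    for a, b in pairs:
        recent_failures = sum(1 for f in history[-a:] if f)
        if recent_failures >= b:
            return True
    return False
```

With pairs `[(1, 1), (5, 3)]`, a failure is predicted either right after a failure or when 3 of the last 5 measurements failed.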

VW-Cover predictor: Building

• Build the predictor greedily to cover the failures.

• Use a learning set of measurements

– Pick (a1, b1) to be the pair which maximizes precision

– Pick (ai, bi) to be the pair which maximizes precision among uncovered failures
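One possible reading of the greedy construction is sketched below. Several details are assumptions not stated on the slide: the candidate pairs are all (a, b) with 1 ≤ b ≤ a ≤ `max_window`, precision is measured only on points not yet flagged by an earlier pair, ties are broken by the number of failures covered, and the loop stops at a hypothetical `min_precision` or `max_pairs`.

```python
def build_vw_cover(failures, max_window=5, min_precision=0.5, max_pairs=10):
    """Greedily choose (a, b) pairs on a learning sequence of failure flags."""
    n = len(failures)
    prefix = [0]  # prefix[t] = number of failures among the first t measurements
    for f in failures:
        prefix.append(prefix[-1] + (1 if f else 0))

    def fires(a, b, t):
        # At least b failures among the a measurements preceding time t.
        return t >= a and prefix[t] - prefix[t - a] >= b

    candidates = [(a, b) for a in range(1, max_window + 1) for b in range(1, a + 1)]
    uncovered = set(range(max_window, n))  # points not flagged by any chosen pair
    chosen = []
    while len(chosen) < max_pairs:
        best = None
        for a, b in candidates:
            flagged = [t for t in uncovered if fires(a, b, t)]
            hits = sum(1 for t in flagged if failures[t])
            if hits == 0:
                continue  # this pair covers no remaining failure
            score = (hits / len(flagged), hits)  # (precision, failures covered)
            if best is None or score > best[0]:
                best = (score, (a, b), flagged)
        if best is None or best[0][0] < min_precision:
            break
        chosen.append(best[1])
        uncovered -= set(best[2])
    return chosen
```

The returned list can be used directly as the list of pairs of a VW-Cover predictor.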

Hidden Markov Model

• Finite set of states S (we use 3 states)

• Output probabilities a_s(0), a_s(1)

• Transition function: determines the probability distribution of the next state.

• The probability of a failure:

\[ \mathrm{HMM}(t) = \sum_{s \in S} a_s(1)\, p_s(t) \]

where p_s(t) is the probability of being in state s at time t. p_s(t) is updated according to the output at time t-1.
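The prediction and update steps can be sketched as a standard HMM filter. This assumes the failure probability is the sum over states of a_s(1) p_s(t), and that the state distribution is first conditioned on the observed output and then pushed through the transition function. The class name and all parameter values are hypothetical; how the model parameters are learned is not covered here.

```python
class HmmPredictor:
    """Failure prediction with a small hidden Markov model."""

    def __init__(self, transition, fail_prob, initial):
        self.transition = transition  # transition[s][s2] = Pr[next state s2 | state s]
        self.fail_prob = fail_prob    # fail_prob[s] = a_s(1); a_s(0) = 1 - a_s(1)
        self.p = list(initial)        # p[s] = probability of being in state s now

    def predict(self):
        """Probability of a failure: sum over states of a_s(1) * p_s(t)."""
        return sum(a1 * ps for a1, ps in zip(self.fail_prob, self.p))

    def update(self, failed):
        """Condition on the observed output, then advance one time step."""
        likelihood = [a1 if failed else 1.0 - a1 for a1 in self.fail_prob]
        posterior = [l * ps for l, ps in zip(likelihood, self.p)]
        total = sum(posterior)
        if total > 0.0:
            posterior = [x / total for x in posterior]
        else:
            posterior = list(self.p)  # observation impossible under the model
        n = len(self.p)
        self.p = [
            sum(posterior[s] * self.transition[s][s2] for s in range(n))
            for s2 in range(n)
        ]
```

With three states one might, for instance, think of them as "good", "flaky", and "bad", each with its own failure probability, but that interpretation is an illustration only.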

Experimental Evaluation

Predictor Performance – Level 3

• At recall 0.5, precision is close to 0.9.

[Figure legend: FWC 10, FWC 50, ExpDecay 0.99, ExpDecay 0.95, VW-Cover, HMM]

Predictor Performance – Level 6

• Level-6 degradations are harder to predict: at recall 0.5, precision is 0.4.

[Figure legend: FWC 10, FWC 50, ExpDecay 0.99, ExpDecay 0.95, VW-Cover, HMM]

Predictor Performance: Conclusion

• The best predictors at levels 3 and 6 are VW-Cover and HMM

• But they only slightly outperform ExpDecay 0.95, which is considerably simpler to implement

Gateway Selection

Level   Best Gateway   Worst Gateway   Optimal   ExpDecay 0.95   VW-Cover   Static: IP Gateway
6       1.15%          3.29%           0.08%     0.52%           0.49%      0.86%
3       3.45%          5.77%           0.45%     1.56%           1.50%      2.41%

Gateway Selection: Conclusion

• Active gateway selection resulted in a 50% reduction in the degradation rate with respect to the best single gateway.

• Static gateway selection can avoid at most 25% of degradations.

• Again, ExpDecay 0.95 only slightly underperforms the best predictor (VW-Cover).
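The slides do not spell out the selection rule, so the following is only one plausible policy: for each destination, route through the gateway with the lowest predicted failure probability, and stay on the default gateway unless another one is strictly better. The function name, the input format, and the gateway labels in the test are hypothetical.

```python
def select_gateway(predicted_failure, default):
    """Pick a gateway for one destination.

    predicted_failure: dict mapping gateway name -> predicted failure probability.
    default: the gateway used when no other gateway is strictly better.
    """
    best = min(predicted_failure, key=predicted_failure.get)
    # Avoid needless switching: keep the default on ties.
    if predicted_failure.get(default, float("inf")) <= predicted_failure[best]:
        return default
    return best
```

Any of the predictors above could supply the per-gateway probabilities.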

Performance of gateway selection as a function of recency

Correlation between coasts

• Gateway selection on a same-coast pair resulted in only a 10% reduction. Choose independent gateways.

Level   NJ-2 + NJ-1                     CA-2 + NJ-2
        Best gateway   Best predictor   Best gateway   Best predictor
6       1.15%          1.05%            1.15%          0.54%
3       3.45%          3.05%            3.45%          1.78%

Controlling prediction overhead

• Type of measurements:

– Active measurements: initiate probes (SYN, ping, HTTP request). Scalability problem.

– Passive measurements: collected on regular traffic.

• Controlling the prediction overhead:

– Using less-recent measurements.

– Active measurements only to a small set of destinations, which cover the majority of traffic.

– Cluster destinations. The measurements of one destination can be used to predict another.

Questions??
