Predicting and Bypassing End-to-End Internet Service Degradation
Anat Bremler-Barr Edith Cohen Haim Kaplan Yishay Mansour
Tel-Aviv University AT&T Labs Tel-Aviv University
Talk: Omer Ben-Shalom, Tel-Aviv University
Outline:
• Degradation
– Deviation from “normal” (minimum) RTT.
• Predicting Degradation
– Different predictors
• Performance Evaluation
– Precision/recall methodology
• Suggested Application: Gateway selection
Motivating Application
• Gateway selection (Intelligent Routing device)
• Choosing peering links
[Figure: an intelligent routing device choosing between peering links; labels: AS 56, AS 123, AS 12, AS 41]
Data and Measurements: Sources
• Base measurements from 4 different locations (ASes), simulating 4 gateways:
– California (CA): AT&T (CA1) + ACIRI (CA2)
– New Jersey (NJ): AT&T (NJ1) + Princeton (NJ2)
Data and Measurements: Destinations
• Obtaining a representative set of web servers, with weights derived from a proxy log
Data and Measurements: RTT
• Data: weekly RTT (SYN) measurements, end to end (path + server)
– Hourly measurements: 35,124 servers
– Once-a-minute measurements: weighted sample of 100 servers
Degradation: Definition
• Deviation from minimum recorded RTT (propagation delay)
• Discrete degradation levels 1-6.
Level   Time (ms)
1       +50
2       +100
3       +200
4       +400
5       +800
6       +1600
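As a minimal sketch of the level definition, the function below maps one RTT measurement to a degradation level using the thresholds +50, +100, +200, +400, +800, and +1600 ms. The names `degradation_level` and `is_failure` are hypothetical, and treating a threshold as reached when the deviation is greater than or equal to it is an assumption; the slides do not say how boundary values are handled.

```python
# Thresholds (ms above the minimum recorded RTT) for levels 1-6.
LEVEL_THRESHOLDS_MS = [50, 100, 200, 400, 800, 1600]


def degradation_level(rtt_ms, min_rtt_ms):
    """Return the highest level whose threshold is reached (0 = no degradation)."""
    deviation = rtt_ms - min_rtt_ms
    level = 0
    for i, threshold in enumerate(LEVEL_THRESHOLDS_MS, start=1):
        # Assumption: level i is reached when the deviation is >= its threshold.
        if deviation >= threshold:
            level = i
    return level


def is_failure(rtt_ms, min_rtt_ms, level):
    """Binary failure indicator for a given degradation level."""
    return degradation_level(rtt_ms, min_rtt_ms) >= level
```

With this convention, a measurement that fails at level 6 also counts as a failure at every lower level.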
Objective: Avoiding degradation?
• Attempt to reroute through a different gateway
• Two conditions have to hold:
– We need to be able to predict the failure from a gateway
– We need to have a substitute gateway (low correlation between gateways)
• Blackout (consecutive degradation) through one gateway
Blackout durations
• The longer the duration, the easier it is to predict.
• The majority of blackouts are short: 1-3 consecutive points.
• However, a considerable fraction have longer durations.
Long duration blackout
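Blackout durations can be extracted from a per-gateway failure sequence by measuring runs of consecutive failures. The sketch below assumes a 0/1 sequence with one entry per measurement point; the function name `blackout_durations` is hypothetical.

```python
from itertools import groupby


def blackout_durations(failures):
    """Lengths of maximal runs of consecutive failures in a 0/1 sequence."""
    return [sum(1 for _ in run) for is_fail, run in groupby(failures) if is_fail]
```

A histogram of the returned lengths gives the split between short (1-3 point) and longer blackouts.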
Gateways Correlation
• Gateways are correlated, but often the correlation is not too strong
Gateways Correlation
• Longer blackouts are more likely to be shared
– the failure is closer to the server
• The majority of 2-gateway blackouts involved same-coast pairs
Building predictors
• For a given degradation level l.
• Prediction per IP.
• Input: Previous RTT Measurements for the IP-address.
• Output: the probability of a failure
• Predict “failure” if probability > Ф
Precision/Recall Methodology
[Figure: overlap of the "Predicted degraded" and "Actual degraded" sets]
$$\text{Precision} = \frac{|\text{Actual degraded} \cap \text{Predicted degraded}|}{|\text{Predicted degraded}|}$$
$$\text{Recall} = \frac{|\text{Actual degraded} \cap \text{Predicted degraded}|}{|\text{Actual degraded}|}$$
Precision-recall curve
• Sweep the threshold Ф in [0,1] to obtain a precision-recall curve.
• In other words, let P(t) be the predicted failure probability at time t:
$$\mathrm{precision}(\Phi) = \Pr[\,\text{failure at time } t \mid P(t) > \Phi\,]$$
$$\mathrm{recall}(\Phi) = \Pr[\,P(t) > \Phi \mid \text{failure at time } t\,]$$
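The threshold sweep can be sketched as follows: a failure is predicted whenever the predicted probability exceeds the threshold, precision is the fraction of predicted failures that were actual failures, and recall is the fraction of actual failures that were predicted. The function names are hypothetical, and returning `None` when a ratio is undefined is an assumption of this sketch.

```python
def precision_recall(probabilities, failures, threshold):
    """Precision and recall when predicting a failure iff probability > threshold."""
    predicted = [p > threshold for p in probabilities]
    true_pos = sum(1 for pred, f in zip(predicted, failures) if pred and f)
    n_predicted = sum(predicted)
    n_actual = sum(1 for f in failures if f)
    # Assumption: an undefined ratio (empty denominator) is reported as None.
    precision = true_pos / n_predicted if n_predicted else None
    recall = true_pos / n_actual if n_actual else None
    return precision, recall


def pr_curve(probabilities, failures, thresholds):
    """One (threshold, precision, recall) triple per threshold."""
    return [(th,) + precision_recall(probabilities, failures, th) for th in thresholds]
```

Calling `pr_curve` with thresholds spread over [0, 1] yields the points of the precision-recall curve.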
What is important for prediction?
• Recency principle
– The more recent RTTs are more important.
• Quantity principle
– The more measurements, the higher the accuracy.
Recency Principle: Importance
• Test case: single-measurement predictor
– Predict according to a measurement taken x minutes ago.
– Observe the change in the quality of the prediction.
• There is a 15% difference between using the measurement from one minute ago and the measurement from 15 minutes ago.

Minutes ago   NJ-2, failure level 6: recall (= precision)   NJ-1, failure level 3: recall (= precision)
1             0.33                                          0.52
2             0.31                                          0.49
4             0.29                                          0.48
7             0.28                                          0.46
10            0.27                                          0.45
15            0.26                                          0.44
Quantity Principle: Importance
• Test case: Fixed-Window-Count (FWC)
– The prediction is the fraction of failures in the W most recent measurements.
• With more measurements we can achieve better precision at high recall.
[Figure legend: FWC 1, FWC 5, FWC 10, FWC 50]
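A minimal sketch of the Fixed-Window-Count predictor, assuming the history is a 0/1 list with the oldest measurement first. The function name `fwc_predict` is hypothetical, and returning 0.0 for an empty history is an assumption.

```python
def fwc_predict(history, window):
    """Fraction of failures among the `window` most recent measurements."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = history[-window:]
    if not recent:
        return 0.0  # assumption: with no history, predict no failure
    return sum(recent) / len(recent)
```

With `window=1` this reduces to repeating the most recent measurement.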
Our predictors
– Exponential Decay
– Polynomial Decay
– Model-based predictors:
• VW-Cover: Variable Window Cover algorithm
• HMM: Hidden Markov Model
Exponential-decay predictors
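• The weight of each measurement decreases exponentially with its age, by a factor λ.
• For consecutive measurements:
$$\mathrm{ExpDecay}_\lambda(t) = (1-\lambda)\, f_t + \lambda\, \mathrm{ExpDecay}_\lambda(t-1)$$
– The binary variable f_t represents a failure at time t.
• In general, with H_t the set of measurement times in the history at time t:
$$\mathrm{ExpDecay}_\lambda(t) = \frac{\sum_{t' \in H_t} \lambda^{t-t'} f_{t'}}{\sum_{t' \in H_t} \lambda^{t-t'}}$$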
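The sketch below shows both forms of the exponential-decay predictor: the one-step recurrence ExpDecay(t) = (1 − λ)·f_t + λ·ExpDecay(t − 1) for consecutive measurements, and the normalized weighted sum for an arbitrary history. The function names and the representation of the history as `(time, failed)` pairs are assumptions made for illustration.

```python
def exp_decay_update(previous, failed, lam):
    """One step of the recurrence for consecutive measurements."""
    return (1.0 - lam) * (1.0 if failed else 0.0) + lam * previous


def exp_decay_general(history, now, lam):
    """Weighted failure fraction; a measurement of age d gets weight lam**d.

    history: iterable of (time, failed) pairs; entries after `now` are ignored.
    """
    num = den = 0.0
    for t, failed in history:
        if t <= now:
            weight = lam ** (now - t)
            num += weight * (1.0 if failed else 0.0)
            den += weight
    # Assumption: with no usable history, predict no failure.
    return num / den if den else 0.0
```

The recurrence needs only one stored value per destination, which is what makes this predictor cheap to maintain.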
Polynomial-decay predictors
$$\mathrm{PolyDecay}_\alpha(t) = \frac{\sum_{t' \in H_t} f_{t'}\,(t-t')^{-\alpha}}{\sum_{t' \in H_t} (t-t')^{-\alpha}}$$
• Exact computation requires maintaining the complete history.
• We approximated it.
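For reference, the exact computation over a complete history can be sketched as below. This is the exact form, not the approximation used in the talk, which the slides do not describe. The weight `age ** (-alpha)`, the name `poly_decay`, and the restriction to measurements strictly older than `now` (to avoid a zero age) are assumptions of this sketch.

```python
def poly_decay(history, now, alpha):
    """Failure fraction where a measurement of age d > 0 gets weight d**(-alpha).

    history: iterable of (time, failed) pairs.
    """
    num = den = 0.0
    for t, failed in history:
        age = now - t
        if age > 0:  # assumption: only measurements strictly before `now` count
            weight = age ** (-alpha)
            num += weight * (1.0 if failed else 0.0)
            den += weight
    # Assumption: with no usable history, predict no failure.
    return num / den if den else 0.0
```

Every call walks the whole history, which illustrates why an approximation is attractive.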
The VW-Cover predictor
• Consists of a list of pairs
(a1, b1), (a2, b2), …, (an, bn)
• Predict a failure if there exists an i such that there are at least bi failures among the previous ai measurements
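The prediction rule can be sketched directly from this definition. The function name `vw_cover_predict` is hypothetical, and the history is assumed to be a 0/1 list with the oldest measurement first and every window size at least 1.

```python
def vw_cover_predict(pairs, history):
    """Predict a failure if some pair (a, b) sees >= b failures in the last a measurements."""
    for a, b in pairs:
        if sum(history[-a:]) >= b:
            return True
    return False
```

For example, the list `[(2, 2), (10, 4)]` predicts a failure after two failures in a row, or after four failures within the last ten measurements.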
VW-Cover predictor: Building
• Build the predictor greedily to cover the failures.
• Use a learning set of measurements
– Pick (a1, b1) to be the pair which maximizes precision
– Pick (ai, bi) to be the pair which maximizes precision among uncovered failures
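One way to read the greedy construction is sketched below. Several details are assumptions, since the slides do not specify them: candidate pairs are limited to windows of at most `max_window`, precision in later rounds is measured only on time points where no earlier pair fires, ties are broken by the number of newly covered failures, and the loop stops when the best precision drops below `min_precision`. All names are hypothetical.

```python
def fires(a, b, failures, t):
    """True if at least b failures occurred among the a measurements before index t."""
    return t >= a and sum(failures[t - a:t]) >= b


def build_vw_cover(failures, max_window, min_precision):
    """Greedily pick (a, b) pairs on a 0/1 learning sequence."""
    uncovered = {t for t, f in enumerate(failures) if f}  # failures not yet predicted
    active = set(range(len(failures)))  # time points where no chosen pair fires yet
    pairs = []
    while uncovered:
        best = None
        for a in range(1, max_window + 1):
            for b in range(1, a + 1):
                hits = [t for t in active if fires(a, b, failures, t)]
                covered = [t for t in hits if t in uncovered]
                if not covered:
                    continue
                key = (len(covered) / len(hits), len(covered))
                if best is None or key > best[0]:
                    best = (key, (a, b), hits)
        # Assumed stopping rule: stop when no pair is precise enough.
        if best is None or best[0][0] < min_precision:
            break
        _, pair, hits = best
        pairs.append(pair)
        active -= set(hits)
        uncovered -= set(hits)
    return pairs
```

Each round covers at least one new failure, so the loop terminates; failures that no sufficiently precise pair can cover are left unpredicted.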
Hidden Markov Model
• A finite set of states S (we use 3 states)
• Output probabilities a_s(0), a_s(1) for each state s
• A transition function determines the probability distribution of the next state.
• The probability of a failure:
$$\mathrm{HMM}(t) = \sum_{s \in S} a_s(1)\, p_s(t)$$
where p_s(t) is the probability of being in state s at time t; p_s(t) is updated according to the output at time t-1.
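The prediction step can be sketched as a standard forward filter: the state distribution is conditioned on the latest observed output, advanced through the transition probabilities, and the failure probability is the sum over states of a_s(1)·p_s(t). The class name, the matrix layout, and the parameter values in the usage note are hypothetical; how the model parameters are trained is not covered here.

```python
class HMMPredictor:
    """Tracks p_s(t) and reports the predicted failure probability."""

    def __init__(self, transition, output, initial):
        self.transition = transition  # transition[s][s2] = Pr[next state s2 | state s]
        self.output = output          # output[s][o] = a_s(o), with o in {0, 1}
        self.belief = list(initial)   # belief[s] = p_s(t)

    def failure_probability(self):
        return sum(self.output[s][1] * p for s, p in enumerate(self.belief))

    def observe(self, o):
        """Condition on output o (1 = failure), then advance one time step."""
        posterior = [p * self.output[s][o] for s, p in enumerate(self.belief)]
        total = sum(posterior)  # assumed non-zero: o is possible under the model
        posterior = [p / total for p in posterior]
        n = len(posterior)
        self.belief = [
            sum(posterior[s] * self.transition[s][s2] for s in range(n))
            for s2 in range(n)
        ]
```

A two-state example with a mostly healthy state and a mostly failing state: `HMMPredictor([[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.2, 0.8]], [0.5, 0.5])`. After `observe(1)`, the predicted failure probability rises because more weight moves to the failing state.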
Experimental Evaluation
Predictor Performance – Level 3
• At recall 0.5, precision is close to 0.9
[Figure legend: FWC 10, FWC 50, ExpDecay 0.99, ExpDecay 0.95, VW-Cover, HMM]
Predictor Performance – Level 6
• Level-6 degradations are harder to predict: at recall 0.5, precision is 0.4
[Figure legend: FWC 10, FWC 50, ExpDecay 0.99, ExpDecay 0.95, VW-Cover, HMM]
Predictor Performance: Conclusion
• The best predictors at levels 3 and 6 are VW-Cover and HMM
• But they only slightly outperform ExpDecay0.95, which is considerably simpler to implement
Gateway Selection
Degradation rate by gateway-selection method

Level   Best Gateway   Worst Gateway   Optimal   ExpDecay0.95   VW-Cover   Static: IP Gateway
6       1.15%          3.29%           0.08%     0.52%          0.49%      0.86%
3       3.45%          5.77%           0.45%     1.56%          1.50%      2.41%
Gateway Selection: Conclusion
• Active gateway selection resulted in a 50% reduction in the degradation rate with respect to the best single gateway.
• Static gateway selection can avoid at most 25% of degradations.
• Again, ExpDecay0.95 only slightly underperforms the best predictor (VW-Cover).
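Active selection can be sketched as choosing, at each time point, the gateway whose predictor reports the lowest failure probability, and then measuring how often the chosen gateway was degraded. The selection rule and both function names are assumptions; the slides do not spell out the exact rule.

```python
def select_gateway(predicted_failure):
    """Pick the gateway with the lowest predicted failure probability.

    predicted_failure: dict mapping gateway name -> predicted probability.
    """
    return min(predicted_failure, key=predicted_failure.get)


def degradation_rate(choices, failures_by_gateway):
    """Fraction of time points at which the chosen gateway was degraded.

    choices: gateway chosen at each time point.
    failures_by_gateway: dict mapping gateway name -> 0/1 sequence over time.
    """
    degraded = sum(failures_by_gateway[g][t] for t, g in enumerate(choices))
    return degraded / len(choices)
```

Comparing `degradation_rate` for predictor-driven choices against a constant choice gives the kind of reduction reported above.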
Performance of gateway selection as a function of recency
Correlation between coasts
• Gateway selection on a same-coast pair resulted in only a 10% reduction. Choose independent gateways.
        NJ-2, NJ-1                       CA-2, NJ-2
Level   Best gateway   Best predictor    Best gateway   Best predictor
6       1.15%          1.05%             1.15%          0.54%
3       3.45%          3.05%             3.45%          1.78%
Controlling prediction overhead
• Type of measurements:
– Active measurements:
• Initiate probes (SYN, ping, HTTP request).
• Scalability problem.
– Passive measurements:
• Collected on regular traffic.
• Controlling the prediction overhead:
– Using less-recent measurements.
– Active measurements only to a small set of destinations, which cover the majority of traffic.
– Cluster destinations. The measurements of one destination can be used to predict another.
Questions??