PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services
Ming Zhang, Chi Zhang
Vivek Pai, Larry Peterson, Randy Wang
Princeton University
2
Motivation
• Routing anomalies are common on the Internet
  - Maintenance
  - Power outage
  - Fiber cut
  - Misconfiguration
  - …
• Anomalies can affect end-to-end performance
  - Packet losses
  - Packet delays
  - Loss of connectivity
3
Background
• Anomaly detection and diagnosis are nontrivial
  - Asymmetric paths
  - Failure information propagation
  - Highly varied durations
  - Limited coverage
4
Contributions
• New techniques for
  - Anomaly detection
  - Anomaly isolation
  - Anomaly classification
• Large-scale study of anomalies
  - Broad coverage
  - High detection rate, low overhead
  - Characterization of anomalies
  - End-to-end effects
  - Benefits to host service
5
Outline
• State of the Art
• PlanetSeer Components
  - MonD – passive monitoring
  - ProbeD – active probing
• Anomaly Analysis
  - Loop-based anomaly
  - Non-loop anomaly
• Bypassing Anomalies
• Summary
6
State of the Art
• Routing messages
  - BGP: AS-level diagnosis
  - IS-IS, OSPF: within a single ISP
• Router/link traffic statistics
  - SNMP, NetFlow: proprietary
• End-to-end measurement
  - Ping, traceroute
7
End-to-End Probing
• All-pairs probes among n nodes
  - O(n^2) measurement cost
  - Not scalable as n grows
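As a rough illustration of the quadratic cost, the sketch below counts the directed source-to-destination paths that all-pairs probing would have to cover. The function name is hypothetical; the value 353 is the node count given on the ProbeD Groups slide.

```python
def all_pairs_probe_count(n: int) -> int:
    """Number of directed source -> destination paths among n nodes."""
    return n * (n - 1)


# Growth of the probe count as the number of nodes increases.
for n in (10, 100, 353):
    print(f"{n} nodes -> {all_pairs_probe_count(n)} directed paths")
```

Each tenfold increase in n raises the probe count by roughly a hundredfold, which is the scalability problem the slide refers to.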
8
Key Observation
• Combine passive monitoring with active probing
• Peer-to-Peer (P2P), Content Distribution Network (CDN)
  - Large client population
  - Geographically distributed nodes
  - Large traffic volume
  - Highly diverse paths
• The traffic generated by the services reveals information about the network.
9
Our Approach
• Host service
  - CDN
• Components
  - Passive monitoring
  - Active probing
• Advantages
  - Low overhead
  - Wide coverage
[Figure: diagram labeled Client, A, B, C, R1, R2]
10
MonD: Anomaly Detection
• Anomaly indicators
  - Time-to-live (TTL) change
    - Routing change
  - n consecutive timeouts (n = 4 in current system)
    - Idling period of 3 to 16 seconds
    - Most congestion periods < 220 ms
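The two indicators above can be sketched as a small per-flow state machine. This is a minimal illustration under stated assumptions, not MonD's actual code: the class and method names are hypothetical, and resetting the timeout counter whenever a packet arrives is an assumption. Only the threshold n = 4 comes from the slide.

```python
from dataclasses import dataclass
from typing import Optional

TIMEOUT_THRESHOLD = 4  # n = 4 in the current system, per the slide


@dataclass
class FlowState:
    """Per-flow state for the two indicators (hypothetical structure)."""
    last_ttl: Optional[int] = None
    consecutive_timeouts: int = 0

    def on_packet(self, ttl: int) -> Optional[str]:
        """Record a received packet; report a TTL change if one occurred."""
        self.consecutive_timeouts = 0  # assumption: progress ends the timeout run
        changed = self.last_ttl is not None and ttl != self.last_ttl
        self.last_ttl = ttl
        return "ttl_change" if changed else None

    def on_timeout(self) -> Optional[str]:
        """Record a timeout; report once when the run reaches the threshold."""
        self.consecutive_timeouts += 1
        if self.consecutive_timeouts == TIMEOUT_THRESHOLD:
            return "consecutive_timeouts"
        return None
```

A flow that sees TTL 52 and then TTL 50 would report `ttl_change`; a flow that times out four times in a row would report `consecutive_timeouts` on the fourth timeout.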
11
ProbeD Operation
• Baseline probes
  - When a new IP appears
  - From local node
• Forward probes
  - When a possible anomaly is detected
  - From multiple nodes (including local node)
• Reprobes
  - At 0.5, 1.5, 3.5 and 7.5 hours later
  - From local node
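The reprobe schedule can be written down directly from the offsets on the slide. The sketch below is illustrative only: the function names are hypothetical, and the only values taken from the source are the four offsets.

```python
from datetime import datetime, timedelta
from typing import List, Sequence

# Reprobe offsets in hours after detection, as listed on the slide.
REPROBE_OFFSETS_HOURS = (0.5, 1.5, 3.5, 7.5)


def reprobe_times(detected_at: datetime) -> List[datetime]:
    """Wall-clock times at which the local node would reprobe."""
    return [detected_at + timedelta(hours=h) for h in REPROBE_OFFSETS_HOURS]


def gaps_in_hours(offsets: Sequence[float] = REPROBE_OFFSETS_HOURS) -> List[float]:
    """Waiting time before each reprobe, measured from the previous probe."""
    previous = (0.0,) + tuple(offsets[:-1])
    return [later - earlier for earlier, later in zip(previous, offsets)]
```

The gaps between successive reprobes come out as 0.5, 1, 2, and 4 hours, so the waiting time doubles after each reprobe.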
12
ProbeD Groups
• 353 nodes, 145 sites, 30 groups
  - According to geographic location
  - One traceroute per group
[Figure: bar chart of the number of groups (axis 0 to 11) per region: US (edu), US (non-edu), Canada, Europe, Asia & MidE, Other]
13
Estimating Scope
• Which routers might be affected?
  - Routers that possibly change their next hops
  - Traceroutes from multiple locations can narrow the scope
[Figure: diagram labeled ra, rb, rc, rd, Client, Local ProbeD, Remote ProbeD]
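One way to read the scope idea is sketched below. It is a simplified illustration, not the procedure from the paper: it assumes each traceroute is a loop-free list of hop names, treats a router as a candidate when its baseline next hop is no longer observed, and clears a candidate when a remote traceroute still shows the baseline next hop. All function names and the example path are hypothetical; the router names follow the figure labels.

```python
from typing import Dict, List, Sequence


def next_hops(path: Sequence[str]) -> Dict[str, str]:
    """Map each hop on a path to the hop that follows it."""
    return {hop: nxt for hop, nxt in zip(path, path[1:])}


def candidate_scope(baseline: Sequence[str], current: Sequence[str]) -> List[str]:
    """Baseline hops whose next hop differs from, or is missing in, the new trace."""
    observed = next_hops(current)
    return [hop for hop, nxt in zip(baseline, baseline[1:])
            if observed.get(hop) != nxt]


def narrow_scope(candidates: Sequence[str], baseline: Sequence[str],
                 remote_paths: Sequence[Sequence[str]]) -> List[str]:
    """Drop candidates that a remote trace shows still using the baseline next hop."""
    expected = next_hops(baseline)
    cleared = set()
    for path in remote_paths:
        for hop, nxt in next_hops(path).items():
            if expected.get(hop) == nxt:
                cleared.add(hop)
    return [hop for hop in candidates if hop not in cleared]


baseline = ["local", "ra", "rb", "rc", "rd", "client"]
current = ["local", "ra"]                   # local probes stop after ra
remote = [["remote", "rc", "rd", "client"]]  # a remote probe still reaches the client
scope = narrow_scope(candidate_scope(baseline, current), baseline, remote)
print(scope)
```

In this made-up example the local trace alone leaves ra, rb, rc, and rd as candidates; the remote trace shows rc and rd forwarding as before, so only ra and rb remain.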
14
Path Diversity
• Monitoring period: 02/2004 – 05/2004
• Unique IPs: 887,521
• Traversed ASes: 10,090
[Figure: bar chart of tier coverage (0% to 100%) for Tier 1 through Tier 5, from core to edge; bar labels: 22 ASes, 215 ASes, 1392 ASes, 1420 ASes, 13872 ASes]
15
Confirming Anomalies
• Reported anomalies
  - 2,259,588
• Conditions
  - Loops
  - Route change
  - Partial unreachability
  - ICMP unreachable
• Very conservative confirmation
[Figure: pie chart: Non-anomaly 66%, Undecided 22%, Anomaly 12%]
16
Confirmed Anomaly Breakdown
• Confirmed anomalies
  - 271,898
  - 2 per minute
  - 100x more
• Temp anomalies
  - Inconsistent probes
[Figure: pie chart: Path Change 44%, Other Outage 23%, Temp Anomalies 16%, Fwd Outage 9%, Persist Loop 7%, Temp Loop 1%]
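The headline figures can be cross-checked with a few lines of arithmetic. The counts come from the slides; the three-month period is the one stated on the Summary slide, and treating a month as 30 days is an assumption made here for the estimate.

```python
reported = 2_259_588   # reported anomalies (Confirming Anomalies slide)
confirmed = 271_898    # confirmed anomalies (this slide)

# Assumption: 3 months of 30 days each.
minutes_monitored = 3 * 30 * 24 * 60

confirmed_share = confirmed / reported
confirmed_per_minute = confirmed / minutes_monitored

print(f"confirmed share: {confirmed_share:.1%}")
print(f"confirmed per minute: {confirmed_per_minute:.2f}")
```

Both results are consistent with the slides: roughly 12% of reported anomalies are confirmed, at a rate of about 2 per minute.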
17
Scope of Loops
• How many routers or ASes are involved?
  - Temp loops involve more routers than persistent loops
  - 97% of persistent loops and 51% of temp loops contain 2 hops
[Figure: bar chart (0% to 100%) of loop length in hops (2, 3, 4, 5, 6+) for persistent and temp loops; annotations: 1% of persistent loops cross ASes, 15% of temp loops cross ASes]
18
Distribution of Loops
• Many persistent loops in tier-3, few in tier-1
• Worst 10% of tier-1 ASes – implications for largest ISPs
  - 20% traffic
  - 35% persistent loops
[Figure: bar chart (0% to 50%) for Tier 1 through Tier 5 showing persistent loops, temp loops, and traffic]
19
Duration of Persistent Loops
• How long do persistent loops last?
  - Either resolve quickly or last for an extended period
[Figure: bar chart (0% to 60%) of persistent loop duration: <0.5 hrs, <1.5 hrs, <3.5 hrs, <7.5 hrs, >= 7.5 hrs]
20
Scope of Forward Anomalies
• How many routers or ASes are affected?
  - 60% of outages within 1 hop
  - 75% of outages and 68% of changes within 4 hops
[Figure: fraction (0 to 1) versus hops (0 to 14) for changes and outages; annotations: 78% of outages within 2 ASes, 57% of changes within 2 ASes]
21
Location of Forward Anomalies
• How close are the anomalies to the edges of the network?
  - 44% of outages at the last hop
  - 72% of outages and 40% of changes within 4 hops
[Figure: fraction (0 to 1) versus hops (0 to 10) for changes and outages]
22
Distribution of Forward Anomalies
• Which ASes are affected?
  - Tier-1 ASes most stable
  - Tier-3 ASes most likely to be affected
[Figure: bar chart (0% to 50%) for Tier 1 through Tier 5 showing changes, outages, and traffic]
23
Overlay Routing
• Use alternate path when default path fails
[Figure: diagram labeled source, destination, intermediate]
24
Bypassing Anomalies
• How useful is overlay routing for bypassing failures?
  - Effective in 43% of 62,815 failures, lower than previous studies
  - 32% of bypass paths inflate RTTs by more than a factor of two
[Figure: fraction (0 to 1) versus bypass ratio (0.1 to 100)]
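The sketch below shows one way to pick a bypass path and measure its latency inflation. It is an illustration under assumptions: the function name and input format are hypothetical, and "bypass ratio" is taken here to mean the RTT of the bypass path divided by the RTT of the direct path before the failure, which the slide does not define.

```python
from typing import Dict, Optional, Tuple


def best_bypass(direct_rtt_ms: float,
                via_rtts_ms: Dict[str, Optional[float]]
                ) -> Optional[Tuple[str, float]]:
    """Pick the lowest-RTT intermediate that still reaches the destination.

    via_rtts_ms maps an intermediate node to the RTT of the path
    source -> intermediate -> destination, or None if that path also fails.
    Returns (intermediate, bypass ratio), or None if no bypass works.
    """
    working = {node: rtt for node, rtt in via_rtts_ms.items() if rtt is not None}
    if not working:
        return None
    node = min(working, key=working.get)
    return node, working[node] / direct_rtt_ms


# Made-up measurements: b cannot reach the destination either.
print(best_bypass(100.0, {"a": 250.0, "b": None, "c": 200.0}))
```

Under this definition a ratio above 2 corresponds to a bypass path that inflates RTT by more than a factor of two, as the second sub-bullet describes.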
25
Summary
• Confirm 272,000 anomalies in 3 months
• Persistent and temporary loops
  - Persistent loops have narrower scope and either resolve quickly or last for a long time
• Path outages and changes
  - Outages closer to edge, narrower scope
• Anomaly distribution
  - Skewed. Tier-1 most stable. Tier-3 most problematic.
• Overlay routing
  - Bypasses 43% of failures, with latency inflation
26
More Information
• In the paper
  - More details about anomaly characteristics
  - End-to-end impacts
  - Classification methodology
  - Optimizations to reduce overheads & improve confirmation rate
• [email protected]
• http://www.cs.princeton.edu/nsg/infoplane
27
Classifying Anomalies
• Temporary vs. persistent loops
  - Whether the probe exits the loop before reaching the maximum hop
• Path changes vs. outages
  - Changes: follow different paths to clients
  - Outages: stop at intermediate hops
[Figure: diagram labeled ProbeD, Client]
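The distinctions above can be sketched as a single classification function over a traceroute. This is a simplified illustration, not the classification methodology from the paper: it assumes a traceroute is a list of responding hop names, treats any repeated hop as a loop, and uses a hypothetical maximum hop count of 30.

```python
from typing import Sequence


def classify_probe(baseline: Sequence[str], current: Sequence[str],
                   client: str, max_hops: int = 30) -> str:
    """Label a traceroute relative to the baseline path to the same client."""
    has_loop = len(set(current)) < len(current)   # some hop appears twice
    reached_client = bool(current) and current[-1] == client
    if has_loop:
        # Persistent: still inside the loop when the hop limit is reached.
        still_looping = len(current) >= max_hops and not reached_client
        return "persistent_loop" if still_looping else "temporary_loop"
    if not reached_client:
        return "outage"          # probe stops at an intermediate hop
    if list(current) != list(baseline):
        return "path_change"     # reaches the client over a different path
    return "no_anomaly"
```

For example, a trace that bounces between two routers until the hop limit would be labeled a persistent loop, while one that revisits a router and then reaches the client would be labeled a temporary loop.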
28
Non-anomalies
• Non-anomalies
  - Ultrashort anomalies
  - Path-based TTL
  - Aggressive timeout
29
Identifying Forward Outages
• Forward outages
  - Route change
  - ICMP dest unreachable
  - Forward timeout
[Figure: pie chart: Route Change 53%, Fwd Timeout 35%, ICMP Unreach 12%]
30
Loop Effect on RTT
• How do loops affect RTTs?
  - Loops can incur high latency inflation
[Figure: fraction (0 to 1) versus RTT (0 to 4 seconds) for four curves: Persist loop, Persist loop normal, Temp loop, Temp loop normal]
31
Loop Effect on Loss Rate
• How do loops affect loss rates?
  - 65% of temporary and 55% of persistent loops preceded by loss rates exceeding 30%
[Figure: fraction (0 to 1) versus loss rate (0 to 1) for persistent and temp loops]
32
Forward Anomaly Effect on RTT
• How do forward anomalies affect RTTs?
  - Outages and changes can incur latency inflation
  - Outages have more negative effect on RTTs
[Figure: fraction (0 to 1) versus RTT (0 to 4 seconds) for four curves: change, change normal, outage, outage normal]
33
Forward Anomaly Effect on Loss Rate
• How do forward anomalies affect loss rates?
  - 45% of outages and 40% of changes preceded by loss rates exceeding 30%
[Figure: fraction (0 to 1) versus loss rate (0 to 1) for changes and outages]
34
Reducing Measurement Overhead
• Can we reduce the number of probes?
  - 15 probes can achieve the same accuracy in 80% of cases
  - Flow-based TTL
[Figure: fraction (0 to 1) versus number of probes (0 to 30)]
35
Traffic Breakdown By Tiers
[Figure: pie chart of traffic by tier: Tier 1 20%, Tier 2 23%, Tier 3 24%, Tier 4 7%, Tier 5 26%]