Studying Black Holes on the Internet with Hubble
Ethan Katz-Bassett, Harsha V. Madhyastha, John P. John, Arvind Krishnamurthy, David Wetherall, Thomas Anderson
University of Washington
NSDI, April 2008
This work partially supported by
Global Reachability
• When an address is reachable from every other address
• Most basic goal of Internet, especially BGP
• “There is only one failure, and it is complete partition” (Clark, Design Philosophy of the DARPA Internet Protocols)
• Physical path ⇒ BGP path ⇒ traffic reaches
• Black hole: BGP path, but traffic persistently does not reach
Does Internet give global reachability?
• From use, seems to usually work
• Can we assume the protocols just make it work?
• “Please try to reach my network 194.9.82.0/24 from your networks…. Kindly anyone assist.” (Operator on NANOG mailing list, March 2008)
Hubble System Goal
In real-time on a global scale, automatically monitor long-lasting reachability problems and classify causes
Problem Seen by Hubble on Oct. 8, 2007
1. Target Identification: distributed ping monitors detect when the destination becomes unreachable
2. Reachability analysis: distributed traceroutes determine the extent of unreachability
3. Problem Classification
a) group failed traceroutes
b) spoofed probes to isolate direction of failure
[Figure: animation of the event with timestamps 5:09 a.m., 5:11 a.m., and 5:13 a.m. Ping annotations: "Fr:X To:D Ping?", "Fr:D To:X Ping!", "Fr:Z To:D Ping?". Spoofed-probe annotations: "Fr:Y To:D Ping?", "Fr:D To:Y Ping!", with results "D to Y works!", "Y to D fails!", "D to Z works!", "Z to D fails!"]
Architecture: Detect Problem
• Ping prefix to check if still reachable
  • Every 2 minutes from PlanetLab
  • Report target after series of failed pings
• Maintain BGP tables from RouteViews feeds
  • Allows IP to AS mapping
  • Identify prefixes undergoing BGP changes as targets
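The "report target after series of failed pings" rule can be sketched as a small per-prefix counter. This is only an illustration: the class name, the method name, and the threshold of 2 consecutive failures are assumptions, not details given on the slide.

```python
from collections import defaultdict


class PingMonitor:
    """Tracks consecutive failed pings per prefix (hypothetical helper)."""

    def __init__(self, fail_threshold=2):
        # fail_threshold is an assumed value, not taken from the slides.
        self.fail_threshold = fail_threshold
        self.consecutive_failures = defaultdict(int)

    def record(self, prefix, reachable):
        """Record one ping round; return True if the prefix should be reported."""
        if reachable:
            # Any successful ping resets the failure streak.
            self.consecutive_failures[prefix] = 0
            return False
        self.consecutive_failures[prefix] += 1
        return self.consecutive_failures[prefix] >= self.fail_threshold
```

Calling `record` once per 2-minute ping round would flag a prefix only after the streak of failures reaches the threshold, so a single lost ping does not trigger traceroutes.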
Architecture: Assess Extent of Problem
• Traceroutes to gather topological data
  • Keep probing while problem persists
  • Every 15 minutes from 35 PlanetLab sites
• Analyze which traceroutes reach
  • BGP table to map addresses to ASes
  • Alias information to map interfaces to routers
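Mapping a traceroute hop to an AS from a BGP table is a longest-prefix match. The sketch below uses a linear scan for clarity. The function names and table contents are illustrative assumptions (documentation-range prefixes and AS numbers), and the slides do not say how Hubble implements the lookup.

```python
import ipaddress


def build_table(entries):
    """entries: iterable of (prefix_string, origin_asn) pairs."""
    return [(ipaddress.ip_network(p), asn) for p, asn in entries]


def ip_to_as(table, address):
    """Return the origin AS of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(address)
    best = None
    for net, asn in table:
        # Prefer the most specific (longest) prefix that contains the address.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, asn)
    return best[1] if best else None


# Illustrative table only; these are documentation-range values.
table = build_table([
    ("192.0.2.0/24", 64500),
    ("192.0.2.128/25", 64501),
    ("198.51.100.0/24", 64502),
])
```

A production system would more likely use a trie or radix tree, since a full BGP table holds hundreds of thousands of prefixes.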
Architecture: Classify Problem
• To aid operators in diagnosis and repair:
  • Which ISP contains problem?
  • Which routers?
  • Which destinations?
• Real-time, automated classification
• Find common entity that explains substantial number of failed traceroutes to a prefix
  • Does not have to explain all failed traceroutes
  • Not necessarily pinpointing exact problem
Classifying with Current Topology
• Group failed/successful traceroutes by last AS, router
• Example: Router problem
  • No probes reach P through router R
  • Some reach through R’s AS
  • 28% of classified problems
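The router-problem pattern above can be sketched by grouping failed traceroutes on their last hop and comparing against the successful ones. The data layout (dicts with `reached` and `hops`) and the function name are assumptions made for this example, and the check is a simplification of whatever Hubble actually runs.

```python
from collections import Counter


def find_router_problems(traceroutes):
    """Return (router, asn, failed_count) tuples matching the router-problem pattern.

    Each traceroute is a dict with 'reached' (bool) and 'hops', a list of
    (router_id, asn) tuples in path order. This layout is an assumption.
    """
    ok = [t for t in traceroutes if t["reached"]]
    failed = [t for t in traceroutes if not t["reached"] and t["hops"]]

    routers_on_ok = {r for t in ok for r, _ in t["hops"]}
    ases_on_ok = {a for t in ok for _, a in t["hops"]}

    # Group failed traceroutes by the last (router, AS) they reached.
    last_hops = Counter(t["hops"][-1] for t in failed)

    problems = []
    for (router, asn), count in last_hops.items():
        # No probe reaches P through R, but some reach through R's AS.
        if router not in routers_on_ok and asn in ases_on_ok:
            problems.append((router, asn, count))
    return sorted(problems)
```

The returned count shows how many failed traceroutes each candidate router explains, which fits the stated goal of finding an entity that explains a substantial number of failures rather than all of them.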
Classifying with Historical Topology
• Daily probes from PlanetLab to all prefixes
• Gives baseline view of paths before problems
• Example: “Next hop” problem
  • Paths previously converged on router R
  • Now terminate just before R
  • 14% of classified problems
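A "next hop" check compares where failed paths stop now with where the baseline paths went next. In this sketch, the data layout, the function name, and the `min_share` threshold are all assumptions; the slides give only the qualitative rule.

```python
from collections import Counter


def next_hop_suspect(baseline_paths, failed_last_hops, min_share=0.5):
    """Find a router R that failed paths now stop just before.

    baseline_paths: dict vantage -> list of router ids from an earlier,
        working traceroute (assumed data layout).
    failed_last_hops: dict vantage -> last router reached by the current,
        failing traceroute.
    min_share: assumed fraction of failed paths that must agree on R.
    """
    expected_next = Counter()
    for vantage, last in failed_last_hops.items():
        path = baseline_paths.get(vantage, [])
        if last in path:
            i = path.index(last)
            if i + 1 < len(path):
                # The hop that used to follow the current last hop.
                expected_next[path[i + 1]] += 1
    if not expected_next:
        return None
    router, count = expected_next.most_common(1)[0]
    return router if count / len(failed_last_hops) >= min_share else None
```

If most failing paths used to continue to the same router R and now end one hop earlier, R is returned as the suspect; otherwise the function returns `None`.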
Classifying with Direction Isolation
• Internet paths can be asymmetric
• Traceroutes only return routers on forward path
  • Might assume last hop is problem
  • Even so, require working reverse path
  • Hard to determine reverse path
• Isolate forward from reverse to test individually
• Without node behind problem, use spoofed probes
  • Spoof from S to check forward path from S
  • Spoof as S to check reverse path back to S
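The two spoofed probes give two observations that together decide the failing direction. The sketch assumes a helper vantage point H that can still reach destination D. The function name and the returned labels are illustrative, not Hubble's own terminology.

```python
def isolate_direction(forward_probe_answered, reverse_probe_answered):
    """Classify a failed S<->D path from two spoofed-probe outcomes.

    forward_probe_answered: S sent a probe to D spoofed as helper H, and H
        received D's reply (so the S -> D direction delivered the probe).
    reverse_probe_answered: H sent a probe to D spoofed as S, and S received
        D's reply (so the D -> S direction delivered the reply).
    """
    if forward_probe_answered and reverse_probe_answered:
        # Each direction worked in isolation, so nothing is pinned down.
        return "not isolated"
    if forward_probe_answered:
        return "reverse"  # forward works, so the reverse path is failing
    if reverse_probe_answered:
        return "forward"  # reverse works, so the forward path is failing
    return "both or unknown"
```

For example, `isolate_direction(True, False)` returns `"reverse"`: the probe from S arrived at D, so the loss must be on the way back to S.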
Classifying with Direction Isolation
• Hubble deployment on RON employs spoofed probes
  • 6 of 13 RON nodes permit source spoofing
  • PlanetLab does not support source spoofing
• Example: Multi-homed provider problem
  • Probes through Provider B fail
  • Some reach through Provider A
  • Like Cox/USC
  • 6% of classified problems
Architecture: Summary of Approach
• Synthesis of multiple information sources
  • Passive monitoring of route advertisements
  • Active monitoring from distributed vantage points
• Historical monitoring data to enable troubleshooting
• Topological classification and spoofing point at problem
Evaluation
• Target Identification
  • How much of the Internet does Hubble monitor?
• Reachability Analysis
  • What percentage of the various paths to a prefix does Hubble analyze?
• Problem Classification
  • How often can Hubble identify a common entity that explains the failed paths to a prefix?
  • How often does spoofing isolate the failure direction?
For further evaluation, please see the paper.
How much does Hubble monitor?
• Every 2 minutes:
  • 110,000 prefixes
  • 89% of Internet’s edge address space
  • 92% of edge ASes
  • Origin ASes for 99% of 14M BitTorrent users
What % of paths does Hubble monitor?
[Figure: AS-level topology with tiers labeled Tier 1, Transit, and Stub; nodes AT&T, Sprint, Cenic, Gigapop, Abilene, UW, WSU, UT, UM, MIT, and Intel]
• Compare with BGP paths of 447 RIPE peers (downhill ASes)
• PlanetLab’s restricted size and homogeneity limit uphill
• 90% of our failed traceroutes terminate within 2 AS hops of prefix’s origin
• Example:
  • BGP ASes: { AT&T, Sprint, Gigapop, Cenic, Intel }
  • Also on traceroutes: { Sprint, Gigapop, Cenic, Intel }
  • Coverage for Intel prefix: 4 of 5 downhill ASes = 80%
• Overall for prefixes monitored by Hubble:
  • For >60% of prefixes, traverse ALL downhill RIPE ASes
  • For 90% of prefixes, traverse more than half the ASes
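The coverage figure for the Intel prefix is a set intersection: the share of downhill ASes on BGP paths that also appear on Hubble's traceroutes. The function name below is an assumption; the two AS sets are the ones listed on the slide.

```python
def downhill_coverage(bgp_ases, traceroute_ases):
    """Return (covered, total, fraction) of BGP downhill ASes seen on traceroutes."""
    bgp = set(bgp_ases)
    seen = bgp & set(traceroute_ases)
    return len(seen), len(bgp), len(seen) / len(bgp)


# AS sets taken from the Intel example on the slide.
bgp = {"AT&T", "Sprint", "Gigapop", "Cenic", "Intel"}
traced = {"Sprint", "Gigapop", "Cenic", "Intel"}
```

With these sets, `downhill_coverage(bgp, traced)` gives 4 of 5 ASes covered, matching the 80% on the slide; AT&T is the one downhill AS the traceroutes miss.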
How often can Hubble classify?
• 9 classes currently
  • Based on topology
  • Point to an AS and/or router
• Results from first week of February 2008
  • Automatically classified 375,775/457,960 (82%) of problems as they occurred
How often does spoofing work?
• When a RON path works and another does not:
  • Isolate 68% of failures from spoofing sources
  • 47% forward, 21% reverse
How long do black holes last?
• 3 week study starting September 17, 2007
• 31,000 black holes involving 10,000 prefixes
• 20% lasted at least 10 hours!
• 68% were cases of partial reachability
• Partial reachability:
  • Can’t be just hardware failure
  • Configuration/policy
Other Measurement Results
• Can’t find problems using only BGP updates
  • Only 38% of problems correlate with RouteViews updates
• Multi-homing may not give resilience against failure
  • 100s of multi-homed prefixes had provider problems like Cox/USC, and ALL occurred on path TO prefix
• Inconsistencies across an AS
  • For an AS responsible for partial reachability, usually some paths work and some do not
• Path changes accompany failures
  • 3/4 router problems are with routers NOT on baseline path
Conclusions and Future Work
• Hubble: working real-time system
• Lots of reachability problems, some long lasting
• Baseline/fine-grained data enable problem classification
• Spoofing to isolate direction of path failures
http://hubble.cs.washington.edu
Uses iPlane, MaxMind, Google Maps
Thanks!
http://hubble.cs.washington.edu
Long term prospects for spoofing?
• Support for spoofing:
  • No complaints about our spoofed probes
  • Can receive spoofed probes at PlanetLab
  • PlanetLab support in future kernels?
  • Router vendor talking to us about router support for measurements
• Alternatives to spoofing:
  • Traceroute servers behind problems
  • End-hosts behind problems
Comparison to PlanetSeer [OSDI ‘04]
• Most similar system
• Passively monitors CoDeeN clients, probes on anomalies
• Different and complementary analysis
• PlanetSeer:
  • Clients that connected within 15 minutes
  • 43% edge ASes (sum over 3 months)
  • Not problems that prevent access to CDN
• Hubble:
  • All prefixes every 2 minutes
  • 92% edge ASes (every 2 minutes)
  • All partial or complete reachability problems
Characteristics of Problems of Interest
• Routable prefix present in BGP tables
• Persistent through 2 rounds of probes
• Routing infrastructure failures
  • Not simply end-system/end-network failure
  • Judgments based on connectivity to origin AS
• Not simply source problem
  • Monitor if less than 90% of vantages reach
  • Based on 4 months of probes to 110K prefixes
How well does Hubble work?
• Scale:
  • 89% of the Internet’s edge address space
  • 92% of edge ASes
  • Origin ASes for 99% of 14M BitTorrent users
• Effectiveness:
  • Finds 85% of black holes, 95% of those that last at least 1 hr [compared to pervasive approach]
• Cost:
  • 5.5% of the probes required by pervasive approach
Does spoofing work?
• When 3+ spoofing RON nodes fail to reach:
  • Isolate all failed paths in 61% of cases
  • 42% forward, 16% reverse, 3% mixed
  • For 95% of cases, all paths isolate to same direction
Provider(s) Unreachable
• No probes reach even the provider(s) of Origin AS
• Probes fail in AS upstream
3% of classified problems (1-13% at any point in time)
Single-homed Origin AS Down
• No probes reach single-homed Origin AS
• Some reach its provider
17% of classified problems (4-37% at any point in time)
Multi-homed Origin AS Down
• No probes reach multi-homed Origin AS
• Some reach its provider(s)
9% of classified problems (2-30% at any point in time)
Provider AS Problem for Multi-Homed
• Probes through Provider B fail to reach P
• Some reach through Provider A
6% of classified problems (1-17% at any point in time)
Non-Provider AS Problem
• Probes through Non-Provider C fail
• Some reach through other ASes
17% of classified problems (1-37% at any point in time)
Router Problem on Known Path
• Last hop router R was seen on recent paths reaching P
• No probes reach P through R
• Some reach through R’s AS
7% of classified problems (1-40% at any point in time)
[Figure label: Historical Traceroutes]
Router Problem on New Path
• Last hop router R not seen on recent paths reaching P
• No probes reach P through R
• Some reach through R’s AS
21% of classified problems (1-40% at any point in time)
Next Hop Problem on Known Paths
• No last hop router or AS explains problem
• Paths previously converged on router R
• Now terminate just before R
14% of classified problems (1-39% at any point in time)
Topological classification results
Of ones we classify: overall (range over time)
1. Provider(s) unreachable: 3% (1-13%)
2. Single-homed origin AS down: 17% (4-37%)
3. Multi-homed origin AS down: 9% (2-30%)
4. Provider AS problem for multi-homed origin AS: 6% (1-17%)
5. Non-provider AS problem: 17% (1-37%)
6. Router problem on old path: 7% (1-40%)
7. Router problem on new path: 21% (1-40%)
8. Next hop problem on known paths: 14% (1-39%)
9. Prefix unreachable: 22% (7-79%)