Network Measurements in Overlay Networks

Page 1: Network Measurements in Overlay Networks

Network Measurements in Overlay Networks

Richard Cameron Craddock

School of Electrical and Computer Engineering

Georgia Institute of Technology

Page 2: Network Measurements in Overlay Networks

Outline

- Resilient Overlay Networks
- Best-Path vs. Multi-Path Overlay Routing
- Measuring the Effect of Internet Path Faults on Reactive Routing

Page 3: Network Measurements in Overlay Networks

Resilient Overlay Networks

D. Andersen, H. Balakrishnan, F. Kaashoek, and R. Morris
Proc. 18th ACM SOSP, October 2001

ANDERSEN, D. G., BALAKRISHNAN, H., KAASHOEK, M. F., AND MORRIS, R. Resilient Overlay Networks. In Proc. 18th ACM SOSP (Banff, Canada, Oct. 2001), pp. 131–145.

Page 4: Network Measurements in Overlay Networks

Resilient Overlay Networks

- RONs seek to quickly detect and respond to network failures
- Network nodes participate in a limited-size overlay network
- Overlay nodes cooperate with one another to forward data on behalf of any other nodes in the RON
- RON detects problems by aggressively probing the paths connecting its nodes
- RON nodes exchange information about the quality of paths among themselves, and build forwarding tables based on a variety of path metrics
  - Latency, packet loss, and available throughput

Page 5: Network Measurements in Overlay Networks

Resilient Overlay Networks Goals

- Failure detection and recovery in less than 20 seconds
- Tighter integration of routing and path selection with the application
- Expressive policy routing

Page 6: Network Measurements in Overlay Networks

Active Probing

- Each RON node probes every other node once per PROBE_INTERVAL, plus a random jitter of 1/3 PROBE_INTERVAL
- A probe not returned within PROBE_TIMEOUT is considered lost
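
The probing schedule can be sketched roughly as follows. This is a minimal illustration, not RON's implementation: the numeric values of `PROBE_INTERVAL` and `PROBE_TIMEOUT` are placeholders, the helper names are hypothetical, and the jitter is assumed to be drawn uniformly between 0 and 1/3 of `PROBE_INTERVAL`.

```python
import random

PROBE_INTERVAL = 12.0  # seconds; placeholder value, not taken from the slides
PROBE_TIMEOUT = 3.0    # seconds; placeholder value, not taken from the slides


def next_probe_delay(rng=random):
    """Time to wait before probing the same node again.

    Assumption: the jitter is uniform on [0, PROBE_INTERVAL / 3].
    """
    return PROBE_INTERVAL + rng.uniform(0.0, PROBE_INTERVAL / 3.0)


def probe_lost(sent_at, reply_at):
    """A probe counts as lost if no reply arrived within PROBE_TIMEOUT."""
    return reply_at is None or (reply_at - sent_at) > PROBE_TIMEOUT
```

The jitter keeps the nodes from probing in lockstep; the exact distribution used here is a guess.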

Page 7: Network Measurements in Overlay Networks

Link-State Dissemination

- RON nodes disseminate their performance metrics to the other nodes every ROUTING_INTERVAL
- This information is sent over the RON overlay
- The only time that a RON node has incomplete information about any other node is when it is completely cut off from the overlay

Page 8: Network Measurements in Overlay Networks

Outage Detection

- On the loss of a probe, several consecutive probes spaced by PROBE_TIMEOUT are sent out
- If OUTAGE_THRESH probes elicit no response, the path is considered “dead”
- If even one probe gets a response, then high-frequency probing is cancelled
- Paths experiencing outages are rated on their packet loss history
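
One way to read the outage rule is as a small decision function over the follow-up probes. The sketch below is illustrative only: the value of `OUTAGE_THRESH` is a placeholder, the function name is hypothetical, and it assumes the threshold counts only the high-frequency follow-up probes, not the initial lost probe.

```python
OUTAGE_THRESH = 4  # placeholder value, not taken from the slides


def path_is_dead(followup_results, outage_thresh=OUTAGE_THRESH):
    """Decide whether a path is dead after an initial probe loss.

    followup_results: booleans for the high-frequency probes sent after the
    first loss, in order; True means the probe got a response.
    """
    unanswered = 0
    for answered in followup_results:
        if answered:
            # A single response cancels high-frequency probing.
            return False
        unanswered += 1
        if unanswered >= outage_thresh:
            return True
    # Fewer than outage_thresh probes sent so far: no verdict yet.
    return False
```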

Page 9: Network Measurements in Overlay Networks

Latency and Loss Rate

- Latency is the round-trip time calculated from the probes
  - Latency = A * Latency + (1-A) * New Sample
  - A is chosen to be 0.9
- Overall latency is the SUM of the individual virtual link latencies
- Loss rate is the average of the last k = 100 probe samples
- If losses are assumed independent, the overall path delivery rate is the PRODUCT of the individual virtual link delivery rates (1 - loss rate), so the path loss rate is 1 minus that product
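
The latency and loss bookkeeping can be sketched as below. The names are hypothetical; A = 0.9 and k = 100 come from the slide. For the path loss rate, the sketch assumes independent losses, under which per-link delivery probabilities multiply, so the path loss rate is computed as 1 minus that product.

```python
from collections import deque
from math import prod

A = 0.9   # smoothing weight from the slide
K = 100   # loss window size from the slide


def update_latency(current, sample, a=A):
    """Exponentially weighted moving average of round-trip time."""
    return a * current + (1 - a) * sample


class LossWindow:
    """Loss rate over the most recent k probe samples."""

    def __init__(self, k=K):
        self.samples = deque(maxlen=k)

    def record(self, lost):
        self.samples.append(1 if lost else 0)

    def rate(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0


def path_latency(link_latencies):
    """Latencies add along the virtual links of a path."""
    return sum(link_latencies)


def path_loss(link_loss_rates):
    """Path loss rate, assuming losses on each link are independent."""
    return 1 - prod(1 - p for p in link_loss_rates)
```

For example, two links with loss rates 0.1 and 0.2 give a path delivery rate of 0.9 * 0.8 = 0.72, so a path loss rate of 0.28.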

Page 10: Network Measurements in Overlay Networks

Throughput

$$\text{score} = \frac{\sqrt{1.5}}{rtt \cdot \sqrt{p}} \qquad (2)$$

- Throughput is calculated using (2)
  - p is the one-way packet loss probability
    - Estimated as half of the calculated two-way packet loss probability
  - rtt is the end-to-end round-trip time
- Throughput cannot be aggregated across virtual links
- In order to simplify the selection of throughput-optimized paths, only one intermediate node is considered
- An indirect path is only chosen if it improves throughput by 50%
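
The throughput-based selection can be sketched as follows, assuming the score sqrt(1.5) / (rtt * sqrt(p)) with p taken as half the two-way loss. The function names, the input shape (end-to-end `(rtt, two_way_loss)` tuples), and the `min_loss` floor that avoids division by zero on loss-free paths are assumptions made for illustration.

```python
from math import sqrt


def throughput_score(rtt, two_way_loss, min_loss=1e-4):
    """Score a path from its round-trip time and two-way loss probability.

    p is estimated as half of the two-way loss; min_loss is an assumed floor
    so that a loss-free path does not divide by zero.
    """
    p = max(two_way_loss / 2.0, min_loss)
    return sqrt(1.5) / (rtt * sqrt(p))


def choose_path(direct, one_hop_paths, required_gain=0.5):
    """Pick the direct path unless a one-hop path scores at least 50% higher.

    direct: (rtt, two_way_loss) for the direct path.
    one_hop_paths: dict mapping an intermediate node name to the end-to-end
    (rtt, two_way_loss) of the path through that node.
    """
    direct_score = throughput_score(*direct)
    best_name, best_score = "direct", direct_score
    for name, (rtt, loss) in one_hop_paths.items():
        score = throughput_score(rtt, loss)
        if score >= (1 + required_gain) * direct_score and score > best_score:
            best_name, best_score = name, score
    return best_name
```

With equal round-trip times, cutting the loss probability by a factor of four doubles the score, which clears the 50% threshold; a smaller loss reduction may not.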

Page 11: Network Measurements in Overlay Networks

Experiment

- The raw measurement data consists of probe packets
- To probe, each RON node independently repeated the following steps:
  - Pick a random node j
  - Pick a probe type from one of {direct, latency, loss} using round-robin
  - Send a probe to j
  - Delay for a random interval between 1 and 2 seconds

Page 12: Network Measurements in Overlay Networks

Results

- Two distinct datasets
  - RON1
    - 64 hours between 3/21/2001 and 3/23/2001
    - 12 nodes with 132 distinct paths
    - Traverses 36 different ASs and 74 distinct inter-AS links
  - RON2
    - 85 hours between 5/7/2001 and 5/11/2001
    - 16 nodes with 240 distinct paths
    - Traverses 50 ASs and 118 different AS links

Page 13: Network Measurements in Overlay Networks

Results

Page 14: Network Measurements in Overlay Networks

Overcoming Path Outages

- A RON win occurred when Internet loss was >= p% and RON loss was < p%
- There were 10 complete communication outages, and RON routed around all of them

Page 15: Network Measurements in Overlay Networks

Loss Rate

- Improved loss rate by more than 0.05 more than 5% of the time in RON1
- RON can make loss rates worse too
- Improved loss rate by more than 0.04 more than 5% of the time in RON2

Page 16: Network Measurements in Overlay Networks

Handling Packet Floods

- Three hosts connected in a triangle
- Indirect routing is possible through the third node but not preferable
- Flood attack beginning at 5 s
- RON recovered in 13 s
- Non-RON doesn’t recover

Page 17: Network Measurements in Overlay Networks

Latency

- RON reduces communication latency in many cases
  - 11% saw improvements of 40 ms or more in RON1
  - 8.2% saw improvements of 40 ms or more in RON2

Page 18: Network Measurements in Overlay Networks

TCP Throughput

- RON’s throughput-optimizing router does not attempt to change paths unless it obtains a 50% improvement in throughput
- 5% of samples doubled their throughput
- 2% increased their throughput by more than 5 times
- 9 samples improved by a factor of 10

Page 19: Network Measurements in Overlay Networks

Conclusions

- Resilient overlay networks can greatly improve the reliability of the Internet
  - RON was able to overcome 100% (RON1) and 60% (RON2) of the several hundred observed outages
- RON takes 18 seconds on average to detect and recover from a fault
- RON can substantially improve loss rate, latency and TCP throughput
- Forwarding packets via at most one intermediate node is sufficient for fault recovery and latency improvements

Page 20: Network Measurements in Overlay Networks

Best-Path vs. Multi-Path Overlay Routing

D. Andersen, A. Snoeren, and H. Balakrishnan
IMC, Miami, FL, October 2003

D. G. Andersen, A. C. Snoeren, and H. Balakrishnan, "Best-Path vs Multi-Path Overlay Routing," IMC 2003.

Page 21: Network Measurements in Overlay Networks

Best-Path vs. Multi-Path Overlay Routing

- Best-path and multi-path routing techniques have been proposed to reduce packet loss
- These techniques are compared in terms of loss rate and latency reduction
- This comparison is made in the context of an overlay network

Page 22: Network Measurements in Overlay Networks

Best-Path vs. Multi-Path Overlay Routing

- Multi-Path Routing
  - Packets are duplicated and sent on different paths through the overlay
- Reactive Routing
  - Overlay nodes constantly measure the paths between themselves using probes
  - Packets are sent on either the direct path or forwarded via a sequence of other overlay nodes

Page 23: Network Measurements in Overlay Networks

Routing Methods

- Direct: Single packet using the direct path
- Loss: Loss-optimized reactive routing
- Lat: Latency-optimized reactive routing

Page 24: Network Measurements in Overlay Networks

Routing Methods

- Direct rand: 2-redundant multi-path routing
  - First packet is sent directly
  - Second packet is sent randomly
- Lat Loss: 2-redundant multi-path routing with reactive routing
  - First packet is sent on the latency-optimized link
  - Second packet is sent on the loss-optimized link

Page 25: Network Measurements in Overlay Networks

Routing Methods

- Direct direct: 2-redundant direct routing with back-to-back packets on the same path
- DD 10 ms: Direct direct with 10 ms delay between packets
- DD 20 ms: Direct direct with 20 ms delay between packets

Page 26: Network Measurements in Overlay Networks

Method

- Nodes periodically initiate one or two request packets to a target
- Each request has a random 64-bit identifier which is logged along with send and receive times
- Nodes cycle through the different request types
- Targets are chosen randomly
- Nodes delay for a random period between 0.6 and 1.2 seconds

Page 27: Network Measurements in Overlay Networks

Base Network Statistics

Page 28: Network Measurements in Overlay Networks

Packet Loss Rate

Page 29: Network Measurements in Overlay Networks

Packet Loss Rate

Page 30: Network Measurements in Overlay Networks

Conditional Loss Probability and Latency

Page 31: Network Measurements in Overlay Networks

Conclusion

- There is loss and failure independence in the Internet
  - 40% of observed losses were avoidable
- The benefits of multi-path routing can be achieved with direct duplication
  - 10 or 20 ms delay between packets
- Reactive and redundant routing can work in concert to reduce loss
  - 45% decrease in packet loss rate

Page 32: Network Measurements in Overlay Networks

Measuring the Effect of Internet Path Faults on Reactive Routing

Nick Feamster, David G. Andersen, Hari Balakrishnan, and M. Frans Kaashoek
SigMetrics, San Diego, June 2003

FEAMSTER, N., ANDERSEN, D., BALAKRISHNAN, H., AND KAASHOEK, M. F. Measuring the effects of Internet path faults on reactive routing. In Proc. Sigmetrics (San Diego, CA, June 2003).

Page 33: Network Measurements in Overlay Networks

Measuring the Effect of Internet Path Faults on Reactive Routing

- Where do failures appear?
- How long do failures last?
- How well do failures correlate with BGP routing instability?

Page 34: Network Measurements in Overlay Networks

Data Collection

- Based on the analysis of data collected for one year on a test bed of 31 hosts
  - Geographically as well as topologically diverse test bed
  - Paths between these hosts traverse more than 50% of the well-connected ASs on the Internet
- Data includes:
  - Active probes between hosts
  - Traceroutes
  - BGP messages collected at 8 locations

Page 35: Network Measurements in Overlay Networks

Active Probing

- An active probe consists of a request packet from the initiator to a target and a reply packet from the target to the initiator
  - Each probe has a 32-bit ID that is logged along with send and receive times
  - A central monitoring machine aggregates logs
  - Post-processing finds all probes received within 60 minutes of when they are sent
- Each host independently initiates a probe to a random target and then sleeps between 1 and 2 seconds
  - Mean time between probes on a particular path is 30 s
  - With 95% probability each path is probed at least once every 80 s
- Failures are defined as 3 or more consecutive lost probes
  - Limits the time resolution of failure detection to a few minutes
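
The failure definition (3 or more consecutive lost probes) can be applied to a per-path probe log along the following lines. This is a sketch under assumptions: the function name and the log format, a time-ordered list of `(send_time, received)` pairs, are hypothetical.

```python
def find_failures(probes, min_run=3):
    """Return (first_lost_send_time, last_lost_send_time) for each failure.

    probes: list of (send_time, received) pairs for one path, sorted by
    send_time. A failure is a run of at least min_run consecutive lost probes.
    """
    failures = []
    run = []  # send times of the current run of lost probes
    for send_time, received in probes:
        if received:
            if len(run) >= min_run:
                failures.append((run[0], run[-1]))
            run = []
        else:
            run.append(send_time)
    # A run that is still open at the end of the log also counts.
    if len(run) >= min_run:
        failures.append((run[0], run[-1]))
    return failures
```

Because a path is probed only about every 30 s, three consecutive losses span on the order of a minute or more, which is why the detection resolution is limited to a few minutes.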

Page 36: Network Measurements in Overlay Networks

Loss-triggered traceroutes

- Path failure indicated by the active prober initiates a single traceroute
  - Traceroute is limited to 30 hops
  - The last reachable IP address is considered the point of failure
- The failure of a traceroute could be due to either the forward or reverse path
  - One-way reachability from active probes ensures that the traceroute measurement corresponds to failure on the forward path
- Measurement hosts periodically push traceroute logs to the central monitoring machine

Page 37: Network Measurements in Overlay Networks

Network Depth Estimation

- Want to determine if a failure occurs near an end host or in the middle of the network
- Assign an estimated network depth to each link based on its connectivity to other network nodes
  - Links between routers and measurement nodes have a network depth of 0
  - Any edge that connects a 0-depth router to other routers has a depth of 1, and so on
  - Edges that can receive more than one value get assigned the smaller value
- By computing the depth of all links, the depth at which a traceroute fails can be estimated
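
The depth labeling above behaves like a breadth-first search started from all measurement hosts at once. The sketch below is one possible reading, not the authors' code: it takes an edge's depth to be the smaller hop distance of its two endpoints from the nearest measurement host, which gives 0 for host-to-router links, 1 for links leaving a router adjacent to a host, and resolves ties toward the smaller value. The function name and input format are hypothetical.

```python
from collections import deque


def edge_depths(edges, hosts):
    """Estimate a depth for each undirected edge.

    edges: list of (u, v) node pairs; hosts: measurement hosts (depth origin).
    Returns a dict mapping each reachable edge to its estimated depth.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Multi-source BFS: hop distance from the nearest measurement host.
    dist = {h: 0 for h in hosts}
    queue = deque(hosts)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)

    # An edge takes the smaller distance of its two endpoints.
    return {(u, v): min(dist[u], dist[v])
            for u, v in edges if u in dist and v in dist}
```

For a chain h1 - r1 - r2 - r3 - h2 with hosts h1 and h2, the two host links get depth 0 and the two router-to-router links get depth 1.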

Page 38: Network Measurements in Overlay Networks

Inferring AS Topology

- Inferring AS topology requires:
  - Mapping interfaces to routers (alias resolution)
  - Assigning routers to ASs
- Alias resolution
  - Based on Rocketfuel’s “Ally” technique
  - A pair of IP addresses is a candidate for alias resolution if they both have the same next or previous hop in a traceroute
  - For each candidate pair the alias resolution test is performed 100 times
  - If the test is positive 80% or more of the time, the two IP addresses are assigned to the canonical ID of the router
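
The candidate selection and the 80%-of-100 vote can be sketched as follows. The alias test itself is not modeled here: `ally_test` is a hypothetical callable supplied by the caller that returns True when one run of the test says two addresses belong to the same router. Function names and the traceroute format (a list of hop addresses) are likewise assumptions.

```python
from itertools import combinations


def candidate_pairs(traceroutes):
    """Pairs of addresses that share a previous hop or a next hop.

    traceroutes: list of hop lists, each a sequence of IP address strings.
    """
    by_prev, by_next = {}, {}
    for hops in traceroutes:
        for i, ip in enumerate(hops):
            if i > 0:
                by_prev.setdefault(hops[i - 1], set()).add(ip)
            if i + 1 < len(hops):
                by_next.setdefault(hops[i + 1], set()).add(ip)

    pairs = set()
    for group in list(by_prev.values()) + list(by_next.values()):
        for a, b in combinations(sorted(group), 2):
            pairs.add((a, b))
    return pairs


def is_alias(pair, ally_test, trials=100, threshold=0.8):
    """Run the test `trials` times; accept if at least 80% are positive."""
    positives = sum(1 for _ in range(trials) if ally_test(*pair))
    return positives / trials >= threshold
```

Repeating the test and thresholding guards against individual runs that give a spurious answer.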

Page 39: Network Measurements in Overlay Networks

Inferring AS Topology

- Routers are assigned to ASs based on the AS’s address space
  - If a router has addresses from more than one AS, it is assigned to the AS with the most addresses and is considered a border router
- Routers that cannot be identified in the above manner are assigned by neighbor router votes
  - If the majority of the links from a router lead into one AS, we assign the router to that AS
  - If the router has links to multiple ASs it is considered a border router
- Routers that cannot be assigned in the above manner are assigned by hand using traceroute information
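
The assignment rules can be read as a two-stage fallback. The sketch below is illustrative only: the function name and inputs are hypothetical, "majority" is taken to mean strictly more than half of the links, and ties among address counts are broken arbitrarily. A return value of `None` for the AS stands for "assign by hand".

```python
from collections import Counter


def assign_router(address_ases, neighbor_ases):
    """Return (as_number_or_None, is_border) for one router.

    address_ases: AS of each interface address that maps to a known AS.
    neighbor_ases: AS that each of the router's links leads into.
    """
    # Stage 1: use the router's own addresses when any can be mapped.
    if address_ases:
        counts = Counter(address_ases)
        best_as, _ = counts.most_common(1)[0]
        return best_as, len(counts) > 1

    # Stage 2: neighbor votes; a strict majority of links decides.
    votes = Counter(neighbor_ases)
    is_border = len(votes) > 1
    if votes:
        best_as, count = votes.most_common(1)[0]
        if count > len(neighbor_ases) / 2:
            return best_as, is_border

    # Stage 3: left for manual assignment from traceroute information.
    return None, is_border
```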

Page 40: Network Measurements in Overlay Networks

BGP Data Collection

- 8 nodes in the test bed collected BGP messages using Zebra 0.92a
  - Configured to see only BGP messages that cause a change in the border router’s choice of best route
- Monitors observe most BGP messages relevant to routing stability

Page 41: Network Measurements in Overlay Networks

Failure Location

Page 42: Network Measurements in Overlay Networks

Failure Location

Page 43: Network Measurements in Overlay Networks

Failure Length

Page 44: Network Measurements in Overlay Networks

Failures after RON

Page 45: Network Measurements in Overlay Networks

Correlating Failures and BGP

Page 46: Network Measurements in Overlay Networks

Correlating Failures and BGP

Page 47: Network Measurements in Overlay Networks

Conclusions

- Failures are more likely to appear within an AS than on the boundary
- 70% of observed failures last less than 5 min, 90% shorter than 15 min
- Failures near the core are more likely to coincide with BGP messages
  - Failures typically precede BGP messages by 4 minutes
- RON can typically route around 50% of path failures
- 20% of the failures masked by RON were preceded by at least one BGP message
  - This suggests that reactive routing can be improved using BGP instability as an indicator of path failures

Page 48: Network Measurements in Overlay Networks

Discussion

- Can passive measurements be used for a RON?
  - How do you guarantee that you have enough data?
  - How do you handle old data?
- How practical are RONs?
  - Are the performance gains worth the overhead?
