56
w w w .caida.o The Impact of Router Outages on the AS-Level Internet 1 Matthew Luckie* - University of Waikato Robert Beverly - Naval Postgraduate School *work started while at CAIDA, UC San Diego SIGCOMM 2017, August 24th 2017

The Impact of Router Outages on the AS-Level Internetconferences.sigcomm.org/sigcomm/2017/files/program/ts-11-3-outages… · The Impact of Router Outages on ... Challenges in topology

Embed Size (px)

Citation preview

w w w .caida.org

The Impact of Router Outages on the AS-Level Internet

1

Matthew Luckie* - University of Waikato Robert Beverly - Naval Postgraduate School

*work started while at CAIDA, UC San Diego

SIGCOMM 2017, August 24th 2017

Internet Resilience

2

CE

PE

PE

CE

PE

CE

CE: Customer Edge PE: Provider Edge

Where are the Single Points of Failure?

Example #A Example #B

Internet Resilience

3

CE

PE

PE

Where are the Single Points of Failure?

If the CE router fails,the network is disconnected,

so the CE router is aSingle Point of Failure (SPoF)

CE: Customer Edge PE: Provider Edge

Example #A

Internet Resilience

4

CE

PE

CE

Where are the Single Points of Failure?

If the CE router fails,the network has an

alternate path available,so the CE router is NOT a

Single Point of Failure (SPoF)

CE: Customer Edge PE: Provider Edge

Example #B

Internet Resilience

5

CE

PE

CE

Where are the Single Points of Failure?

If the PE router fails,the customer network is

disconnected, so the PE router is a Single Point of Failure (SPoF)

CE: Customer Edge PE: Provider Edge

Example #B

Challenges in topology analysis• Prior approaches analyzed static AS-level and router-level

topology graphs,

- e.g.: Nature 2000

• Important AS-level and router-level topology might be invisible to measurement, such as backup paths,

- e.g: INFOCOM 2002

• A router that appears to be central to a network’s connectivity might not be

- e.g.: AMS 2009

6

What we did

Large-scale (Internet-wide) longitudinal (2.5 years) measurement study to characterize prevalence of Single Points of Failure (SPoF):

1. Efficiently inferred IPv6 router outage time windows

2.Associated routers with IPv6 BGP prefixes

3.Correlated router outages with BGP control plane

4.Correlated router outages with data plane

5.Validated inferences of SPoF with network operators

7

What we did

8

Identified IPv6 router interfaces from traceroute

83K to 2.4M interfaces from CAIDA’sArchipelago traceroute measurements

What we did

9

probed router interfaces to infer outage windows

We used a single vantage point located at CAIDA,UC San Diego for the duration of this study

What we did

10

Central counter : 9290

What we did

10

Central counter : 9290 Central counter : 9291

9290

T1: 9290

What we did

10

Central counter : 9290 Central counter : 9291T1: 9290

9291

Central counter : 9292T1: 9290T2: 9291

What we did

10

Central counter : 9290 Central counter : 9291T1: 9290

Central counter : 9292T1: 9290T2: 9291T1: 9290T2: 9291T3: 9292

Central counter : 9293

9292

What we did

10

Central counter : 9290 Central counter : 9291T1: 9290

Central counter : 9292T1: 9290T2: 9291T1: 9290T2: 9291T3: 9292

Central counter : 9293

9293

Central counter : 9294T1: 9290T2: 9291T3: 9292T4: 9293

What we did

10

Central counter : 9290 Central counter : 9291T1: 9290

Central counter : 9292T1: 9290T2: 9291T1: 9290T2: 9291T3: 9292

Central counter : 9293 Central counter : 9294T1: 9290T2: 9291T3: 9292T4: 9293

T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294

9294

Central counter : 9295

What we did

10

Central counter : 9290 Central counter : 9291T1: 9290

Central counter : 9292T1: 9290T2: 9291T1: 9290T2: 9291T3: 9292

Central counter : 9293 Central counter : 9294T1: 9290T2: 9291T3: 9292T4: 9293

T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294

Central counter : 9295

Reboot!

Central counter : 1

What we did

10

Central counter : 9290 Central counter : 9291T1: 9290

Central counter : 9292T1: 9290T2: 9291T1: 9290T2: 9291T3: 9292

Central counter : 9293 Central counter : 9294T1: 9290T2: 9291T3: 9292T4: 9293

T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294

Central counter : 9295 Central counter : 1

1

T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294T6: 1

Central counter : 2

What we did

10

Central counter : 9290 Central counter : 9291T1: 9290

Central counter : 9292T1: 9290T2: 9291T1: 9290T2: 9291T3: 9292

Central counter : 9293 Central counter : 9294T1: 9290T2: 9291T3: 9292T4: 9293

T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294

Central counter : 9295 Central counter : 1T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294T6: 1

Central counter : 2 Central counter : 3

2

T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294T6: 1T7: 2

What we did

10

Central counter : 9290 Central counter : 9291T1: 9290

Central counter : 9292T1: 9290T2: 9291T1: 9290T2: 9291T3: 9292

Central counter : 9293 Central counter : 9294T1: 9290T2: 9291T3: 9292T4: 9293

T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294

Central counter : 9295 Central counter : 1T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294T6: 1

Central counter : 2 Central counter : 3T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294T6: 1T7: 2

3

Central counter : 4T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294T6: 1T7: 2T8: 3

What we did

11

probed router interfaces to infer outage windows using IPID

T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294T6: 1T7: 2T8: 3

Infer a reboot when time series of values returned froma router is discontinuous, indicating router was restarted

Outage Window

Why IPv6 fragment IDs?

• IPv4 Fragment IDs:

- 16 bits, bursty velocity: every packet requires unique ID

- At 100Mbps and 1500 byte packets, Nyquist rate dictates4 second probing interval

• IPv6 Fragment IDs:

- 32 bits, low velocity: IPv6 routers rarely send fragments

- We average 15 minute probing interval

12

What we did

13

correlated routers with prefixes using traceroute paths

What we did

14

2001:db8:1::/48

2001:db8:2::/48correlated routers with prefixes

using traceroute paths

Ark VP

Ark VP

50-60 Ark VPstraceroute every

routed IPv6prefix every day

What we did

14

2001:db8:1::/48

2001:db8:2::/48correlated routers with prefixes

using traceroute paths

Ark VP

Ark VP

50-60 Ark VPstraceroute every

routed IPv6prefix every day

What we did

15

2001:db8:2::/48computed distance of

router from AS announcingnetwork

Ark VP

2001:db8:1::/48

0(CE)

1 (PE)

2

CE: Customer Edge PE: Provider Edge

What we did

16

2001:db8:1::/48

2001:db8:2::/48correlated router outage windows

with BGP control plane

0(CE)

What we did

17

2001:db8:1::/48

2001:db8:2::/48correlated router outage windows

with BGP control plane

T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294T6: 1T7: 2T8: 3

Outage Window

What we did

18

2001:db8:1::/48

2001:db8:2::/48correlated router outage windows

with BGP control plane

T1: 9290T2: 9291T3: 9292T4: 9293T5: 9294T6: 1T7: 2T8: 3

Outage Window

2001:db8:2::/48T5.2: Peer-1 WT5.2: Peer-2 WT5.3: Peer-3 WT5.3: Peer-4 WT5.8: Peer-3 AT5.8: Peer-2 AT5.8: Peer-1 AT5.8: Peer-4 A

RouteViews

classified impact on BGP according to observed activity overlapping with inferred outage

What we did

• Complete Withdrawal: all peers simultaneously withdrew route for at least 70 seconds

- Single Point of Failure (SPoF)

• Partial Withdrawal: at least one peer withdrew route for at least 70 seconds, but not all did

• Churn: BGP activity for the prefix

• No Impact: No observed BGP activity for the prefix

19

Data Collection SummaryWhat we did

• Probed IPv6 routers at ~15 minute intervals from18 Jan 2015 to 30 May 2017 (approx. 2.5 years)

• 149,560 routers allowed reboots to be detected

• We inferred 59,175 (40%) rebooted at least once,750K reboots in total

20

CDF

0.2

0.4

0.6

0.8

1

1 10 100Number of Outages

0

What we found• 2,385 (4%) of routers that rebooted (59K) we inferred

to be SPoF for at least one IPv6 prefix in BGP

• Of SPoF routers, we inferred 59% to be customer edge router ; 8% provider edge; 29% within destination AS

• No covering prefix for 70% of withdrawn prefixes

- During one-week sample, covering prefix presence during withdrawal did not imply data plane reachability

• IPv6 Router reboots correlated with IPv4 BGP control plane activity

21

Limitations• Applicability to IPv4 depends on router being dual-stack• Requires IPID assigned from a counter

- Cisco, Huawei, Vyatta, Microtik, HP assign from counter- 27.1% responsive for 14 days assigned from counter

• Router outage might end before all peers withdraw route- Path exploration + Minimum Route Advertisement Interval

(MRAI) + Route Flap Dampening (RFD) • Complex events: multiple router outages but one detected

- We observed some complex events and filtered them out

22

Validation

23

Reboots SPoFNetwork ✔ ✘ ? ✔ ✘ ?US University 7 0 8 7 0 8US R&E backbone #1 2 0 3 3 2 0US R&E backbone #2 3 0 1 0 0 4NZ R&E backbone 11 0 22 4 2 27Total: 23 0 34 14 4 39✔ = Validated Inference✘ = Incorrect Inference? = Not Validated

Validation

24

Reboots SPoFNetwork ✔ ✘ ? ✔ ✘ ?US University 7 0 8 7 0 8US R&E backbone #1 2 0 3 3 2 0US R&E backbone #2 3 0 1 0 0 4NZ R&E backbone 11 0 22 4 2 27Total: 23 0 34 14 4 39

Challenging to get validation data: operators oftencould only tell us about the last reboot

Validation

25

Reboots SPoFNetwork ✔ ✘ ? ✔ ✘ ?US University 7 0 8 7 0 8US R&E backbone #1 2 0 3 3 2 0US R&E backbone #2 3 0 1 0 0 4NZ R&E backbone 11 0 22 4 2 27Total: 23 0 34 14 4 39

No falsely inferred reboots: we correctly observedthe last known reboot of each router

Validation

26

Reboots SPoFNetwork ✔ ✘ ? ✔ ✘ ?US University 7 0 8 7 0 8US R&E backbone #1 2 0 3 3 2 0US R&E backbone #2 3 0 1 0 0 4NZ R&E backbone 11 0 22 4 2 27Total: 23 0 34 14 4 39

We did not detect some SPoFs

Data Collection Summary

27

83K

41.8K15.2K

46.5K

(b)~1.1M

79.8K(a) (c)

23.5K

10KJan ’17Jul ’16

AllIncrementing

Jan ’16Jul ’15Jan ’15

3M

1M

100K

30KNum

ber o

f Int

erfa

ces

PPS List Unresponsive(a) 100 Static 83K 12-24 hours(b) 225 Static 1.1M 12-24 hours(c) 200 Dynamic, ~2.4M 7-14 days

Control: six hours prior to inferred outages, Feb 2015Correlating BGP/router outages

28

2 1 0 −1 −2 −3 −4 −5Distance of Router from Destination AS (IP hops)

Frac

tion

of R

eboo

t/Pre

fix P

airs

0

0.1

Churn

0.2

0.3

0.4

0.5

12 11 10

Partial Withdrawal

9 8 7 6 5 4 3

Complete WithdrawalInsideDest.

ASOutsideDest. AS

During the inferred outages, Feb 2015Correlating BGP/router outages

29

2 1 0 −1 −2 −3 −4 −5Distance of Router from Destination AS (IP hops)

Frac

tion

of R

eboo

t/Pre

fix P

airs

0

0.1

Churn

0.2

0.3

0.4

0.5

12 11 10

Partial Withdrawal

9 8 7 6 5 4 3

Complete WithdrawalInsideDest.

ASOutsideDest. AS

BGP Prefix Withdrawals: SPoF

30

max

0.2

0.4

0.6

0.8

1

1min

5min

15min

30min

1hr

2hr

4hr

8hr

16hr

CDF

Complete Withdrawal Duration

min

0

44% less than 5 minutes, suggestive of router maintenance or router crash

SPoF prefixes mostly single homed

31

Router hopdistance

PECE

1

−2

0−1

1

Prefix announced through a single upstream

Prefix announced through multiple upstreams

0.20

23

Fraction of Population

0.4 0.6 0.8

−3

EspeciallySPoFs outsidedestination AS,as expected

Impact on IPv4 prefixes in BGP

32

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.2 0.4 0.6 0.8 1

Cum

ulat

ive

Frac

tion

Rou

ter O

utag

es

Withdrawn Peers/Advertising Peers

Before OutageDuring Outage

We examined IPv4 prefixes for 5% sample of reboots.19% of correlated IPv4 prefixes withdrawn by at least 90% of peers during router outage window.

Control

Outage

Summary• Step towards root-cause analysis of

inter-domain routing outages and events

- Explore applicability of method to measurement of other critical Internet infrastructure: DNS, Web, Email

• In our 2.5 year sample of 59K routers that rebooted

- 4% (2.3K) were SPoF

- SPoF were mostly confined to the edge: 59% customer edge

• We released our code as part of scamper

33https://www.caida.org/tools/measurement/scamper/

2 1 0 −1 −2 −3 −4 −5Distance of Router from Destination AS (IP hops)

Frac

tion

of R

eboo

t/Pre

fix P

airs

0

0.1

Churn

0.2

0.3

0.4

0.5

12 11 10

Partial Withdrawal

9 8 7 6 5 4 3

Complete Withdrawal

Backup Slides

34

Impact on IPv4 Services

35

We examined IPv4 prefixes for 5% sample of reboots where at least 90% of peers during router outage window.

Active Hosts 39,107HTTP 25,592HTTPS 16,321

SSH 11,277DNS 7,922SMTP 7,383IMAP 5,127

censys.io April 2017

Web}

Email}

Partial Withdrawals

36

50% of pairs had1−2 peers withdraw

nearly all peers withdraw10% of pairs had

0

1

0 0.2 0.4 0.6 0.8 1Fraction of Peers Withdrawing Route

CD

F 0.6

0.4

0.2

0.8

50% of pairs had 1-2 peers withdraw prefix 10% of pairs had nearly all peers withdraw prefix

Degrees of ASes monitored

37

0

0.2

0.4

0.6

0.8

1

1 10 100 1000

Cum

ulat

ive

Frac

tion

of R

eboo

ting

ASes

AS Degree

Single Points of FailureMonitored Population

ASes that were inferred to have a SPoF were disproportionately low-degree ASes

Activity for IPv4 prefixes in BGP

38

0

0.2

0.4

0.6

0.8

1

0 0.2 0.4 0.6 0.8 1

Cum

ulat

ive

Frac

tion

Rou

ter O

utag

es

Peers Sending Updates/Total Peers

Before OutageDuring Outage

At least 70% of peers reported BGP activity on IPv4prefixes for 50% of the inferred router outages

Reboot Window Durations

39

max

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

1

1min

5min

15min

30min

1hr

2hr

4hr

8hr

16hr

Reboot Window Duration

CDF

min

0

Half the maximum reboot lengths were less than 30 minutes (~two probing rounds)

Router + BGP outage correlation

40

Outage Window

Withdraw-Contained

10, 11, 12 1, 2, 3

W A

Router IP-ID Sequence:

Outage-ContainedW A

Withdraw-BeforeW A

Announce-AfterW A

BGP Sequence:

Data processing pipeline

41

UptimeProber

Cassandra

CAIDAIPv6Topology

rtr targets

<ip,time,ipid>

RouteViews

<peer,time,prefix>

InferredReboots

BGPCorrelation

single pointsof failure

AS borderdistance

Inferring router position

42

x1R1 R3

x2 x3R2 R5

y1 y2R4

0 -1 -212(a) interface addresses routed by Y appear in traceroute

x1R1 R3

x2 x3R2 R5R4

012(b) no interface addresses routed by Y appear in traceroute

Customer Edge(CE) Router

Provider Edge(PE) Router

? ?

AS X AS Y

AS X AS Y

Data Collection Summary

43

18 Jan ’1518 Oct ’16

(a)

18 Oct ’1624 Feb ’17

(b)

24 Feb ’1730 May ’17

(c)

Probing rate 100 pps 225 pps 200 pps

Interfaces 83K seen Dec ‘14

1.1M seen Jun to Oct ’16

Dynamic. 2.4Min May ‘17

Responsive every round~15 mins

every round~15 mins

every round~15 mins

Unresponsive 12-24 hours 12-24 hours 7-14 days

Why IPv6 fragment IDs?

44

At 100Mbps and 1500 byte packets.Nyquist rate dictates a 4 second probing interval

IPv4 ID values are 16 bits with bursty velocity as every packet requires a unique value.

source addressdestination address

TTL protocol checksumidentification

Ver DSCP lengthHLoffset

Why IPv6 fragment IDs?

45

IPv6 ID values are 32 bits with low velocityas systems rarely send fragmented packets.

source address

destination address

protocol TTL

identification

Ver DSCP flow idpayload length

protocol offsetreserved

Soliciting IPv6 Fragment IDs

46

echo request, 1300 bytes

packet too big, MTU 1280

echo reply, 1300 bytes

echo request, 1300 bytes

echo reply, 1280 bytes

Fragment ID: 12345