TCP CONGESTION SIGNATURES - SIGCOMMTCP’s RTT Congestion Signatures • Flows experiencing...

Preview:

Citation preview

w w w .caida.org

TCP CONGESTION SIGNATURES

1

Srikanth Sundaresan (Princeton Univ.)Amogh Dhamdhere (CAIDA/UCSD)

kc Claffy (CAIDA/UCSD)Mark Allman (ICSI)

w w w .caida.or

Typical Speed Tests Don’t Tell Us Much

2

w w w .caida.or

Typical Speed Tests Don’t Tell Us Much

2

w w w .caida.or

Typical Speed Tests Don’t Tell Us Much

2

w w w .caida.or

Typical Speed Tests Don’t Tell Us Much

• Upload and download throughput measurements: no information beyond that

2

w w w .caida.or

Typical Speed Tests Don’t Tell Us Much

2

What type of congestion did the TCP flow experience?

w w w .caida.or

Two Potential Sources of Congestion in the End-to-end Path

3

w w w .caida.or

Two Potential Sources of Congestion in the End-to-end Path

• Self-induced congestion

- Clear path, the flow itself induced congestion

- eg: last-mile access link

3

w w w .caida.or

Two Potential Sources of Congestion in the End-to-end Path

• Self-induced congestion

- Clear path, the flow itself induced congestion

- eg: last-mile access link

• External congestion

- Flow starts on an already congested path

- eg: congested interconnect

3

w w w .caida.or

Two Potential Sources of Congestion in the End-to-end Path

• Self-induced congestion

- Clear path, the flow itself induced congestion

- eg: last-mile access link

• External congestion

- Flow starts on an already congested path

- eg: congested interconnect

3

Distinguishing the two cases has implications for users / ISPs / regulators

w w w .caida.or

How can we distinguish the two?

• Cannot distinguish using just throughput numbers

- Access plan rates vary widely, and are typically not available to content / speed test providers

- eg: Speed test reports 5 Mbps – is that the access link rate (DSL), or a congested path?

4

w w w .caida.or

How can we distinguish the two?

• Cannot distinguish using just throughput numbers

- Access plan rates vary widely, and are typically not available to content / speed test providers

- eg: Speed test reports 5 Mbps – is that the access link rate (DSL), or a congested path?

4

We can use the dynamics of TCP’s startup phase, i.e., Congestion Signatures

w w w .caida.or

TCP’s RTT Congestion Signatures

5

w w w .caida.or

TCP’s RTT Congestion Signatures

• Flows experiencing self-induced congestion fill up an empty buffer during slow start

- Hence increase the TCP flow RTT

5

w w w .caida.or

TCP’s RTT Congestion Signatures

• Flows experiencing self-induced congestion fill up an empty buffer during slow start

- Hence increase the TCP flow RTT

• Externally congested flows encounter an already full buffer

- Less potential for RTT increases

5

w w w .caida.or

TCP’s RTT Congestion Signatures

• Flows experiencing self-induced congestion fill up an empty buffer during slow start

- Hence increase the TCP flow RTT

• Externally congested flows encounter an already full buffer

- Less potential for RTT increases

• Self-induced congestion therefore has higher RTT variance compared to external congestion

5

w w w .caida.or

TCP’s RTT Congestion Signatures

• Flows experiencing self-induced congestion fill up an empty buffer during slow start

- Hence increase the TCP flow RTT

• Externally congested flows encounter an already full buffer

- Less potential for RTT increases

• Self-induced congestion therefore has higher RTT variance compared to external congestion

5

We can quantify this using Max-Min and CoV of RTT

w w w .caida.or

Example Controlled Experiment• 20 Mbps “access” link

with 100 ms buffer

• 1 Gbps “interconnect” link with 50 ms buffer

• Self-induced congestion flows have higher values for both metrics and are clearly distinguishable

Max-Min RTT

6

CoV RTT

101 1020.0

0.2

0.4

0.6

0.8

1.0

CDF

ExternalSelf

10−2 10−1 1000.0

0.2

0.4

0.6

0.8

1.0

CDF

ExternalSelf

w w w .caida.or

Example Controlled Experiment• 20 Mbps “access” link

with 100 ms buffer

• 1 Gbps “interconnect” link with 50 ms buffer

• Self-induced congestion flows have higher values for both metrics and are clearly distinguishable

The two types of congestion exhibit widely contrasting behaviors

Max-Min RTT

6

CoV RTT

101 1020.0

0.2

0.4

0.6

0.8

1.0

CDF

ExternalSelf

10−2 10−1 1000.0

0.2

0.4

0.6

0.8

1.0

CDF

ExternalSelf

w w w .caida.or

Model

• Max-min and CoV of RTT derived from RTT samples during slow start

• We feed the two metrics into a simple Decision Tree

- We control the depth of the tree to a low value to minimize complexity

• We build the decision tree classifier using controlled experiments and apply it to real-world data

7

w w w .caida.or

Validating the Method: Step 1- Controlled Experiments

8

Internet

R2R1

Server 1Server 2

Server 3

Server 4

Pi 1

Pi 2

100 Mbps

Shaped “access”

1 Gbps

w w w .caida.or

Validating the Method: Step 1- Controlled Experiments

8

Internet

R2R1

Server 1Server 2

Server 3

Server 4

Pi 1

Pi 2

100 Mbps

Shaped “access”

1 Gbps

Background cross-traffic

Interconnectcross-traffic

w w w .caida.or

Validating the Method: Step 1- Controlled Experiments

8

Internet

R2R1

Server 1Server 2

Server 3

Server 4

Pi 1

Pi 2

100 Mbps

Shaped “access”

1 Gbps

Throughput tests

Background cross-traffic

Interconnectcross-traffic

w w w .caida.or

It’s Real

9

w w w .caida.or

It’s Real

9

FantasticCablingeffort

Post-itdefined

networking

w w w .caida.or

Validating the Method: Step 1- Controlled Experiments

• Emulated access link + “core” link

- Wide range of access link throughputs, buffer sizes, loss rates, cross-traffic (background and congestion-inducing)

- Can accurately label flows in training data as “self ” or “externally” congested

10

Internet

R2R1

Server 1Server 2

Server 3

Server 4

Pi 1

Pi 2

100 Mbps

Shaped “access”

1 Gbps

Throughput tests

Background cross-traffic

Interconnectcross-traffic

w w w .caida.or

Validating the Method: Step 1- Controlled Experiments

11

Internet

R2R1

Server 1Server 2

Server 3

Server 4

Pi 1

Pi 2

100 Mbps

Shaped “access”

1 Gbps

Throughput tests

Background cross-traffic

Interconnectcross-traffic

High accuracy: precision and recall > 80%robust to model settings

w w w .caida.or

Validating the Method: Step 2

• From Ark VP in ISP A identified congested link with ISP B using TSLP*

12*Luckie et al. “Challenges in Inferring Internet Interdomain Congestion”, IMC 2014

ISP A

ISP B

Ark VP

w w w .caida.or

Validating the Method: Step 2

• From Ark VP in ISP A identified congested link with ISP B using TSLP*

12*Luckie et al. “Challenges in Inferring Internet Interdomain Congestion”, IMC 2014

ISP A

ISP B

congested link

Ark VP

w w w .caida.or

Validating the Method: Step 2

13

ISP A

ISP B

M-lab NDT server

congested link

Ark VP

• Periodic NDT tests from Ark VP to M-Lab NDT server “behind” the congested interdomain link

w w w .caida.or

Validation of the Method: Step 2

14

0 5

10 15 20 25 30

02/18 02/25 03/04 03/11

d/l M

bps

10 20 30 40 50 60 70

02/18 02/25 03/04 03/11TSLP

late

ncy

(far s

ide)

Strong correlation between throughput and TSLP latency: flows during elevated TSLP latency

labeled as “externally” congested

w w w .caida.or

Validation of the Method: Step 2

14

0 5

10 15 20 25 30

02/18 02/25 03/04 03/11

d/l M

bps

10 20 30 40 50 60 70

02/18 02/25 03/04 03/11TSLP

late

ncy

(far s

ide)

Strong correlation between throughput and TSLP latency: flows during elevated TSLP latency

labeled as “externally” congested

“Externally”congested

w w w .caida.or

Validation of the Method: Step 2

14

0 5

10 15 20 25 30

02/18 02/25 03/04 03/11

d/l M

bps

10 20 30 40 50 60 70

02/18 02/25 03/04 03/11TSLP

late

ncy

(far s

ide)

Strong correlation between throughput and TSLP latency: flows during elevated TSLP latency

labeled as “externally” congested

“Externally”congested

“self”congested

w w w .caida.or

Validation of the Method: Step 2

15

0 5

10 15 20 25 30

02/18 02/25 03/04 03/11

d/l M

bps

10 20 30 40 50 60 70

02/18 02/25 03/04 03/11TSLP

late

ncy

(far s

ide)

75%+ accuracy in detecting external congestion, 100% accuracy for self-induced congestion

w w w .caida.or

Validation of the Method: Step 3

• We use Measurement Lab’s NDT test data for real-world validation

• Cogent interconnect issue in late 2013/early 2014

- NDT tests to Cogent servers saw significant drops in throughput during peak hours

- Several major U.S. ISPs were affected, except Cox

- The problem was identified as congested interconnects

16

w w w .caida.or

Using the M-lab Data

17

January 2014

April 2014

0 5 10 15 20

Hour of day (local)

10

20

30

40

Mbp

s

ComcastCox

TimeWarnerVerizon

0 5 10 15 20

Hour of day (local)

10

20

30

40

Mbp

s

ComcastCox

TimeWarnerVerizon

w w w .caida.or

Using the M-lab Data

17

January 2014

Drop in peak-hour throughput for for Comcast, TWC, Verizon

April 2014

0 5 10 15 20

Hour of day (local)

10

20

30

40

Mbp

s

ComcastCox

TimeWarnerVerizon

0 5 10 15 20

Hour of day (local)

10

20

30

40

Mbp

s

ComcastCox

TimeWarnerVerizon

w w w .caida.or

Using the M-lab Data

17

January 2014

Cox not affected

Drop in peak-hour throughput for for Comcast, TWC, Verizon

April 2014

0 5 10 15 20

Hour of day (local)

10

20

30

40

Mbp

s

ComcastCox

TimeWarnerVerizon

0 5 10 15 20

Hour of day (local)

10

20

30

40

Mbp

s

ComcastCox

TimeWarnerVerizon

w w w .caida.or

Using the M-lab Data

17

January 2014

Cox not affected

Drop in peak-hour throughput for for Comcast, TWC, Verizon

April 2014

Interconnection dispute resolved; no diurnal effect

0 5 10 15 20

Hour of day (local)

10

20

30

40

Mbp

s

ComcastCox

TimeWarnerVerizon

0 5 10 15 20

Hour of day (local)

10

20

30

40

Mbp

s

ComcastCox

TimeWarnerVerizon

w w w .caida.or

Using the M-lab Data

18

Peak hour tests inJan/Feb 2014 are likely “externally” congested

Off-peak tests in Mar/Apr 2014 are likely

“self” congested

0 5 10 15 20

Hour of day (local)

10

20

30

40

Mbp

s

ComcastCox

TimeWarnerVerizon

0 5 10 15 20

Hour of day (local)

10

20

30

40

Mbp

s

ComcastCox

TimeWarnerVerizon

w w w .caida.or

But didn’t you just say it’s hard to infer congestion using throughput tests??

19

w w w .caida.or

But didn’t you just say it’s hard to infer congestion using throughput tests??

• Yes :)

19

w w w .caida.or

But didn’t you just say it’s hard to infer congestion using throughput tests??

• Yes :)

• For that reason, our labeling is broad and coarse. All tests labeled “external” may not be traversing congested interconnects

19

w w w .caida.or

But didn’t you just say it’s hard to infer congestion using throughput tests??

• Yes :)

• For that reason, our labeling is broad and coarse. All tests labeled “external” may not be traversing congested interconnects

• We do not expect the technique to identify all peak hour tests as externally congested, and vice versa

- Looking for qualitative differences

19

w w w .caida.or

But didn’t you just say it’s hard to infer congestion using throughput tests??

• Yes :)

• For that reason, our labeling is broad and coarse. All tests labeled “external” may not be traversing congested interconnects

• We do not expect the technique to identify all peak hour tests as externally congested, and vice versa

- Looking for qualitative differences

• The general observations about congestion were verified by other sources, e.g., CAIDA’s TSLP measurements

19

w w w .caida.or

Applying the Model to M-lab data

20

Comcast

TimeWarnerVerizon Cox

Comcast

TimeWarnerVerizon Cox

Comcast

TimeWarnerVerizon Cox

0.2

0.4

0.6

0.8

1.0

%se

lf-in

duce

dco

nges

tion

Cogent (LAX) Cogent (LGA) Level3 (ATL)

Jan-Feb Mar-Apr

w w w .caida.or

Applying the Model to M-lab data

20

Comcast

TimeWarnerVerizon Cox

Comcast

TimeWarnerVerizon Cox

Comcast

TimeWarnerVerizon Cox

0.2

0.4

0.6

0.8

1.0

%se

lf-in

duce

dco

nges

tion

Cogent (LAX) Cogent (LGA) Level3 (ATL)

Jan-Feb Mar-Apr

Much lower incidences of self-induced congestion for Cogent in Jan/Feb 2014 as

compared to Mar/Apr

w w w .caida.or

Applying the Model to M-lab data

21

Comcast

TimeWarnerVerizon Cox

Comcast

TimeWarnerVerizon Cox

Comcast

TimeWarnerVerizon Cox

0.2

0.4

0.6

0.8

1.0

%se

lf-in

duce

dco

nges

tion

Cogent (LAX) Cogent (LGA) Level3 (ATL)

Jan-Feb Mar-Apr

Level3 does not show significant differences, was not affected by

interconnection disputes

w w w .caida.or

Applying the Model to M-lab data

22

Comcast

TimeWarnerVerizon Cox

Comcast

TimeWarnerVerizon Cox

Comcast

TimeWarnerVerizon Cox

0.2

0.4

0.6

0.8

1.0

%se

lf-in

duce

dco

nges

tion

Cogent (LAX) Cogent (LGA) Level3 (ATL)

Jan-Feb Mar-Apr

Cox does not show significant differences, was not affected by interconnection disputes

w w w .caida.or

Looking at Throughput

• What throughput should we observe for “self ” and “external” congested flows?

• With congested interconnects affecting many flows, both “self ” and “external” should see similar throughput

• Without congested interconnects affecting many flows, “self ” congested throughput should follow access link speeds, generally higher than “externally” congested

23

w w w .caida.or

Looking at Throughput

• Avg. throughput of self-induced congestion flows significantly higher than externally congested in Mar-Apr (no interconnection disputes)

24

Comcast TimeWarner Verizon Cox0

5

10

15

20

Thro

ughp

ut(M

bps)

Jan-Feb SelfJan-Feb External

Mar-Apr SelfMar-Apr External

w w w .caida.or

Looking at Throughput

• Avg. throughput of self-induced congestion flows significantly higher than externally congested in Mar-Apr (no interconnection disputes)

24

Comcast TimeWarner Verizon Cox0

5

10

15

20

Thro

ughp

ut(M

bps)

Jan-Feb SelfJan-Feb External

Mar-Apr SelfMar-Apr External

Both “self”and

“external”get similarthroughput

w w w .caida.or

Looking at Throughput

• Avg. throughput of self-induced congestion flows significantly higher than externally congested in Mar-Apr (no interconnection disputes)

25

Comcast TimeWarner Verizon Cox0

5

10

15

20

Thro

ughp

ut(M

bps)

Jan-Feb SelfJan-Feb External

Mar-Apr SelfMar-Apr External

w w w .caida.or

Looking at Throughput

• Avg. throughput of self-induced congestion flows significantly higher than externally congested in Mar-Apr (no interconnection disputes)

25

Comcast TimeWarner Verizon Cox0

5

10

15

20

Thro

ughp

ut(M

bps)

Jan-Feb SelfJan-Feb External

Mar-Apr SelfMar-Apr External

“Self” gethigher

throughputthan

“external”

w w w .caida.or

Takeaways

• It is possible to distinguish two kinds of congestion: self-induced vs. externally congested

• The difference is important to identify the solution

- Upgrade service plan? Or talk to ISP?

- Also for regulatory purposes

• Simple, accurate technique using RTT during TCP slow start dynamics

- Can be easily computed using packet captures or other tools such as Web100 (future work)

26

w w w .caida.or

Limitations

• Relies on buffering effect

- May not work on TCP variants that minimize buffer occupancy, e.g., BBR

• Only uses slow start dynamics

- Might be confounded by flows that perform one way during slow start but differently afterward

• Real-world validation relies on coarsely labeled data

- It would be great to validate on more real-world data!

27

w w w .caida.or

Thanks!Questions?

28

Recommended