23
Traffic Engineering with Forward Fault Correction (FFC) Hongqiang “Harry” Liu , Srikanth Kandula, Ratul Mahajan, Ming Zhang, David Gelernter (Yale University) 1

Traffic Engineering with Forward Fault Correction (FFC)

Embed Size (px)

DESCRIPTION

Traffic Engineering with Forward Fault Correction (FFC). Hongqiang “Harry” Liu , Srikanth Kandula, Ratul Mahajan, Ming Zhang, David Gelernter (Yale University). Cloud services require large network capacity. Cloud Services. Growing traffic. Cloud Networks. Expensive - PowerPoint PPT Presentation

Citation preview

Page 1: Traffic Engineering with Forward Fault Correction (FFC)

Traffic Engineering with Forward Fault Correction (FFC)

Hongqiang “Harry” Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang,

David Gelernter (Yale University)

1

Page 2: Traffic Engineering with Forward Fault Correction (FFC)

Cloud services require large network capacity

2

Cloud Services

Cloud NetworksGrowing traffic

Expensive(e.g. cost of WAN: $100M/year)

Page 3: Traffic Engineering with Forward Fault Correction (FFC)

TE is critical to effectively utilizing networks

3

Traffic Engineering

WAN Network

• Microsoft SWAN• Google B4• ……

Datacenter Network

• Devoflow• MicroTE• ……

Page 4: Traffic Engineering with Forward Fault Correction (FFC)

But, TE is also vulnerable to faults

4

TE controller

Network

Network view

Network configuration

Frequent updates for high utilization

Control-plane faults

Data-plane faults

Page 5: Traffic Engineering with Forward Fault Correction (FFC)

Control plan faults

5

Failures or long delays to configure a network device

TE Controller Switch

TE configurations

Memory shortage

RPC failure

Firmware bugs Overloaded CPU

Page 6: Traffic Engineering with Forward Fault Correction (FFC)

Congestion due to control plane faults

s1

s2

s3

s4

Link Capacity: 10 7

3

3

7

10s1

10

10

10

10

s2

s3

s4

6

New Flows (traffic demands):s1 s2 (10)s1 s3 (10)s1 s4 (10)

Configuration failure

Congestions1

7

3

10

10

10

s2

s3

s4

10

s2 s4 (10)

s3 s4 (10)

Page 7: Traffic Engineering with Forward Fault Correction (FFC)

Data plane faults

7

Link and switch failures

s1

7

3

3

7

s2

s3

s4

link failure link failure

s110

7

s2

s3

s43congestion

Rescaling: Sending traffic proportionally to residual paths

Link Capacity: 10

s2 s4 (10)

s3 s4 (10)

Page 8: Traffic Engineering with Forward Fault Correction (FFC)

Control and data plane faults in practice

8

Control plane: fault rate = 0.1% -- 1% per TE update.

Data plane: fault rate = 25% per 5 minutes.

In production networks:• Faults are common.• Faults cause severe congestion.

Page 9: Traffic Engineering with Forward Fault Correction (FFC)

State of the art for handling faults

•Heavy over-provisioning

•Reactive handling of faults• Control plane faults: retry • Data plane faults: re-compute TE and update networks

9

Cannot prevent congestion

Slow(seconds -- minutes)

Blocked by control plane faults

Big loss in throughput

Page 10: Traffic Engineering with Forward Fault Correction (FFC)

10

How about handling congestion proactively?

Page 11: Traffic Engineering with Forward Fault Correction (FFC)

Forward fault correction (FFC) in TE

11

• [Bad News] Individual faults are unpredictable.• [Good News] Simultaneous #faults is small.

Network faults

Packet loss

FFC guarantees no congestion under up to k arbitrary faults.

FEC guarantees no information loss under up to k arbitrary packet drops.

with careful data encoding

with careful traffic distribution

Page 12: Traffic Engineering with Forward Fault Correction (FFC)

Example: FFC for control plane faults

s1

s2

s3

s4

Link Capacity: 10 7

3

3

7

10s1

10

10

10

10

s2

s3

s4

Configuration failure

Congestions1

7

3

10

10

10

s2

s3

s4

10

Non-FFC

12

Page 13: Traffic Engineering with Forward Fault Correction (FFC)

Example: FFC for control plane faults

s1

s2

s3

s4

Link Capacity: 10 7

3

3

7

13

Configuration failure

s1

7

3

10

10

10

s2

s3

s4

7

s1

s2

s3

s4

10

10

10

10

7

Control Plane FFC (k=1)

Configuration failure

s1

10

7

10

10

s2

s3

s47

3

Page 14: Traffic Engineering with Forward Fault Correction (FFC)

Trade-off: network utilization vs. robustness

10s1

10

10

10

10

s2

s3

s4

Non-FFC

s1

s2

s3

s4

10

10

10

10

7

K=1 (Control Plane FFC)

s1

10

4

10

s2

s3

s4

10

10

K=2 (Control Plane FFC)

14

Throughput: 44Throughput: 47Throughput: 50

Page 15: Traffic Engineering with Forward Fault Correction (FFC)

Systematically realizing FFC in TE

15

Formulation:How to merge FFC into existing TE framework?

Computation:How to find FFC-TE efficiently?

Page 16: Traffic Engineering with Forward Fault Correction (FFC)

Basic TE linear programming formulations

16

max.

Sizes of flows

control plane faults

Deliver all granted flows

No overloaded link

FFC constraints:

Maximizing throughput

No overloaded link up to link failures

switch failures

TE decisions:Traffic on paths

Basic TE constraints:

TE objective:

𝑏𝑓

𝑙 𝑓 , 𝑡

s.t.

LP formulations

Page 17: Traffic Engineering with Forward Fault Correction (FFC)

Formulating control plane FFC

17

s1 s2

𝑓 1

𝑓 2

𝑓 3

𝑙1𝑛𝑒𝑤+𝑙2

𝑛𝑒𝑤+ 𝑙3𝑜𝑙𝑑≤ link cap

𝑙1𝑛𝑒𝑤+𝑙2

𝑜𝑙𝑑+𝑙3𝑛𝑒𝑤 ≤ link cap

𝑙1𝑜𝑙𝑑+𝑙2

𝑛𝑒𝑤+𝑙3𝑛𝑒𝑤 ≤ link cap

(𝟑𝟏)’s load in old TE ’s load in new TE

Fault on :

Fault on :

Fault on :

Total load under faults?

With n flows and FFC protection k: #constraints = for each link.

Challenge: too many constraints

Page 18: Traffic Engineering with Forward Fault Correction (FFC)

An efficient and precise solution to FFC

19

Our approach:A lossless compression from O() constraints to O(kn) constraints.

Given , FFC requires thatthe sum of arbitrary k elements in is link spare capacity

O()

Define as the mth largest element in : link spare capacity

Expressing with ?

O()

: additional load due to fault-i

Total load under faults link capacity

Total additional load due to faults link spare capacity

Page 19: Traffic Engineering with Forward Fault Correction (FFC)

Sorting network

19

𝑥1𝑥2𝑥3𝑥4

A comparison𝑥1𝑥2

=max{}

=min{}

𝑧1𝑧 2𝑧 3

𝑧 4 𝑧5 (1st largest) (2nd largest)

𝑧 6𝑧7𝑧 8

1st round 2nd round

• Complexity: O(kn) additional variables and constraints.

• Throughput: optimal in control-plane and data plane if paths are disjoint.

+ link spare capacity

Page 20: Traffic Engineering with Forward Fault Correction (FFC)

FFC extensions

• Differential protection for different traffic priorities

•Minimizing congestion risks without rate limiters

• Control plane faults on rate limiters

• Uncertainty in current TE

• Different TE objectives (e.g. max-min fairness)

• …20

Page 21: Traffic Engineering with Forward Fault Correction (FFC)

Evaluation overview

•Testbed experiment• FFC can be implemented in commodity switches• FFC has no data loss due to congestion under faults

• Large-scale simulation

21

A WAN network with O(100) switches and O(1000) links

Injecting faults based on real failure reports

Single priority traffic in a well-provisioned network

Multiple priority traffic in a well-utilized network

Page 22: Traffic Engineering with Forward Fault Correction (FFC)

FFC prevents congestion with negligible throughput loss

FFC Throughput / Optimal Throughput

40

60

20

0

Rati

o (%

)

22

80

100

High priority (High FFC protection)Medium priority (Low FFC protection)Low priority (No FFC protection)

Single priority

FFC Data-loss / Non-FFC Data-loss

160%

<0.01%

Page 23: Traffic Engineering with Forward Fault Correction (FFC)

Conclusions

• Centralized TE is critical to high network utilization but is vulnerable to control and data plane faults.

• FFC proactively handle these faults.• Guarantee: no congestion when #faults ≤ k.• Efficiently computable with low throughput overhead in practice.

23

Heavy network over-provisioning

High risk of congestion

FFC