1
Tutorial Survey of LL-FC Methods for Datacenter Ethernet
Flow Control 101
M. Gusat
Contributors: Ton Engbersen, Cyriel Minkenberg, Ronald Luijten and Clark Jeffries
26 Sept. 2006, IBM Zurich Research Lab
2
Outline
• Part I
– Requirements of datacenter link-level flow control (LL-FC)
– Brief survey of the top 3 LL-FC methods: PAUSE (aka On/Off grants), credits, rate
– Baseline performance evaluation
• Part II
– Selectivity and scope of LL-FC
– per-what?: LL-FC's resolution
3
Req’ts of .3x’: Next Generation of Ethernet Flow Control for Datacenters
1. Lossless operation
– No-drop expectation of datacenter apps (storage, IPC)
– Low latency
2. Selective
– Discrimination granularity: link, prio/VL, VLAN, VC, flow...?
– Scope: backpressure upstream one hop, k hops, e2e...?
3. Simple...
– PAUSE-compatible!
4
Generic LL-FC System
• One link with 2 adjacent buffers: TX (SRC) and RX (DST)
– The round-trip time (RTT) per link is the system's time constant
• LL-FC issues:
– link traversal (channel Bw allocation)
– RX buffer allocation
– pairwise communication between the channel's terminations
– signaling overhead (PAUSE, credit, rate commands)
– backpressure (BP): increase/decrease injections, stop-and-restart protocol
5
FC-Basics: PAUSE (On/Off Grants)
[Figure: a crossbar (Xbar) with TX queues (OQ) feeding the data link and downstream links, and the RX buffer (OQ) with a threshold. Crossing the threshold ("over-run") sends STOP on the FC return path; draining below it sends GO. PAUSE BP semantics: STOP / GO / STOP ...]
* Note: Selectivity and granularity of FC domains are not considered here.
6
FC-Basics: Credits
[Figure: the analogous crossbar (Xbar) setup with credit-based LL-FC.]
* Note: Selectivity and granularity of FC domains are not considered here.
7
Correctness: Min. Memory for “No Drop”
"Minimum“: to operate lossless => O(RTTlink)– Credit : 1 credit = 1 memory location– Grant : 5 (=RTT+1) memory locations
Credits– Under full load the single credit is constantly looping between RX and TX
RTT=4 => max. performance = f(up-link utilisation) = 25% Grants
– Determined by slow restart: if last packet has left the RX queue, it takes an RTT until the next packet arrives
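The 25% figure is just the duty cycle of a single credit: the credit leaves with a packet and needs a full RTT (here 4 timesteps) to come back, so one credit allows at most one packet per RTT. A minimal discrete-time sketch of that loop (my own model and naming, not the slides' simulator) shows the same 1/RTT behaviour, and full utilisation once the TX holds at least RTT credits:

```python
# Sketch of a credit-flow-controlled link: a credit spent when the TX sends a packet
# returns to the TX one RTT later (downstream traversal + RX drain + return path).
def credit_link_utilisation(credits, rtt, steps=10_000):
    """Fraction of timesteps in which a saturated (greedy) TX can send a packet."""
    available = credits            # credits currently held by the TX
    outstanding = []               # timesteps at which spent credits come back
    sent = 0
    for t in range(steps):
        available += sum(1 for r in outstanding if r == t)   # round trips complete
        outstanding = [r for r in outstanding if r > t]
        if available > 0:          # greedy TX: send whenever a credit is on hand
            available -= 1
            outstanding.append(t + rtt)
            sent += 1
    return sent / steps

if __name__ == "__main__":
    for c in (1, 2, 4, 5):
        print(f"credits={c}, RTT=4 -> utilisation ≈ {credit_link_utilisation(c, 4):.2f}")
    # Expected: 1 credit -> ~0.25, 4 or more credits (M >= RTT) -> ~1.00
```

With M >= RTT credit memory locations the credit pipeline never runs dry, which is why the credit scheme on the next slide is work-conserving at M = RTT+1.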
8
PAUSE vs. Credit @ M = RTT+1
"Equivalent" = ‘fair’ comparison1. Credit scheme: 5 credit = 5 memory locations2. Grant scheme: 5 (=RTT+1) memory locations Performance loss for PAUSE/Grants is due to lack of underflow protection,
because if M < 2*RTT the link is not work-conserving (pipeline bubbles on restart)
For equivalent (to credit) performance, M=9 is required for PAUSE.
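A companion on/off sketch (again my own simplified model; the threshold placement, the periodic downstream stall and all names are assumptions, not the slides' setup) makes the restart bubble visible: the STOP threshold must leave about RTT of headroom for data still in flight, so with M = RTT+1 the GO threshold sits near zero and the drain runs dry for roughly an RTT after each restart, while M = 2*RTT+1 keeps RTT packets below the GO threshold to bridge that gap.

```python
# Sketch of on/off (PAUSE-style) LL-FC over an RX buffer of m packets.  The downstream
# drain stalls periodically so that the STOP/GO restart path is actually exercised.
from collections import defaultdict

def pause_link_throughput(m, rtt, steps=50_000, stall_period=50, stall_len=10):
    d = rtt // 2                    # one-way latency of data and of FC commands
    stop_th = m - rtt               # leave ~RTT of headroom for in-flight data (losslessness)
    go_th = stop_th - 1             # simple hysteresis: resume one packet below the STOP level
    arrivals = defaultdict(int)     # arrival time at the RX -> number of packets
    commands = {}                   # arrival time at the TX -> "STOP" / "GO"
    tx_on, last_cmd = True, "GO"
    q = served = dropped = 0
    for t in range(steps):
        if t in commands:                          # an FC command reaches the TX
            tx_on = commands.pop(t) == "GO"
        if tx_on:                                  # greedy TX: always has data to send
            arrivals[t + d] += 1
        q += arrivals.pop(t, 0)                    # data reaches the RX buffer
        if q > m:
            dropped += q - m
            q = m
        if (t % stall_period) >= stall_len and q > 0:   # drain stalls for stall_len steps
            q -= 1
            served += 1
        if last_cmd == "GO" and q >= stop_th:      # buffer filling up: ask the TX to stop
            commands[t + d] = last_cmd = "STOP"
        elif last_cmd == "STOP" and q <= go_th:    # buffer draining: ask the TX to resume
            commands[t + d] = last_cmd = "GO"
    return served / steps, dropped

if __name__ == "__main__":
    bound = (50 - 10) / 50          # drain availability caps the throughput at 0.80
    for m in (5, 9):                # M = RTT+1 vs. M = 2*RTT+1 memory locations
        tput, drops = pause_link_throughput(m, rtt=4)
        print(f"M={m}: throughput={tput:.2f} (bound {bound:.2f}), drops={drops}")
    # Expected qualitatively: both settings stay lossless, M=9 stays near the bound,
    # M=5 loses throughput to the restart bubbles.
```

Both settings remain lossless; only work conservation differs, which is the kind of gap the Part I conclusion quotes for credits vs. PAUSE at small M.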
9
FC-Basics: Rate
• RX queue Qi = 1 (full capacity).
• Max. flow (input arrivals) during one timestep (Δt = 1) is 1/8.
• Goal: update the TX probability Ti of any sending node during the time interval [t, t+1) to obtain the new Ti applied during the time interval [t+1, t+2).
• Algorithm for obtaining Ti(t+1) from Ti(t) ... =>
• Initially the offered rate from source0 was set to .100 and from source1 to .025; all other processing rates were .125, hence all queues show low occupancy.
• At timestep 20, the flow rate to the sink was reduced to .050, causing a congestion level in Queue2 of .125/.050 = 2.5 times the processing capacity.
• Results: the average queue occupancies are .23 to .25, except Q3 = .13. The source flows are treated about equally and their long-term sum is about .050 (optimal).
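The Ti update rule itself is elided on the slide (the "... =>"), so the sketch below substitutes a generic back-off/probe policy purely to illustrate the shape of a rate-based LL-FC loop: observe the queue over [t, t+1), then set the Ti used during [t+1, t+2). Only the queue capacity of 1, the 1/8 maximum flow per timestep and the slowdown to .050 at timestep 20 come from the slide; the single-queue topology, the thresholds and the update constants are my stand-ins, and the resulting numbers will not match the occupancies quoted above.

```python
# Illustrative rate-based LL-FC loop (NOT the slides' algorithm): the sender's TX
# probability Ti is recomputed each timestep from the flow-controlled queue's occupancy.
def rate_fc_sketch(steps=100, q_cap=1.0, max_rate=0.125):
    q = 0.0                 # occupancy of the RX queue (fluid model, capacity 1)
    ti = max_rate           # TX probability/rate applied during [t, t+1)
    history = []
    for t in range(steps):
        service = max_rate if t < 20 else 0.050    # downstream slows at timestep 20
        arrivals = min(ti, q_cap - q)              # lossless: never overfill the queue
        q = max(0.0, q + arrivals - service)
        # Generic update: back off multiplicatively when the queue is more than half
        # full, otherwise probe additively back up toward the 1/8 maximum.
        ti = 0.5 * ti if q > 0.5 * q_cap else min(max_rate, ti + 0.01 * max_rate)
        history.append((t, q, ti))
    return history

if __name__ == "__main__":
    for t, q, ti in rate_fc_sketch()[::10]:
        print(f"t={t:3d}  queue={q:.3f}  Ti={ti:.3f}")
```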
10
Conclusion Part I: Which Scheme is “Better”?
• PAUSE
+ simple
+ scalable (lower signalling overhead)
- 2x the memory size (M) required
• Credits (absolute or incremental)
+ always lossless, independent of the RTT and memory size
+ adopted by virtually all modern ICTNs (IBA, PCIe, FC, HT, ...)
- not trivial for buffer sharing
- protocol reliability
- scalability
• At equal M = RTT, credits show 30+% higher Tput vs. PAUSE
*Note: Stability of both was formally proven here
• Rate: in between PAUSE and credits
+ adopted in adapters
+ potentially a good match for BCN (e2e CM)
- complexity (bridges must stay cheap and fast)
11
Part II: Selectivity and Scope of LL-FC
"Per-Prio/VL PAUSE"
• The FC-ed 'link' could be:
– a physical channel (e.g. 802.3x)
– a virtual lane (VL, e.g. IBA's 2-16 VLs)
– a virtual channel (VC, a larger number)
– ...
• Per-Prio/VL PAUSE is the often-proposed PAUSE v2.0 ...
• Yet, is it good enough for the next decade of datacenter Ethernet?
• Evaluation of IBA vs. PCIe/AS vs. NextGen-Bridge (Prizma CI)
12
Already Implemented in IBA (and other ICTNs...)
• IBA has 15 FC-ed VLs for QoS
– SL-to-VL mapping is performed per hop, according to capabilities
• However, IBA doesn't have VOQ-selective LL-FC
– "selective" = per switch (virtual) output port
• So what? Hogging, aka buffer monopolization, HOL1-blocking, output queue lockup, single-stage congestion, saturation tree (k=0)
• How can we prove that hogging really occurs in IBA?
A. Back-of-the-envelope reasoning
B. Analytical modeling of stability and work-conservation (papers available)
C. Comparative simulations: IBA, PCI-AS, etc. (next slides)
13
IBA SE Hogging Scenario
• Simulation: parallel backup to a RAID across an IBA switch
• TX / SRC
– 16 independent IBA sources, e.g. 16 "producer" CPUs/threads
– SRC behavior: greedy, using any communication model (UD)
– SL: BE service discipline on a single VL (the other VLs suffer from their own ...)
• Fabrics (single stage)
– 16x16 IBA generic SE
– 16x16 PCI-AS switch
– 16x16 Prizma CI switch
• RX / DST
– 16 HDD "consumers"
– t0: initially each HDD sinks data at full 1x (100%)
– tsim: during the simulation HDD[0] enters thermal recalibration or sector remapping; consequently HDD[0] progressively slows down its incoming link throughput: 90, 80, ..., 10%
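As a sanity check of the mechanism only (not of the IBA / PCI-AS / Prizma CI results), the toy model below captures why this scenario hurts: one shared, link-level flow-controlled input buffer serves 16 outputs, with output 0 standing in for the slowing HDD[0]. All parameters and names are mine; in this crude model the collapse appears once output 0 drains below its 1/16 share of the offered load.

```python
# Toy model of hogging (buffer monopolization): link-level FC only knows "buffer full",
# so packets stuck behind a slow output throttle every other output on the same link.
import random

def hogging_toy(buffer_size=32, slow_rate=0.02, outputs=16, steps=50_000, seed=1):
    rng = random.Random(seed)
    buf = []                                    # shared RX buffer: destination of each packet
    served = [0] * outputs
    for _ in range(steps):
        # Greedy upstream TX; link-level FC admits a packet only if a buffer slot is free.
        if len(buf) < buffer_size:
            buf.append(rng.randrange(outputs))  # uniform destinations
        # Each output drains at most one of "its" packets per step;
        # output 0 drains only with probability slow_rate (the slowing HDD[0]).
        for port in range(outputs):
            if port == 0 and rng.random() >= slow_rate:
                continue
            if port in buf:
                buf.remove(port)
                served[port] += 1
    return [s / steps for s in served]

if __name__ == "__main__":
    for slow in (1.0, 0.05, 0.01):
        rates = hogging_toy(slow_rate=slow)
        print(f"output 0 drain rate {slow:4.2f} -> other 15 outputs' aggregate ≈ {sum(rates[1:]):.2f}")
    # Once output 0 drains slower than its 1/16 share, its packets monopolise the shared
    # buffer and the per-link backpressure starves the other 15 outputs as well.
```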
14
First: Friendly Bernoulli Traffic
2 Sources (A, B) sending @ (12x + 4x) to 16*1x End Nodes (C..R)
[Figure (from the IBA spec): aggregate throughput vs. link 0 throughput reduction; the gap between the achievable performance curve and the actual IBA performance curve is the throughput loss.]
15
Myths and Fallacies about Hogging
• Isn't IBA's static rate control sufficient?
– No, because it is STATIC.
• IBA's VLs are sufficient...?!
– No. VLs and ports are orthogonal dimensions of LL-FC:
1. VLs are for SL and QoS => VLs are assigned to prios, not ports!
2. Max. no. of VLs = 15 << max(SE_degree x SL) = 4K
• Can SE buffer partitioning solve hogging, blocking and sat_trees, at least in single-SE systems?
– No.
1. Partitioning makes sense only with status-based FC (per bridge output port - see PCIe/AS SBFC); IBA doesn't have a native status-based FC.
2. Sizing becomes the issue => we need dedication per input and output port: M = O(SL * max{RTT, MTU} * N^2), a very large number! Academic papers and theoretical dissertations prove stability and work-conservation, but the amounts of required M are large.
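To put a number on "very large": using the slide's own SL = 16 and the N = 256 implied by SE_degree x SL = 4K, and assuming (my choice, purely for illustration) a 2 KB MTU as the unit of each dedicated partition:

```python
# Back-of-the-envelope sizing of per-port-pair dedicated buffering, M = O(SL * MTU * N^2).
SL, N, MTU_BYTES = 16, 256, 2 * 1024      # SL and N from the slide; 2 KB MTU assumed
partitions = SL * N * N
print(partitions)                          # 1_048_576 dedicated partitions per switch
print(partitions * MTU_BYTES / 2**30)      # ~2.0 GiB of buffer at just one MTU each
```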
16
Conclusion Part II: Selectivity and Scope of LL-FC
Despite 16 VLs, IBA/DCE is exposed to the “transistor effect”: any single flow can modulate the aggregate Tput of all the others
Hogging (HOL1-blocking) requires a solution even for the smallest IBA/DCE system (single hop)
Prios/VL and VOQ/VC are 2 orthogonal dimensions of LL-FC
Q: QoS violation as the price of 'non-blocking' LL-FC?
• Possible granularities of LL-FC queuing domains:
A. CM can also serve as LL-FC in single-hop fabrics
B. Introduce VOQ-FC: an intermediate, coarser grain
– no. of VCs = max{VOQ} * max{VL} = 64..4096 x 2..16 <= 64K VCs
– Alternative: 802.1p (map prios to 8 VLs) + .1q (map VLANs to 4K VCs)?
Was proposed in 802.3ar...
17
Backup
18
LL-FC Between Two Bridges
[Figure: Switch[k] TX Port[k,j] holds VOQ[1]..VOQ[n], a TX scheduler and an LL-FC reception unit, and sends packets to Switch[k+1] RX Port[k+1,i]; there the RX buffer and the RX management unit (buffer allocation) drive the LL-FC TX unit on the return path of the LL-FC token.]