
1

A Deficit Round Robin 20MB/s Layer 2 Switch

Muraleedhara Navada

Francois Labonte

2

Fairness in Switches

• How to provide fair bandwidth allocation at the output link?
  – A simple FIFO favors the greedy flow

• Separate flows into FIFOs at the output
  – Bit-by-bit fair queuing
  – Weighted Fair Queuing allows different weights for flows
  – Packetized Weighted Fair Queuing (aka PGPS) calculates a departure time for each packet (see the formula after the figure below)

Output Queued Switch

[Figure: output-queued switch; three input flows with 50-, 100-, and 150-byte packets share one output link under round-robin bit-by-bit allocation]
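For reference (not from the slides), PGPS approximates the bit-by-bit fluid schedule by stamping each arriving packet with a virtual finish time and transmitting packets in increasing finish-time order. For packet k of flow i with length L_i^k and weight w_i, arriving at virtual time V(a_i^k):

F_i^k = \max\left(F_i^{k-1},\; V(a_i^k)\right) + \frac{L_i^k}{w_i}

Tracking the virtual time V(t) and sorting packets by F_i^k is what makes PGPS costly in hardware, which motivates the Deficit Round Robin approach on the next slide.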

3

Deficit Round Robin

• Packetized Weighted Fair Queuing is complicated to implement

• Deficit Round Robin keeps track of credits for each flow
  – A flow sends according to its credits
  – Credits are added according to the flow's weight
  – Essentially PWFQ at a coarser level (see the sketch after the figure below)

[Figure: DRR example over three time steps; queues of 50-, 100-, and 150-byte packets drain as per-flow credits (75, then 25/75, then 100/150) accumulate each round and are spent on dequeued packets]
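A minimal software sketch of the DRR loop illustrated above (not the authors' hardware implementation; the struct layout and the equal quantum of 75 are illustrative). Each round, a backlogged flow earns its quantum of credits and sends head-of-line packets while the credits cover them; an emptied queue forfeits any leftover credit:

```c
#include <stdio.h>

#define NFLOWS   3
#define MAXPKTS  8

typedef struct {
    int pkt_len[MAXPKTS];  /* sizes of queued packets, in bytes */
    int head, count;       /* simple ring-buffer bookkeeping    */
    int quantum;           /* credits added per round (weight)  */
    int credits;           /* deficit counter                   */
} flow_t;

static int drr_round(flow_t *f, int n)
{
    int sent = 0;
    for (int i = 0; i < n; i++) {
        if (f[i].count == 0)
            continue;                      /* skip idle flows: no credit hoarding  */
        f[i].credits += f[i].quantum;      /* add credits according to weight      */
        while (f[i].count > 0 && f[i].pkt_len[f[i].head] <= f[i].credits) {
            int len = f[i].pkt_len[f[i].head];
            f[i].credits -= len;           /* spend credits on the dequeued packet */
            f[i].head = (f[i].head + 1) % MAXPKTS;
            f[i].count--;
            printf("flow %d sends %d bytes (credits left %d)\n", i, len, f[i].credits);
            sent += len;
        }
        if (f[i].count == 0)
            f[i].credits = 0;              /* empty queue forfeits leftover credit */
    }
    return sent;
}

int main(void)
{
    /* Three backlogged flows with 50/100/150-byte packets, equal weight. */
    flow_t flows[NFLOWS] = {
        { {50, 50, 50},   0, 3, 75, 0 },
        { {100, 100},     0, 2, 75, 0 },
        { {150, 150},     0, 2, 75, 0 },
    };
    for (int r = 1; r <= 4; r++) {
        printf("-- round %d --\n", r);
        if (drr_round(flows, NFLOWS) == 0)
            break;                          /* nothing left to send */
    }
    return 0;
}
```

With equal weights each flow ends up with the same share of the link even though packet sizes differ, which is the coarse-grained fairness the figure illustrates.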

4

NetFPGA System

• 8-port 10 MB/s duplex Ethernet

• A Control FPGA (CFPGA) handles the physical interface (MAC)

• Our design targets both User FPGAs (UFPGAs)

[Figure: NetFPGA board; the CFPGA connects the 10 MB/s Ethernet ports to UFPGA0 and UFPGA1, each FPGA with an attached 1 MB SRAM]

5

Design Considerations

• 4 MACs behind each of the 8 ports

• Each flow is a unique Source Address – Destination Address pair
  – ~1024 flows (a flow-ID sketch follows this list)

• Split across FPGAs
  – Each UFPGA reads incoming packets from different ports (0-3 and 4-7)
  – Tradeoff between memory storage and fairness across all flows
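The slides do not say how a Source Address – Destination Address pair is mapped to one of the ~1024 flow IDs; the hash below is purely a hypothetical illustration of such a mapping (FNV-1a over the 12 address bytes, truncated to 10 bits), not the authors' lookup logic:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flow-ID assignment: a flow is a unique SA-DA pair, so hash
 * the 12 address bytes down to a 10-bit index (1024 flow slots). */
static uint16_t flow_id(const uint8_t sa[6], const uint8_t da[6])
{
    uint32_t h = 2166136261u;                 /* FNV-1a offset basis */
    for (int i = 0; i < 6; i++) { h ^= sa[i]; h *= 16777619u; }
    for (int i = 0; i < 6; i++) { h ^= da[i]; h *= 16777619u; }
    return (uint16_t)(h & 0x3FF);             /* keep 10 bits -> 0..1023 */
}

int main(void)
{
    uint8_t sa[6] = {0x00, 0x0A, 0x35, 0x00, 0x00, 0x01};
    uint8_t da[6] = {0x00, 0x0A, 0x35, 0x00, 0x00, 0x02};
    printf("flow id = %u\n", flow_id(sa, da));
    return 0;
}
```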

6

Memory Buffer Allocation

• Static partitioning of the 1 MB SRAM across 512 flows gives 2 kbytes per flow, less than 2 maximum-size packets

• Need more dynamic allocation
  – Segments: a smaller size means less fragmentation, but more pointer and list-handling overhead
  – 128-byte segments were chosen
  – Keep a free-segment list
  – Keep on-chip only the pointer to the head and tail of each flow (a linked-list sketch follows the figure below)

[Figure: segment memory layout; 128-byte segments of packets P1 through P6 interleaved in the SRAM and chained per flow]
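A minimal software sketch of the segment bookkeeping described above, assuming the per-segment next pointers live in a table alongside the free list (the slides only state that each flow's head and tail pointers are kept on-chip):

```c
#include <stdio.h>

#define SEG_BYTES  128
#define NUM_SEGS   (1024 * 1024 / SEG_BYTES)   /* 1 MB SRAM / 128-byte segments */
#define NUM_FLOWS  512
#define NIL        (-1)

static int next_seg[NUM_SEGS];           /* linked-list next pointer per segment */
static int free_head = NIL;              /* head of the free-segment list        */
static int flow_head[NUM_FLOWS], flow_tail[NUM_FLOWS];

static void init(void)
{
    for (int s = 0; s < NUM_SEGS; s++)   /* chain every segment onto the free list */
        next_seg[s] = s + 1;
    next_seg[NUM_SEGS - 1] = NIL;
    free_head = 0;
    for (int f = 0; f < NUM_FLOWS; f++)
        flow_head[f] = flow_tail[f] = NIL;
}

/* Pop one 128-byte segment off the free list and append it to a flow's chain. */
static int alloc_segment(int flow)
{
    int s = free_head;
    if (s == NIL)
        return NIL;                      /* out of buffer memory */
    free_head = next_seg[s];
    next_seg[s] = NIL;
    if (flow_tail[flow] == NIL)
        flow_head[flow] = s;
    else
        next_seg[flow_tail[flow]] = s;
    flow_tail[flow] = s;
    return s;
}

/* Return the flow's oldest segment to the free list (e.g. after it is read out). */
static int free_segment(int flow)
{
    int s = flow_head[flow];
    if (s == NIL)
        return NIL;
    flow_head[flow] = next_seg[s];
    if (flow_head[flow] == NIL)
        flow_tail[flow] = NIL;
    next_seg[s] = free_head;
    free_head = s;
    return s;
}

int main(void)
{
    init();
    /* A 300-byte packet for flow 7 needs 3 segments of 128 bytes. */
    for (int i = 0; i < 3; i++)
        printf("flow 7 got segment %d\n", alloc_segment(7));
    printf("freed segment %d\n", free_segment(7));
    return 0;
}
```

A maximum-size (1518-byte) Ethernet frame occupies 12 such segments, so the 128-byte segment size trades a little fragmentation in the last segment for short free-list operations.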

7

MAC address Learning

• Instead of being told which MAC addresses belong to which port, learn them from the source addresses of incoming packets
  – Note that our split-FPGA design (each FPGA reading from different ports) requires the two FPGAs to share the MACs they learn

• When the destination MAC has not been learned yet, broadcast (send to all other ports)

• So MAC learning implies broadcast capability (see the sketch below)
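A minimal software model of the learn-or-broadcast behaviour described above (the table size and linear lookup are illustrative; the real design also shares learned source MACs with the other UFPGA):

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define NUM_PORTS  8
#define TABLE_SIZE 32          /* 4 MACs behind each of the 8 ports */
#define UNKNOWN    (-1)

typedef struct { uint8_t mac[6]; int port; } entry_t;
static entry_t table[TABLE_SIZE];
static int     used = 0;

static int lookup(const uint8_t mac[6])
{
    for (int i = 0; i < used; i++)
        if (memcmp(table[i].mac, mac, 6) == 0)
            return table[i].port;
    return UNKNOWN;
}

static void learn(const uint8_t sa[6], int in_port)
{
    if (lookup(sa) == UNKNOWN && used < TABLE_SIZE) {
        memcpy(table[used].mac, sa, 6);
        table[used].port = in_port;      /* a real switch would also share this entry */
        used++;
    }
}

/* Returns a bitmap of output ports for a frame arriving on in_port. */
static uint32_t forward(const uint8_t da[6], const uint8_t sa[6], int in_port)
{
    learn(sa, in_port);                        /* learn source MAC -> ingress port    */
    int out = lookup(da);
    if (out == UNKNOWN)                        /* unknown destination: broadcast      */
        return ((1u << NUM_PORTS) - 1) & ~(1u << in_port);
    return (out == in_port) ? 0 : (1u << out); /* known: send to that port only       */
}

int main(void)
{
    uint8_t a[6] = {0, 1, 2, 3, 4, 5}, b[6] = {0, 1, 2, 3, 4, 6};
    printf("first frame a->b: port mask 0x%02x\n", forward(b, a, 0));  /* broadcast    */
    printf("reply b->a:       port mask 0x%02x\n", forward(a, b, 3));  /* unicast to 0 */
    return 0;
}
```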

8

Read Operation

[Figure: read-operation block diagram; blocks: Master Control, Packet Memory Manager, MAC Learning / Flow Assignment, DRR Engine, Control Handler, 1 MB SRAM, CFPGA interface; signals: DA/SA, flow ID, flow tail, length/pointer, read/port, share SA]

9

Write Operation

[Figure: write-operation block diagram; same blocks as the read path; signals: head/length, next head/length/latency, write/port, port REQ, port GNT, data ready]

10

DRR Engine

• How to handle 512 flows and stay work-conserving:
  – Only one flow is active at any time
  – DRR allocation happens on dequeuing
  – Per-port FIFOs contain the next flows to be serviced (see the sketch after the figure below)

• Statistics per flow
  – Weight
  – Latency
  – Bytes sent
  – Packets sent
  – Packets active

[Figure: DRR Engine structure; a 512 x 160-bit flow-data SRAM plus one FIFO per port (Port 0 through Port 7) holding the next flows to service]
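A minimal sketch of the per-port "next flow to service" FIFOs named in the figure (the sizes and re-enqueue policy are assumptions): a port pops the flow at the head of its FIFO, hands it to the DRR credit logic from the earlier sketch, and re-appends it if it is still backlogged, so the port never idles while any of its flows have data:

```c
#include <stdio.h>

#define NUM_PORTS 8
#define NUM_FLOWS 512

/* Each output port keeps a ring of backlogged flow IDs awaiting service. */
typedef struct {
    int flow[NUM_FLOWS];
    int head, tail, count;
} port_fifo_t;

static port_fifo_t fifo[NUM_PORTS];

static void enqueue_flow(int port, int flow_id)
{
    port_fifo_t *q = &fifo[port];
    q->flow[q->tail] = flow_id;
    q->tail = (q->tail + 1) % NUM_FLOWS;
    q->count++;
}

static int next_flow(int port)
{
    port_fifo_t *q = &fifo[port];
    if (q->count == 0)
        return -1;                        /* port idle: nothing to schedule */
    int flow_id = q->flow[q->head];
    q->head = (q->head + 1) % NUM_FLOWS;
    q->count--;
    return flow_id;
}

int main(void)
{
    enqueue_flow(2, 17);                   /* flows 17 and 42 backlogged on port 2 */
    enqueue_flow(2, 42);
    int f = next_flow(2);
    printf("port 2 services flow %d\n", f);
    /* ...DRR credit logic sends for flow f here (see the earlier sketch)... */
    enqueue_flow(2, f);                    /* still backlogged: back of the line   */
    printf("port 2 services flow %d next\n", next_flow(2));
    return 0;
}
```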

11

Conclusion

• A Deficit Round Robin Switch with 1k flows has been implemented

• Provides dynamic memory buffer allocation, MAC learning and broadcast

• Parallel design split across 2 chips

• Gathers statistics on flows