High Performance Switching and Routing. Telecom Center Workshop: Sept 4, 1997.
Making Parallel Packet Switches Practical
Sundar Iyer, Nick McKeown (sundaes, nickm)@stanford.edu
Departments of Electrical Engineering & Computer Science, Stanford University
http://klamath.stanford.edu/pps
Stanford University 2
Motivation
To design and analyze:
– an architecture of a very high capacity packet switch
– in which the memories run slower than the line rate
[Ref: S. Iyer, A. Awadallah, N. McKeown, “Analysis of a Packet Switch with Memories Running Slower than the Line Rate,” Proc. IEEE Infocom, Tel Aviv, Mar. 2000.]
What limits capacity of packet switches today?
• Memory bandwidth for packet buffers
– Shared memory: B = 2NR
– Input queued: B = 2R
• Switch Arbitration
– At the line rate R
• Packet Processing
– At the line rate R
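As a quick sanity check on the two formulas above, a minimal sketch (the port count and line rate chosen here are illustrative, not from the slides):

```python
# Required packet-buffer memory bandwidth, per the formulas above.

def shared_memory_bw(n_ports: int, line_rate_gbps: float) -> float:
    """Shared memory: every cell is written once and read once into one
    shared memory, which must therefore sustain B = 2NR."""
    return 2 * n_ports * line_rate_gbps

def input_queued_bw(line_rate_gbps: float) -> float:
    """Input queued: each per-port memory sees one write and one read
    at the line rate, so B = 2R per memory."""
    return 2 * line_rate_gbps

# Example: a 32-port switch with 10 Gb/s lines.
print(shared_memory_bw(32, 10))  # 640 Gb/s for the single shared memory
print(input_queued_bw(10))       # 20 Gb/s per input memory
```

The gap between the two is exactly why shared-memory designs stop scaling first: the shared memory's bandwidth grows with N, while each input-queued memory's does not.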
How can we scale the capacity of switches?
What we’d like:

[Figure: a single large NxN switch with every line at rate R]

The building blocks we’d like to use:

[Figure: slower NxN switches]
Why this might be a good idea
• Larger Capacity
• Slower than the line rate
– Buffering
– Arbitration
– Packet Processing
• Redundancy
Observations and Questions
• Random load-balancing:
– It’s hard to predict system performance.
• Flow-by-flow load-balancing:
– Worst-case performance is very poor.
• Can we do better?
– What if we switch packet by packet?
– Can we achieve 100% throughput?
– Can we give delay guarantees?
[Figure: k parallel center-stage switches numbered 1 to k; external lines run at rate R, internal links at rate R/k]
Architecture of a PPS
Definition:
A PPS is comprised of multiple identical lower-speed packet-switches operating independently and in parallel. An incoming stream of packets is spread, packet-by-packet, by a demultiplexor across the slower packet-switches, then recombined by a multiplexor at the output.
We call this “parallel packet switching”
Architecture of a PPS
[Figure: N=4 input lines at rate R, each with its own demultiplexor; k=3 center-stage OQ switches; the links between the demultiplexors/multiplexors and each center-stage switch run at rate (sR/k); multiplexors recombine cells onto the N=4 output lines at rate R]
We will compare it to an OQ Switch
[Figure: NxN output queued switch with all N input and output lines at rate R; internal bandwidth = 2NR]
• Why?
– There is no internal contention
– No queueing at the inputs
– They give the minimum delay
– They can give QoS guarantees
Definition
• Relative Queueing Delay
– This is defined as the increased queueing delay faced by a cell in the PPS relative to the delay it would receive in a shadow output queued switch
– It includes only the time difference attributable to queueing
– A switch is said to emulate an OQ switch if the relative queueing delay is zero
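The definition above can be captured in a few lines. This is an illustrative sketch, not code from the paper; the helper name and the departure-time lists are assumptions:

```python
# Relative queueing delay of each cell, given its departure time from
# the PPS and from a shadow OQ switch fed the identical arrival sequence.

def relative_queueing_delay(pps_departures, oq_departures):
    """Per-cell relative delay; the PPS emulates the OQ switch
    iff every entry is zero."""
    return [p - q for p, q in zip(pps_departures, oq_departures)]

delays = relative_queueing_delay([5, 7, 10], [5, 6, 8])
print(delays)            # [0, 1, 2]
print(max(delays) == 0)  # False: the delay is bounded, but this is not exact emulation
```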
A PPS with a bounded relative delay

[Figure: the same cell stream enters both the PPS and a shadow OQ switch. A cell C departs the shadow OQ switch at time t’ and the PPS at time t”; the outputs are compared, and the PPS has a bounded relative delay if t” – t’ < constant]
Problem Statement Redefined
Motivation:
“To design and analyze an architecture of a very high capacity packet switch in which the memories run slower than the line rate, while preserving the good properties of an OQ switch”
This talk:
Expanding the capacity of a FIFO packet switch, with a bounded relative queueing delay, using the PPS architecture.
A Bad Scenario for the PPS

[Figure: N=4, k=3 PPS with internal links running at R/3. An arrival pattern concentrates consecutive cells for the same output on a single center-stage switch, so one layer is overloaded while the others idle and the output falls behind the shadow OQ switch]
Parallel Packet Switch: Result
Theorem:
• If the speedup S >= 2, then a PPS can emulate a FIFO OQ switch for all traffic patterns.
Is this Practical?
• Load Balancing Algorithm
– Is centralized
– Requires N² communication complexity
– Ideally we want a distributed algorithm
• Speedup
– A speedup of 2 is required
– We would ideally like no speedup
Load Balancing in a PPS

[Figure: N=4, k=3 PPS; cells 1, 2, 3 from each input are spread across the three layers; internal link rates shown are 2R/3, R/3, R/3]
Distribution of Cells from a Demultiplexor
[Figure: demultiplexor at input 1 holding one FIFO per layer (k=3), each drained at rate R/3; arriving cells numbered 1–16 are distributed across the three FIFOs]
• Cells from every input to every output are sent to the center-stage switches in a round-robin manner
“No more than 4 consecutive cells can go to the same FIFO, i.e. center-stage switch”
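The round-robin spreading described above can be sketched as follows. This is an illustrative model, not the authors' implementation; the class and method names are invented:

```python
from collections import defaultdict

K = 3  # number of center-stage switches (layers)

class Demultiplexor:
    """Spreads arriving cells over k center-stage FIFOs, keeping an
    independent round-robin pointer per output so that cells for the
    same output rotate evenly over the layers."""

    def __init__(self, k: int = K):
        self.k = k
        self.next_layer = defaultdict(int)   # per-output RR pointer
        self.fifos = [[] for _ in range(k)]  # one FIFO per layer

    def enqueue(self, cell_id: int, output: int) -> int:
        """Send the cell to the layer chosen round-robin for its output."""
        layer = self.next_layer[output]
        self.fifos[layer].append((cell_id, output))
        self.next_layer[output] = (layer + 1) % self.k
        return layer

d = Demultiplexor()
# Ten consecutive cells to output 0 spread evenly over the 3 layers:
layers = [d.enqueue(c, output=0) for c in range(10)]
print(layers)  # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
```

Because the pointer is per output, no single FIFO ever receives a long run of cells for one output, which is the property the quoted bound relies on.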
Modification to the PPS
• Relax the relative queueing bound
– Allows a distributed load balancing arbiter
• Run an independent load balancing algorithm on each demultiplexor
– Eliminates N² communication complexity
• Keep small & bounded delay buffers at the demultiplexor
– Eliminates speedup on the links between the demultiplexor and the center stage switches
Cells as seen by the Multiplexor

[Figure: N=4, k=3 PPS; cells 1, 2, 3 arrive at each multiplexor from the three layers over R/3 links, possibly out of order]
Solution
• Read
– cells from the corresponding queues (which may be out of order), based on their arrival times, from all center stage switches, to maintain throughput
• Introduce
– a small and bounded re-sequencing buffer at the multiplexor to re-order cells and send them in sequence
• Tolerate
– a bounded delay relative to the shadow FIFO OQ switch
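The re-sequencing step can be sketched with a small min-heap keyed on sequence numbers. Again an illustrative model, not the paper's implementation; names are invented:

```python
import heapq

class Resequencer:
    """Bounded re-order buffer at the multiplexor: cells arriving from
    the k layers carry their original sequence numbers and are released
    strictly in sequence."""

    def __init__(self):
        self.heap = []      # cells held until their turn comes
        self.next_seq = 0   # next sequence number to release

    def push(self, seq: int):
        """A cell arrives from some center-stage switch."""
        heapq.heappush(self.heap, seq)

    def pop_ready(self):
        """Release every cell that is now in order."""
        out = []
        while self.heap and self.heap[0] == self.next_seq:
            out.append(heapq.heappop(self.heap))
            self.next_seq += 1
        return out

r = Resequencer()
released = []
for seq in [1, 0, 3, 2]:       # out-of-order arrivals from the layers
    r.push(seq)
    released += r.pop_ready()
print(released)  # [0, 1, 2, 3]: in-sequence departure order restored
```

Because the relative delay is bounded, a cell can only be out of order by a bounded amount, so the heap (and the departure delay it adds) stays small and bounded.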
Properties of the PPS Demultiplexor
• Demultiplexors
– Cells arrive at a combined rate R over all k FIFOs
– Each cell has a property: its output
– Cells to the same output are inserted into the k FIFOs in round-robin order
– Cells are written into each FIFO buffer at a rate constrained by a leaky bucket of average rate R/k and burst size N
– Cells are read from each FIFO at a constant service rate R/k
– The maximum delay faced by a cell is N internal time slots
Relative Queueing Delay faced by a Cell
• Demultiplexors
– A maximum relative queueing delay of N internal time slots is encountered by a cell
• Multiplexors
– A maximum relative queueing delay of N internal time slots is encountered by a cell
• Total Relative Queueing Delay
– 2N time slots
Buffered PPS: Results
• A PPS with a completely distributed algorithm, no speedup, and a buffer of size Nk can emulate a FIFO output queued switch for all traffic patterns, within a relative queueing delay bound of 2N internal time slots, i.e. 2Nk time slots.
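The arithmetic behind the stated bound, as a sketch (the function and its example arguments are illustrative): N internal time slots of relative delay at the demultiplexor plus N at the multiplexor give 2N internal time slots, and since the internal links run at R/k, each internal time slot spans k external time slots:

```python
# Total relative queueing delay bound of the buffered PPS, per the slides.

def pps_delay_bound(n_ports: int, k_layers: int):
    demux_delay = n_ports           # internal time slots at the demultiplexor
    mux_delay = n_ports             # internal time slots at the multiplexor
    internal = demux_delay + mux_delay
    external = internal * k_layers  # external (line-rate) time slots
    return internal, external

# Example: the N=4, k=3 PPS used throughout the slides.
print(pps_delay_bound(n_ports=4, k_layers=3))  # (8, 24): 2N = 8, 2Nk = 24
```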