
Can we make these scheduling algorithms simpler? Using a Simpler Architecture


Page 1

COMP680E by M. Hamdi

Can we make these scheduling algorithms simpler? Using a Simpler Architecture

Page 2

Buffered Crossbar Switches

• A buffered crossbar switch is a switch with buffered fabric (memory inside the crossbar).

• A pure buffered crossbar switch architecture has buffering only inside the fabric and none anywhere else.

• Because of the head-of-line (HoL) blocking problem, virtual output queues (VOQs) are used at the input side.

Page 3

Buffered Crossbar Architecture

[Figure: input cards 1 to N, each with an arbiter; a buffered crossbar with one arbiter per output; output cards 1 to N; paths labelled Data and Flow Control.]

Page 4

Scheduling Process

Scheduling is divided into three steps:

–Input scheduling: each input selects in a certain way one cell from the HoL of an eligible queue and sends it to the corresponding internal buffer.

–Output scheduling: each output selects in a certain way from all internally buffered cells in the crossbar to be delivered to the output port.

–Delivery notifying: for each delivered cell, inform the corresponding input of the internal buffer status.

Page 5

Advantages

• Total independence between input and output arbiters (distributed design) (1/N complexity as compared to centralized schedulers)

• Performance of Switch is much better (because there is much less output contention) – a combination of IQ and OQ switches

• Disadvantage: Crossbar is more complicated

Page 6

I/O Contention Resolution

[Figure: 4x4 buffered crossbar example with inputs 1 to 4 and outputs 1 to 4.]

Page 7

I/O Contention Resolution

[Figure: 4x4 buffered crossbar example with inputs 1 to 4 and outputs 1 to 4.]

Page 8

The Round Robin Algorithm: InRr-OutRr

• Input scheduling: InRr (Round-Robin)

- Each input selects the next eligible VOQ, starting from its highest-priority pointer, and sends that VOQ's HoL packet to the internal buffer.

• Output scheduling: OutRr (Round-Robin)

- Each output selects the next nonempty internal buffer, starting from its highest-priority pointer, and sends the buffered packet to the output link.
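The round-robin scheme can be sketched as a small simulation. This is only an illustration under stated assumptions, not the implementation behind the slides: the class name `BufferedCrossbar`, the one-cell crosspoint buffers, and the choice to let a cell enter and leave the crossbar within the same time slot are all hypothetical. Freeing a crosspoint buffer stands in for the delivery-notification step.

```python
from collections import deque


class BufferedCrossbar:
    """Toy N x N buffered crossbar with one-cell crosspoint buffers (assumed size)."""

    def __init__(self, n):
        self.n = n
        # voq[i][j]: cells waiting at input i for output j
        self.voq = [[deque() for _ in range(n)] for _ in range(n)]
        # xb[i][j]: crosspoint buffer between input i and output j (None = empty)
        self.xb = [[None] * n for _ in range(n)]
        self.in_ptr = [0] * n   # highest-priority VOQ index per input
        self.out_ptr = [0] * n  # highest-priority crosspoint index per output

    def enqueue(self, i, j, cell):
        self.voq[i][j].append(cell)

    def step(self):
        """Run one time slot and return the cell departing on each output (or None)."""
        n = self.n
        # Input scheduling (InRr): first eligible VOQ at or after the pointer.
        for i in range(n):
            for k in range(n):
                j = (self.in_ptr[i] + k) % n
                if self.voq[i][j] and self.xb[i][j] is None:
                    self.xb[i][j] = self.voq[i][j].popleft()
                    self.in_ptr[i] = (j + 1) % n
                    break
        # Output scheduling (OutRr): first non-empty crosspoint at or after the pointer.
        departed = [None] * n
        for j in range(n):
            for k in range(n):
                i = (self.out_ptr[j] + k) % n
                if self.xb[i][j] is not None:
                    departed[j] = self.xb[i][j]
                    self.xb[i][j] = None  # buffer is free again: the input may refill it
                    self.out_ptr[j] = (i + 1) % n
                    break
        return departed


sw = BufferedCrossbar(2)
sw.enqueue(0, 0, "a")
sw.enqueue(1, 0, "b")
print(sw.step())  # both cells enter the crossbar; output 0 serves input 0 first
print(sw.step())  # output 0 now serves input 1
```

Each input and each output decides on its own, using only its own pointer and the buffer state it can see, which is the independence the slides emphasize.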

Page 9

Input Scheduling (InRr.)

[Figure: 4x4 example showing the input scheduling step.]

Page 10

Output Scheduling (OutRr.)

[Figure: 4x4 example showing the output scheduling step.]

Page 11

Output pointer update + notification delivery

[Figure: 4x4 example showing the output pointers being updated and delivery being notified.]

Page 12

Performance study

Delay/throughput under Bernoulli Uniform and Bursty Uniform traffic

Stability performance:

$$\|L(n)\| = \sqrt{VOQ_{1,1}^2(n) + \cdots + VOQ_{1,N}^2(n) + \cdots + VOQ_{N,1}^2(n) + \cdots + VOQ_{N,N}^2(n)}$$

Page 13

Bernoulli Uniform Arrivals

[Figure: 32x32 Switch under Bernoulli Uniform Traffic. Average delay (log scale) versus normalized load (0.3 to 1) for OQ, RR-RR, 1-SLIP, and 4-SLIP.]

Page 14

Bursty Uniform Arrivals

[Figure: 32x32 Switch under Bursty Uniform Traffic. Average delay (log scale) versus normalized load (0.3 to 1) for OQ, RR-RR, 1-SLIP, and 4-SLIP.]

Page 15

Scheduling Process

Because the arbitration is simple:

– We can afford to have algorithms based on weights for example (LQF, OCF).

– We can afford to have algorithms that provide QoS

Page 16

Buffered Crossbar Solution: Scheduler

• The algorithm MVF-RR is composed of two parts:

– Input scheduler – MVF (most vacancies first)

Each input selects the column of internal buffers (destined to the same output) where there are most vacancies (non-full buffers).

– Output scheduler – Round-robin

Each output chooses the internal buffer which appears next on its static round-robin schedule from the highest priority one and updates the pointer to 1 location beyond the chosen one.

Page 17

Buffered Crossbar Solution: Scheduler

• The algorithm ECF-RR is composed of two parts:

– Input scheduler – ECF (empty column first)

Each input selects the first empty column of internal buffers (destined to the same output). If there is no empty column, it selects on a round-robin basis.

– Output scheduler – Round-robin

Each output chooses the internal buffer which appears next on its static round-robin schedule from the highest priority one and updates the pointer to 1 location beyond the chosen one.
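The MVF and ECF input schedulers differ only in how an input picks a column. The sketch below is one possible reading, not the authors' implementation. The function names, the `voq_len` and `xb_full` matrices, the lowest-index tie-break in MVF, and the interpretation of "first empty column" as first in round-robin order from the input's pointer are all assumptions made for illustration.

```python
def eligible_outputs(i, voq_len, xb_full):
    """Outputs j that input i may serve: VOQ(i, j) non-empty and crosspoint (i, j) free."""
    n = len(xb_full)
    return [j for j in range(n) if voq_len[i][j] > 0 and not xb_full[i][j]]


def column_vacancies(j, xb_full):
    """Number of non-full internal buffers in column j (all destined to output j)."""
    return sum(1 for row in xb_full if not row[j])


def mvf_select(i, voq_len, xb_full):
    """Most vacancies first; ties go to the lowest output index (assumed tie-break)."""
    candidates = eligible_outputs(i, voq_len, xb_full)
    if not candidates:
        return None
    return max(candidates, key=lambda j: (column_vacancies(j, xb_full), -j))


def ecf_select(i, voq_len, xb_full, ptr):
    """Empty column first, scanning from ptr; otherwise plain round-robin from ptr."""
    n = len(xb_full)
    candidates = eligible_outputs(i, voq_len, xb_full)
    if not candidates:
        return None
    order = sorted(candidates, key=lambda j: (j - ptr) % n)
    for j in order:
        if column_vacancies(j, xb_full) == n:  # every buffer in the column is empty
            return j
    return order[0]


# xb_full[i][j] is True when the crosspoint buffer (i, j) holds a cell.
xb_full = [[False, True, False],
           [False, False, False],
           [True, False, False]]
voq_len = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(mvf_select(1, voq_len, xb_full), ecf_select(1, voq_len, xb_full, ptr=0))
```

Both selectors need only the occupancy bits of the internal buffers, which is the information the flow-control path already carries back to the inputs.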

Page 18

Buffered Crossbar Solution: Scheduler

• The algorithm RR-REMOVE is composed of two parts:

– Input scheduler – Round-robin (with remove-request signal sending)

Each input chooses the non-empty VOQ which appears next on its static round-robin schedule from the highest priority one and updates the pointer to 1 location beyond the chosen one. It then sends out at most one remove-request signal to the outputs.

– Output scheduler – REMOVE

For each output, if it receives any remove-request signals, it chooses one of them based on its highest priority pointer and removes the cell. If no signal is received, it does simple round-robin arbitration.

Page 19

Buffered Crossbar Solution: Scheduler

• The algorithm ECF-REMOVE is composed of two parts:

– Input scheduler – ECF (with remove-request signal sending)

Each input selects the first empty column of internal buffers (destined to the same output). If there is no empty column, it selects on a round-robin basis. It then sends out at most one remove-request signal to the outputs.

– Output scheduler – REMOVE

For each output, if it receives any remove-request signals, it chooses one of them based on its highest priority pointer and removes the cell. If no signal is received, it does simple round-robin arbitration.

Page 20

Hardware Implementation of ECF-RR: An Input Scheduling Block

[Figure: block diagram with selectors 0 to N-1, two round-robin arbiters, a highest priority pointer, an "any grant" signal, grants, and arbitration results.]

Page 21

Performance Evaluation: Simulation Study (Uniform Traffic)

[Figure: 32x32 Switch under uniform traffic. Relative average delay versus normalized load (0.1 to 1) for RR-RR, IBM, MVF-RR, ECF-RR, RR-REMOVE, ECF-REMOVE, and Output.]

Page 22

Performance Evaluation: Simulation Study

ECF-REMOVE over RR-RR

Load                                0.5   0.6   0.7   0.8   0.9   0.95  0.99
Improvement Percentage              1%    1%    3%    6%    13%   17%   12%
Normalized Improvement Percentage   1%    1%    3%    6%    12%   15%   11%
Improvement Factor                  1.01  1.01  1.03  1.06  1.13  1.17  1.12

Page 23

Performance Evaluation: Simulation Study (Bursty Traffic)

[Figure: 32x32 Switch under uniform bursty traffic (average burst size: 16). Average delay versus normalized load (0.1 to 1) for RR-RR, IBM, MVF-RR, ECF-RR, RR-REMOVE, ECF-REMOVE, and Output.]

Page 24

Performance Evaluation: Simulation Study

ECF-REMOVE over RR-RR

Load                                0.5   0.6   0.7   0.8   0.9   0.95  0.99
Improvement Percentage              10%   13%   16%   20%   22%   18%   11%
Normalized Improvement Percentage   9%    12%   14%   16%   18%   16%   10%
Improvement Factor                  1.10  1.13  1.16  1.20  1.22  1.18  1.11

Page 25

Performance Evaluation: Simulation Study (Hotspot Traffic)

[Figure: 32x32 Switch under hotspot traffic. Average delay versus normalized load (0.05 to 0.55) for RR-RR, IBM, MVF-RR, ECF-RR, RR-REMOVE, ECF-REMOVE, and Output.]

Page 26

Performance Evaluation: Simulation Study

ECF-REMOVE over RR-RR

Load                                0.31   0.36   0.41   0.45   0.49   0.51
Improvement Percentage              0.2%   0.3%   0.5%   0.8%   1%     0.7%
Normalized Improvement Percentage   0.2%   0.3%   0.5%   0.8%   1%     0.7%
Improvement Factor                  1.002  1.003  1.005  1.008  1.01   1.007

Page 27

Quality of Service Mechanisms for

Switches/Routers and the Internet

Page 28

Recap

• High-Performance Switch Design

– We need scalable switch fabrics – crossbar, bit-sliced crossbar, Clos networks.

– We need to solve the memory bandwidth problem

Our conclusion is to go for input-queued switches

We need to use VOQ instead of FIFO queues

– For these switches to function at high-speed, we need efficient and practically implementable scheduling/arbitration algorithms

Page 29

Algorithms for VOQ Switching

• We analyzed several algorithms for matching inputs and outputs

– Maximum size matching: these are based on bipartite maximum matching, which can be solved using max-flow techniques in O(N^2.5)

• These are not practical for high-speed implementations

• They are stable (100% throughput) for uniform traffic

• They are not stable for non-uniform traffic

– Maximal size matching: they try to approximate maximum size matching

• PIM, iSLIP, SRR, etc.

• These are practical: they can be executed in parallel in O(log N) or even O(1)

• They are stable for uniform traffic and unstable for non-uniform traffic

Page 30

Algorithms for VOQ Switching

– Maximum weight matching: these are maximum matchings based on weights such as queue length (LQF, LPF) or age of cell (OCF), with a complexity of O(N^3 log N)

• These are not practical for high-speed implementations. They are much more difficult to implement than maximum size matching

• They are stable (100% throughput) under any admissible traffic

– Maximal weight matching: they try to approximate maximum weight matching. They use an RGA (request-grant-accept) mechanism like iSLIP

• iLQF, iLPF, iOCF, etc.

• These are “somewhat” practical: they can be executed in parallel in O(log N) or even O(1) like iSLIP, BUT the arbiters are much more complex to build

• They were “recently” shown to be stable under any admissible traffic
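To make the objective of maximum weight matching concrete, the sketch below finds the best matching for a tiny switch by brute force, using VOQ lengths as weights (the LQF choice). It is only an illustration of what is being optimized: it enumerates all N! permutations, so it is not the O(N^3 log N) algorithm mentioned above, and the function name `max_weight_matching` is hypothetical.

```python
from itertools import permutations


def max_weight_matching(weights):
    """Return (matching, total weight) for an N x N weight matrix by exhaustive search.

    weights[i][j] is the weight of connecting input i to output j,
    e.g. the length of VOQ(i, j) for LQF. Only feasible for very small N.
    """
    n = len(weights)
    best_perm, best_total = None, -1
    for perm in permutations(range(n)):
        total = sum(weights[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    # Pairs with zero weight carry no cell, so they are left out of the matching.
    matching = [(i, j) for i, j in enumerate(best_perm) if weights[i][j] > 0]
    return matching, best_total


voq_lengths = [[3, 2],
               [4, 0]]
print(max_weight_matching(voq_lengths))
```

In this example a greedy choice that serves the longest queue first (input 1 to output 0) happens to agree with the optimum, but in general a maximal matching built greedily can fall short of the maximum weight, which is the gap the iterative algorithms try to close.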

Page 31

Algorithms for VOQ Switching

– Randomized algorithms

• They try in a smart way to approximate maximum weight matching by avoiding an iterative process

• They are stable under any admissible traffic

• Their time complexity is small (depending on the algorithm)

• Their hardware complexity is as yet untested

– No schedulers: deal with mis-sequencing of packets

– Distributed schedulers: buffered crossbars

– Two important points to remember

• The time complexity of an algorithm is not a “true” indication of the complexity of its hardware implementation

• 100% throughput does not mean a low delay

• “Weak” vs. “Strong” stability

Page 32

VOQ Algorithms and Delay

• But, delay is key

– Because users don’t care about throughput alone

– They care (more) about delays

– Delay = QoS (= $ for the network operator)

• Why is delay difficult to approach theoretically?

– Mainly because it is a statistical quantity

– It depends on the traffic statistics at the inputs

– It depends on the particular scheduling algorithm used

• The last point makes it difficult to analyze delays in input-queued switches

• For example, in VOQ switches it is almost impossible to give any guarantees on delay

• All you can hope for is a high throughput and a bounded queue length, and hence a bounded average delay (but even the bound on the queue length is beyond the control of the algorithm: we cannot require that the queue length never exceed 10)

Page 33

VOQ Algorithms and Delay

• This does not mean that we cannot have an algorithm that can do that. It means that none exists at the moment.

• For this exact reason, almost all quality of service schemes (whether for delay or bandwidth guarantees) assume that you have an output-queued switch

[Figure: switch with links 1 to 4, each with an ingress and an egress side.]

Page 34

VOQ Algorithms and Delay

• WHY: Because an OQ switch has no “fabric” scheduling/arbitration algorithm

– Delay simply depends on traffic statistics

– Researchers have shown that you can provide a lot of QoS algorithms (like WFQ) using a single server and based on the traffic statistics

• But, OQ switches are extremely expensive to build

– Memory bandwidth requirement is very high

– These QoS scheduling algorithms have little practical significance for scalable and high-performance switches/routers.

Page 35

Output Queueing: The “ideal”

[Figure: output-queued switch example with cells labelled 1 and 2.]

Page 36

How to get good delay cheaply?

• Enter speedup…

– The fabric speedup for an IQ switch equals 1 (memory bandwidth = 2)

– The fabric speedup for an OQ switch equals N (memory bandwidth = N+1)

– Suppose we consider switches with fabric speedup of S, 1 < S << N

– Such switches will require buffers both at the input and the output: call these combined input- and output-queued (CIOQ) switches

• Such switches could help if…

– With very small values of S

– We get the performance, both delay and throughput, of an OQ switch
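A quick calculation shows why a small speedup is attractive. The sketch assumes that the per-port memory bandwidth is (S + 1) times the line rate for any speedup S. That rule reproduces the two values quoted above (2 for S = 1 and N + 1 for S = N), but applying it to intermediate S is an assumption of this example, as are the function name and the port count of 32.

```python
def memory_bandwidth(speedup, line_rate=1.0):
    """Per-port buffer memory bandwidth, assumed to be (speedup + 1) * line_rate."""
    return (speedup + 1) * line_rate


N = 32  # hypothetical port count, chosen only for illustration
for name, s in [("IQ", 1), ("CIOQ, S=2", 2), ("OQ", N)]:
    print(f"{name:10s} speedup={s:2d}  memory bandwidth={memory_bandwidth(s):g} x line rate")
```

Under this assumption a CIOQ switch with S = 2 needs 3 times the line rate per port, regardless of N, while the OQ requirement grows linearly with the number of ports.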

Page 37

A CIOQ switch

• Consists of

– An (internally non-blocking, e.g. crossbar) fabric with speedup S > 1

– Input and output buffers

– A scheduler to determine matchings

Page 38

A CIOQ switch

• For concreteness, suppose S = 2. The operation of the switch consists of

– Transferring no more than 2 cells from (to) each input (output)

– Logically, we will think of each time slot as consisting of two phases

– Arrivals to (departures from) switch occur at most once per time slot

– The transfer of cells from inputs to outputs can occur in each phase

Page 39

Using Speedup

[Figure: example of cell transfers with speedup; cells labelled 1 and 2.]

Page 40

Performance of CIOQ switches

• Now that we have a higher speedup, do we get a handle on delay?

– Can we say something about delay (e.g., every packet from a given flow should be delayed by less than 15 msec)?

– There is one way of doing this: competitive analysis. The idea is to compete with the performance of an OQ switch

Page 41

Intuition

Speedup   Fabric throughput   Ave I/p queue
1         0.58                too large
2         1.16                6.25

Page 42

Intuition (continued)

Speedup   Fabric throughput   Ave I/p queue
3         1.74                1.35
4         2.32                0.75

Page 43

Performance of CIOQ switches

• The setup

– Under arbitrary, but identical inputs (packet-by-packet)

– Is it possible to replace an OQ switch by a CIOQ switch and schedule the CIOQ switch so that the outputs are identical packet-by-packet, that is, to exactly mimic an OQ switch?

– If yes, what is the scheduling algorithm?

Page 44

What is exact mimicking?

Apply the same inputs to an OQ and a CIOQ switch, packet by packet

Obtain the same outputs, packet by packet

Page 45

Why is a speedup of N not necessary?

It is useless to bring all packets to the output if they need to wait at the output.

We only need to bring each packet to the output before it is due to leave.

What is exact mimicking?

Page 46

Consequences

• Suppose, for now, that a CIOQ switch is competitive with respect to an OQ switch. Then

– We get perfect emulation of an OQ switch

– This means we inherit all its throughput and delay properties

– Most importantly – all QoS scheduling algorithms originally designed for OQ switches can be directly used on a CIOQ switch

– But, at the cost of introducing a scheduling algorithm – which is the key

Page 47

Emulating OQ Switches with CIOQ

• Consider an N x N switch with (integer) speedup S > 1

– We’re going to see if this switch can emulate an OQ switch

• We’ll apply the same inputs, cell-by-cell, to both switches

– We’ll assume that the OQ switch sends out packets in FIFO order

– And we’ll see if the CIOQ switch can match cells on the output side

Page 48

Key concept: Urgency

Urgency of a cell at any time = its departure time - current time

• It indicates the time remaining until this cell would depart the OQ switch

• This value is decremented after each time slot

• When the value reaches 0, the cell must depart (it is at the HoL of the output queue)

[Figure: OQ switch]

Page 49

Key concept: Urgency

• Algorithm: Most urgent cell first (MUCF). In each “phase”

1. Outputs try to get their most urgent cells from inputs.

2. Each input grants to the output whose cell is most urgent. In case of ties, output i takes priority over output i + k.

3. Loser outputs try to obtain their next most urgent cell from another (unmatched) input.

4. When no more matchings are possible, cells are transferred.
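The four steps of one MUCF phase can be sketched as a loop of request and grant rounds. This is a minimal sketch under stated assumptions rather than a reference implementation: the function name `mucf_phase`, the `urgency` matrix representation, and the rule that an output with two equally urgent cells asks the lower-numbered input are hypothetical. The tie rule on the input side (lower output index wins) follows step 2.

```python
def mucf_phase(urgency):
    """Match outputs to inputs for one phase of most-urgent-cell-first.

    urgency[i][j] is the urgency of the most urgent cell held at input i for
    output j, or None if input i holds no cell for output j.
    Returns a dict mapping each matched output j to its input i.
    """
    n = len(urgency)
    match = {}
    free_inputs = set(range(n))
    free_outputs = set(range(n))
    while True:
        # Steps 1 and 3: each unmatched output requests its most urgent cell
        # among the inputs that are still unmatched.
        requests = {}  # input i -> list of (urgency, output j)
        for j in sorted(free_outputs):
            candidates = [(urgency[i][j], i) for i in free_inputs
                          if urgency[i][j] is not None]
            if candidates:
                u, i = min(candidates)
                requests.setdefault(i, []).append((u, j))
        if not requests:
            break  # Step 4: no more matchings are possible, cells are transferred
        # Step 2: each input grants the most urgent request; ties go to the lower output.
        for i, reqs in requests.items():
            u, j = min(reqs)
            match[j] = i
            free_inputs.discard(i)
            free_outputs.discard(j)
    return match


# Two outputs both want input 0 (urgency 1); output 1 falls back to input 1 (urgency 3).
urgency = [[1, 1],
           [None, 3]]
print(mucf_phase(urgency))
```

Every round with at least one request produces at least one new matched pair, so the loop finishes after at most N rounds.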

Page 50

Key concept: Urgency - Example

• At the beginning of phase 1, both outputs 1 and 2 request input 1 to obtain their most urgent cells

• Since there is a tie, input 1 grants to output 1 (the lowest port number wins).

• Output 2 proceeds to get its next most urgent cell (from input 2, with an urgency of 3)

Page 51

Key concept: Urgency

• Observation: A cell is not forwarded from input to output for one of two (and only two) reasons…

– Input contention: its input sends a more urgent cell (output 2 cannot receive its most urgent cell in phase 1 because input 1 wants to send a more urgent cell to output 1)

– Output contention: its output receives a more urgent cell (input 2 cannot send its most urgent cell because output 3 wants to receive from input 3)

Page 52

Implementing MUCF

• The way in which MUCF matches inputs to outputs is similar to the “stable marriage problem” (SMP)

• The SMP finds “stable” matchings in bipartite graphs

– There are N women and N men

– Each woman (man) ranks each man (woman) in order of preference for marriage

Page 53

The Gale-Shapley Algorithm (GSA)

• What is a stable matching?

– A matching is a pairing (i, p(i)) of each i with their partner p(i)

– An unstable matching is one in which there are matched pairs (i, p(i)) and (j, p(j)) such that i prefers p(j) to p(i), and p(j) prefers i to j

– The GSA is guaranteed to give a stable matching

– Its complexity is O(N^2)
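A compact version of the Gale-Shapley algorithm makes the O(N^2) bound visible: each proposer proposes to each acceptor at most once. The sketch below is a textbook-style rendering, with the names `gale_shapley` and `is_stable` and the 0-based preference lists chosen for this example; the small instance at the end is made up and is not the example from the slides.

```python
def gale_shapley(proposer_prefs, acceptor_prefs):
    """Return (matching, number of proposals); matching maps proposer -> acceptor.

    proposer_prefs[p] lists acceptors from most to least preferred, and
    acceptor_prefs[a] lists proposers likewise. Both sides have N members.
    """
    n = len(proposer_prefs)
    rank = [{p: r for r, p in enumerate(prefs)} for prefs in acceptor_prefs]
    next_choice = [0] * n       # index of the next acceptor each proposer will try
    engaged_to = [None] * n     # acceptor -> current proposer
    free = list(range(n))
    proposals = 0
    while free:
        p = free.pop()
        a = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        proposals += 1
        current = engaged_to[a]
        if current is None:
            engaged_to[a] = p
        elif rank[a][p] < rank[a][current]:
            engaged_to[a] = p       # a trades up; the old partner is free again
            free.append(current)
        else:
            free.append(p)          # a rejects p
    return {engaged_to[a]: a for a in range(n)}, proposals


def is_stable(match, proposer_prefs, acceptor_prefs):
    """Check that no proposer and acceptor prefer each other to their partners."""
    partner = {a: p for p, a in match.items()}
    for p, a in match.items():
        for better in proposer_prefs[p][:proposer_prefs[p].index(a)]:
            q = partner[better]
            if acceptor_prefs[better].index(p) < acceptor_prefs[better].index(q):
                return False
    return True


men = [[0, 1], [0, 1]]      # both men prefer woman 0
women = [[1, 0], [0, 1]]    # woman 0 prefers man 1
matching, count = gale_shapley(men, women)
print(matching, count, is_stable(matching, men, women))
```

Swapping the two arguments gives the matching that is optimal for the other side, which is the man-optimal versus woman-optimal distinction.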

Page 54

An example

• Consider the example we have already seen

• Executing GSA…

– With men proposing we get the matching (1, 1), (2, 4), (3, 2), (4, 3); this takes 7 proposals (iterations)

– With women proposing we get the matching (1, 1), (2, 3), (3, 2), (4, 4); this takes 7 proposals (iterations)

– Both matchings are stable

– The first is man-optimal: men get the best partners of any stable matching

– Likewise the second is woman-optimal

Page 55

Implementing MUCF by the GSA

• MUCF can be implemented using the GSA algorithm with preference list as follows:

– Output j assigns a preference value to each input i, equal to the urgency of the cell at the HoL of VOQij

• If VOQij is empty then the preference value of input i for output j is set to infinity

• The preference list of the output is the ordered set of its preference values for each input

– Each input assigns a preference value for each output based on the urgency of the cells, and creates the preference list accordingly

Page 56

Theorem

• A CIOQ switch with a speedup of 4 operating under the MUCF algorithm exactly matches, cell for cell, a FIFO output-queued switch.

• This is true even for Non-FIFO OQ scheduling schemes (e.g., WFQ, strict priority, etc.)

• We can achieve similar results with S = 2

Page 57

Implementation - a closer look

Main sources of difficulty

- Estimating urgency

- Matching process - too many iterations?

Estimating urgency depends on what is being emulated

- FIFO, Strict priorities - no problem

- WFQ, etc - problems

(and communicating this info among I/ps and O/ps)

Page 58

Other Work

Relax stringent requirement of exact emulation

- Least Occupied O/p First Algorithm (LOOFA)

- Can provide QoS

Keeps outputs always busy if there are packets

• A lot of work has been done in this direction

Conclusion: We must have a speedup if we want to approach the performance of OQ switches, or provide QoS

Page 59

QoS Scheduling Algorithms

Page 60

QoS Differentiation: Two options

• Stateful (per flow)

• IETF Integrated Services (Intserv)/RSVP

• Stateless (per class)

• IETF Differentiated Services (Diffserv)

Page 61

The Building Blocks: May contain more functions

• Classifier

• Shaper

• Policer

• Scheduler

• Dropper

Page 62

QoS Mechanisms

• Admission Control

– Determines whether the flow can/should be allowed to enter the network.

• Packet Classification

– Classifies the data based on admission control for desired treatment through the network

• Traffic Policing

– Measures the traffic to determine if it is out of profile. Packets that are determined to be out-of-profile can be dropped or marked differently (so they may be dropped later if needed)

• Traffic Shaping

– Provides some buffering, therefore delaying some of the data, to make sure the traffic fits into the profile (may affect only bursts, or all traffic to make it similar to Constant Bit Rate)

• Queue Management

– Determines the behavior of data within a queue. Parameters include queue depth, drop policy

• Queue Scheduling

– Determines how different queues empty onto the outbound link

Page 63

QoS Router

[Figure: QoS router. Blocks shown: classifiers, policers, per-flow queues with queue management, schedulers, and shapers.]

Page 64

Queue Scheduling Algorithms

Page 65

Scheduling at the output link of an OQ Switch

• Sharing always results in contention

• A scheduling discipline resolves contention:

• Decide when and what packet to send on the output link

– Usually implemented at the output interface

– Scheduling is key to fairly sharing resources and providing performance guarantees

[Figure: switch with links 1 to 4, each with an ingress and an egress side.]

Page 66

Output Scheduling

[Figure: scheduler]

Allocating output bandwidth

Controlling packet delay

Page 67

Types of Queue Scheduling

• Strict Priority

– Empties the highest priority non-empty queue first, before servicing lower priority queues. It can cause starvation of lower priority queues.

• Round Robin

– Services each queue by emptying a certain amount of data and then going to the next queue in order.

• Weighted Fair Queuing (WFQ)

– Empties an amount of data from a queue based on the relative weight for the queue (driven by reserved bandwidth) before servicing the next queue.

• Earliest Deadline First

– Determines the latest time a packet must leave to meet the delay requirements and services the queues in that order
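Two of these disciplines are short enough to sketch directly. The code below is an illustrative sketch, not a router implementation: `strict_priority_pick` and `EDFQueue` are hypothetical names, queue index 0 is assumed to be the highest priority, and EDF is modelled as a single deadline-ordered heap with first-come order among equal deadlines.

```python
import heapq
from collections import deque


def strict_priority_pick(queues):
    """Serve the first non-empty queue; queues[0] is the highest priority (assumed)."""
    for q in queues:
        if q:
            return q.popleft()
    return None  # nothing to send


class EDFQueue:
    """Earliest Deadline First: always send the packet whose deadline is soonest."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # arrival counter, keeps equal deadlines in arrival order

    def push(self, deadline, packet):
        heapq.heappush(self._heap, (deadline, self._seq, packet))
        self._seq += 1

    def pop(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queues = [deque(), deque(["m1", "m2"]), deque(["l1"])]
print([strict_priority_pick(queues) for _ in range(4)])

edf = EDFQueue()
edf.push(30, "a")
edf.push(10, "b")
edf.push(10, "c")
print([edf.pop() for _ in range(3)])
```

The strict-priority loop shows where starvation comes from: the low-priority queue is reached only when every queue ahead of it is empty.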

Page 68

Scheduling: Deterministic Priority

• Packet is served from a given priority level only if no packets exist at higher levels (multilevel priority with exhaustive service)

• Highest level gets lowest delay

• Watch out for starvation!

• Usually map priority levels to delay classes

[Figure: priority classes: low bandwidth urgent messages, realtime, non-realtime.]

Page 69

Scheduling: Work conserving vs. non-work-conserving

• Work conserving discipline is never idle when packets await service

• Why bother with non-work conserving? (sometimes useful for example to minimize delay jitter)

Page 70

Scheduling: Requirements

• An ideal scheduling discipline

– is easy to implement (preferably in hardware)

– is fair (each connection gets no more than what it wants; the excess, if any, is equally shared)

– provides performance bounds (Can be deterministic or statistical) Common parameters are

• bandwidth

• delay

• delay-jitter

• Loss

– allows easy admission control decisions (Choice of scheduling discipline affects ease of admission control algorithm)

• to decide whether a new flow can be allowed

Page 71

Scheduling: No Classification

FIFO

First come first serve

• This is the simplest possible. But we cannot provide any guarantees.

• With FIFO queues, if the depth of the queue is not bounded, there is very little that can be done

• We can perform preferential dropping

• We can use other service disciplines on a single queue (e.g., EDF)

Page 72

Scheduling: Class Based Queuing

• At each output port, packets of each class are queued in a separate queue.

• Service disciplines within each queue can vary (e.g., FIFO, EDF, etc.). Usually it is FIFO

• Service disciplines between classes can vary as well (e.g., strict priority, some kind of sharing, etc.)

[Figure: class based scheduling with four queues, Class 1 to Class 4.]

Page 73

Per Flow Packet Scheduling

• Each flow is allocated a separate “virtual queue”

– Lowest level of aggregation

– Service disciplines between the flows vary (FIFO, SP, etc.)

[Figure: a classifier, per-flow queues (flow 1, flow 2, …, flow n) with buffer management, and a scheduler.]


The problems caused by FIFO queues in routers

1. In order to maximize its chances of success, a source has an incentive to maximize the rate at which it transmits.

2. (Related to #1) When many flows pass through it, a FIFO queue is “unfair” – it favors the most greedy flow.

3. It is hard to control the delay of packets through a network of FIFO queues.

[Slide annotations: Fairness; Delay Guarantees]


Round Robin (RR)

• RR avoids starvation

• All sessions have the same weight and the same packet length:

[Figure: sessions A, B and C served in turn over Round #1 and Round #2.]


RR with variable packet length

[Figure: sessions A, B and C with packets of different lengths, served over Round #1 and Round #2.]

But the weights are equal!


Solution…

[Figure: sessions A, B and C served over rounds #1 to #4.]


Weighted Round Robin (WRR)

WA=3 WB=1 WC=4

round length = 8

[Figure: two consecutive rounds, #1 and #2.]


WRR – non-integer weights

WA=1.4 WB=0.2 WC=0.8

Normalize:

WA=7 WB=1 WC=4

round length = 12


Weighted round robin

• Serve a packet from each non-empty queue in turn

– Can provide protection against starvation

– It is easy to implement in hardware

• Unfair if packets are of different length or weights are not equal

• What is the solution?

• Different weights, fixed packet size

– serve more than one packet per visit, after normalizing to obtain integer weights
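The normalization step can be sketched in a few lines. This is a minimal illustration, not the slides' own implementation: the function names `integer_weights` and `wrr_round` are hypothetical, and every queue is assumed to stay backlogged so that each visit can serve its full quota.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def integer_weights(weights):
    """Scale positive weights to the smallest integers with the same ratios."""
    # Parse via str() so that 1.4 becomes exactly 7/5 rather than a binary float.
    fracs = [Fraction(str(w)) for w in weights]
    # Multiply by the least common multiple of the denominators.
    lcm = reduce(lambda a, b: a * b // gcd(a, b), (f.denominator for f in fracs))
    ints = [int(f * lcm) for f in fracs]
    # Divide out any common factor so the round is as short as possible.
    common = reduce(gcd, ints)
    return [i // common for i in ints]


def wrr_round(names, weights):
    """One WRR round: visit each queue once, serving `weight` packets per visit."""
    order = []
    for name, w in zip(names, integer_weights(weights)):
        order.extend([name] * w)
    return order


print(integer_weights([1.4, 0.2, 0.8]))  # [7, 1, 4]
print(wrr_round("ABC", [3, 1, 4]))       # ['A', 'A', 'A', 'B', 'C', 'C', 'C', 'C']
```

The round length is simply the sum of the integer weights, so a small common denominator keeps the schedule short.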


Problems with Weighted Round Robin

• Different weights, variable size packets

– normalize weights by mean packet size

• e.g. weights {0.5, 0.75, 1.0}, mean packet sizes {50, 500, 1500}

• normalize weights: {0.5/50, 0.75/500, 1.0/1500} = {0.01, 0.0015, 0.000666}; normalize again to integers: {60, 9, 4}

• With variable size packets, the mean packet size must be known in advance

• Fairness is only provided at time scales larger than the schedule
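The arithmetic in the example above can be reproduced as follows. This is a sketch under the assumption that the mean packet sizes are known exactly; `packets_per_round` is a hypothetical helper name.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def packets_per_round(weights, mean_sizes):
    """Integer packets served per round so that bytes served follow the weights."""
    # Dividing each weight by the mean packet size gives a per-packet share.
    shares = [Fraction(str(w)) / Fraction(str(s)) for w, s in zip(weights, mean_sizes)]
    # Scale the shares to the smallest integers with the same ratios.
    lcm = reduce(lambda a, b: a * b // gcd(a, b), (x.denominator for x in shares))
    counts = [int(x * lcm) for x in shares]
    common = reduce(gcd, counts)
    return [c // common for c in counts]


sizes = [50, 500, 1500]
counts = packets_per_round([0.5, 0.75, 1.0], sizes)
print(counts)                                    # [60, 9, 4]
# Mean bytes per round come out in the original 0.5 : 0.75 : 1.0 ratio.
print([c * s for c, s in zip(counts, sizes)])    # [3000, 4500, 6000]
```

A round here carries 73 packets, which illustrates why fairness only holds over time scales longer than one schedule.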


Fairness

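[Figure: sources A and B send through router R1 toward C. Link rates shown: 1.1 Mb/s, 10 Mb/s and 100 Mb/s. Each flow is shown receiving 0.55 Mb/s.]

What is the “fair” allocation: (0.55Mb/s, 0.55Mb/s) or (0.1Mb/s, 1Mb/s)?

A flow is, e.g., an http flow with a given (IP SA, IP DA, TCP SP, TCP DP).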


Fairness

[Figure: nodes A, B, C and D connected through router R1. Link rates shown: 1.1 Mb/s, 10 Mb/s, 100 Mb/s and 0.2 Mb/s.]

What is the “fair” allocation?


Max-Min Fairness

The min of the flows should be as large as possible

Max-Min fairness for single resource:

Bottlenecked (unsatisfied) connections share the residual bandwidth equally

Their share is ≥ the share held by the connections not constrained by this bottleneck

[Example: link capacity C = 10; demands F1 = 25 and F2 = 6; allocations F1’ = 5 and F2’ = 5.]


Max-Min Fairness: A common way to allocate flows

N flows share a link of rate C. Flow f wishes to send at rate W(f), and is allocated rate R(f).

1. Among the flows not yet allocated, pick the flow, f, with the smallest requested rate.

2. If W(f) ≤ C/N, then set R(f) = W(f).

3. If W(f) > C/N, then set R(f) = C/N.

4. Set N = N – 1. C = C – R(f).

5. If N>0 goto 1.
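The five steps above can be sketched as a short function. This is an illustrative version for a single link; the name `max_min_allocation` and the dictionary interface are assumptions, and the sample call uses the requested rates 0.1, 0.5, 10 and 5 on a link of rate 1.

```python
def max_min_allocation(demands, capacity):
    """Return max-min fair rates R(f) for requested rates W(f) on one link."""
    rates = {}
    remaining = capacity
    # Visit flows from the smallest request to the largest (step 1).
    pending = sorted(demands, key=demands.get)
    while pending:
        flow = pending.pop(0)
        fair_share = remaining / (len(pending) + 1)  # C/N for the current N
        # A flow gets its request if it fits, otherwise an equal share (steps 2-3).
        rates[flow] = min(demands[flow], fair_share)
        remaining -= rates[flow]                      # step 4
    return rates


rates = max_min_allocation({"f1": 0.1, "f2": 0.5, "f3": 10, "f4": 5}, capacity=1)
for flow in sorted(rates):
    print(flow, round(rates[flow], 3))
```

For these inputs the allocation is 0.1 for f1 and 0.3 for each of the other three flows, up to floating-point rounding.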


Max-Min Fairness: An example

[Figure: four flows pass through router R1 and share a link of rate 1 toward C. Requested rates: W(f1) = 0.1, W(f2) = 0.5, W(f3) = 10, W(f4) = 5.]

Round 1: Set R(f1) = 0.1

Round 2: Set R(f2) = 0.9/3 = 0.3

Round 3: Set R(f4) = 0.6/2 = 0.3

Round 4: Set R(f3) = 0.3/1 = 0.3


Max-Min Fairness

• How can an Internet router “allocate” different rates to different flows?

• First, let’s see how a router can allocate the “same” rate to different flows…


Fair Queueing

1. Packets belonging to a flow are placed in a FIFO. This is called “per-flow queueing”.

2. FIFOs are scheduled one bit at a time, in a round-robin fashion.

3. This is called Bit-by-Bit Fair Queueing.

[Figure: classification places packets into per-flow FIFOs, Flow 1 to Flow N; scheduling is bit-by-bit round robin.]
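A small simulation makes the bit-by-bit idea concrete. This is a sketch, not a router implementation: `bit_by_bit_finish_rounds` is a hypothetical name, all packets are assumed to be queued before service starts, and packet lengths are assumed to be positive integers. The sample lengths are borrowed from the equal-weights walkthrough later in the deck.

```python
from collections import deque


def bit_by_bit_finish_rounds(queues):
    """Simulate bit-by-bit round robin; return the round in which each packet ends.

    `queues` maps a flow name to a list of (packet_name, length_in_bits).
    Assumption: every packet is already queued when service starts.
    """
    pending = {flow: deque(pkts) for flow, pkts in queues.items()}
    bits_left = {}
    finish = {}
    round_no = 0
    while any(pending.values()):
        round_no += 1
        # Serve one bit from the head packet of every non-empty FIFO.
        for flow, fifo in pending.items():
            if not fifo:
                continue
            name, length = fifo[0]
            bits_left[name] = bits_left.get(name, length) - 1
            if bits_left[name] == 0:
                finish[name] = round_no
                fifo.popleft()
    return finish


print(bit_by_bit_finish_rounds({
    "A": [("A1", 4)],
    "B": [("B1", 3)],
    "C": [("C1", 1), ("C2", 1)],
    "D": [("D1", 1), ("D2", 2)],
}))
# {'C1': 1, 'D1': 1, 'C2': 2, 'B1': 3, 'D2': 3, 'A1': 4}
```

Short packets finish in early rounds regardless of how much the other flows have queued, which is the fairness property FIFO lacks.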


Weighted Bit-by-Bit Fair Queueing

Likewise, flows can be allocated different rates by servicing a different number of bits for each flow during each round.

[Figure: four flows pass through router R1 and share a link of rate 1 toward C, with R(f1) = 0.1, R(f2) = 0.3, R(f3) = 0.3, R(f4) = 0.3.]

Order of service for the four queues: … f1, f2, f2, f2, f3, f3, f3, f4, f4, f4, f1, …

Also called “Generalized Processor Sharing (GPS)”


Understanding bit by bit WFQ 4 queues, sharing 4 bits/sec of bandwidth, Equal Weights

Weights : 1:1:1:1

[Figure: queues A to D feed one link; the time axis runs from 0 to 6. Queued packets (length in bits): A1 = 4, B1 = 3, C1 = 1, C2 = 1, D1 = 1, D2 = 2.]

Round 1: bits sent A1, B1, C1, D1. D1, C1 depart at R=1. A2 = 2 and C3 = 2 arrive.

Round 2: bits sent A1, B1, C2, D2. C2 departs at R=2.


Understanding bit by bit WFQ 4 queues, sharing 4 bits/sec of bandwidth, Equal Weights

Weights : 1:1:1:1

Round 3: bits sent A1, B1, C3, D2. D2, B1 depart at R=3.

Round 4: bits sent A1, C3, A2, A2 (the figure extends the round labels to 5 and 6). A1 departs at R=4.

C3, A2 depart at R=6.

Departure order for packet by packet WFQ: sort packets by finish round.

[Figure: sorted transmission sequence as drawn, one entry per bit: C1, D1, C2, B1, B1, B1, D2, D2, A1, A1, A1, A1, followed by C3, C3, A2, A2.]


Understanding bit by bit WFQ 4 queues, sharing 4 bits/sec of bandwidth, Weights 3:2:2:1

Weights : 3:2:2:1

[Figure: queues A to D feed one link; the time axis runs from 0 to 6. Queued packets (length in bits): A1 = 4, B1 = 3, C1 = 1, C2 = 1, D1 = 1, D2 = 2. Per round, queues A, B, C and D are served 3, 2, 2 and 1 bits respectively.]

Round 1: bits sent A1, A1, A1, B1, then B1, C1, C2, D1. A2 = 2 and C3 = 2 arrive. D1, C2, C1 depart at R=1.


Understanding bit by bit WFQ 4 queues, sharing 4 bits/sec of bandwidth, Weights 3:2:2:1

Weights : 3:2:2:1

Round 2: bits sent A1, A2, A2, B1, then C3, C3, D2, D2 (the figure labels the end of this sequence as round 3). B1, A2, A1 depart at R=2. D2, C3 depart at R=2.

Departure order for packet by packet WFQ: sort packets by finish time.

[Figure: sorted transmission sequence as drawn, one entry per bit: C1, C2, D1, A1, A1, A1, A1, A2, A2, B1, B1, B1, followed by C3, C3, D2, D2.]


Packetized Weighted Fair Queueing (WFQ)

Problem: We need to serve a whole packet at a time.

Solution:

1. Determine what time a packet, p, would complete if we served flows bit-by-bit. Call this the packet’s finishing time, Fp.

2. Serve packets in the order of increasing finishing time.

Also called “Packetized Generalized Processor Sharing (PGPS)”
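The two steps can be illustrated with a deliberately simplified sketch. The function name `wfq_order` is hypothetical, and the code assumes that all packets are queued at time 0, so a packet's finishing round is just the running sum of length / weight within its flow. A real scheduler must also track the round number at each arrival, which is omitted here. Ordering by finishing round gives the same order as finishing time, because the round number never decreases over time.

```python
from fractions import Fraction


def wfq_order(queues, weights):
    """Return (finishing_round, packet) pairs in packetized WFQ service order.

    `queues` maps flow -> list of (packet_name, length_in_bits);
    `weights` maps flow -> integer number of bits served per round.
    Simplifying assumption: every packet is already queued at time 0.
    """
    tagged = []
    for flow, packets in queues.items():
        finish = Fraction(0)
        for name, length in packets:
            # Each packet finishes length/weight rounds after its predecessor.
            finish += Fraction(length, weights[flow])
            tagged.append((finish, name))
    # The sort is stable, so ties keep the order in which flows were listed.
    tagged.sort(key=lambda item: item[0])
    return tagged


queues = {
    "A": [("A1", 4)],
    "B": [("B1", 3)],
    "C": [("C1", 1), ("C2", 1)],
    "D": [("D1", 1), ("D2", 2)],
}
weights = {"A": 3, "B": 2, "C": 2, "D": 1}
print([name for _, name in wfq_order(queues, weights)])
# ['C1', 'C2', 'D1', 'A1', 'B1', 'D2']
```

Exact fractions avoid spurious reordering of packets whose finishing rounds differ only by floating-point error.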


WFQ is complex

• There may be hundreds to millions of flows; the linecard needs to manage a FIFO queue for each flow.

• The finishing time must be calculated for each arriving packet.

• Packets must be sorted by their departure time.

• Most effort in QoS scheduling algorithms goes into finding practical algorithms that can approximate WFQ!

[Figure: egress linecard. Packets arriving to the egress linecard enter per-flow queues 1, 2, 3, ..., N; Fp is calculated for each packet, and the packet with the smallest Fp departs.]


When can we Guarantee Delays?

• Theorem

If flows are leaky bucket constrained and all nodes employ GPS (WFQ), then the network can guarantee worst-case delay bounds to sessions.


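Deterministic analysis of a router queue: FIFO case

[Figure: model of a router queue, with arrivals A(t), backlog B(t), service rate R and departures D(t). The plot shows cumulative bytes versus time for A(t) and D(t); the gaps between the curves are labeled B(t) and FIFO delay, d(t).]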


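[Figure: classification into Flow 1 to Flow N, with arrivals A1(t) to AN(t), served by a WFQ scheduler at rates R(f1) to R(fN) to give departures D1(t) to DN(t). The plot shows cumulative bytes versus time for A1(t) and D1(t), with service rate R(f1).]

Key idea: In general, we don’t know the arrival process. So let’s constrain it.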


Let’s say we can bound the arrival process

Number of bytes that can arrive in any period of length t is bounded by: $\sigma + \rho t$

This is called “$(\sigma, \rho)$ regulation”

[Figure: cumulative bytes versus time for A1(t), over a period of length t.]


The leaky bucket “$(\sigma, \rho)$” regulator

[Figure: tokens arrive at rate $\rho$ into a token bucket of size $\sigma$. Packets wait in a packet buffer and are released at one byte (or packet) per token.]
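A regulator of this kind can be sketched in a few lines. This is an illustrative version, not a production shaper: the class name `TokenBucket`, the method `conforms`, the byte-based units and the choice to start with a full bucket are all assumptions made for the example.

```python
class TokenBucket:
    """(sigma, rho) regulator: tokens accrue at `rho` per second, up to `sigma`."""

    def __init__(self, sigma, rho):
        self.sigma = sigma      # bucket depth, in bytes
        self.rho = rho          # token rate, in bytes per second
        self.tokens = sigma     # assumption: the bucket starts full
        self.last = 0.0         # time of the last update, in seconds

    def conforms(self, now, size):
        """Return True and spend tokens if `size` bytes may leave at time `now`."""
        # Add the tokens earned since the last call, capped at the bucket depth.
        self.tokens = min(self.sigma, self.tokens + self.rho * (now - self.last))
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # caller keeps the packet in the buffer and retries later


bucket = TokenBucket(sigma=1000, rho=100)
print(bucket.conforms(0.0, 800))  # True: 200 tokens remain
print(bucket.conforms(1.0, 400))  # False: only 300 tokens available
print(bucket.conforms(2.0, 400))  # True: 400 tokens available
```

Because the bucket never holds more than sigma tokens, the bytes released in any period of length t cannot exceed sigma + rho * t.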


$(\sigma, \rho)$ Constrained Arrivals and Minimum Service Rate

[Figure: cumulative bytes versus time for A1(t) and D1(t), with service rate R(f1); the maximum gaps between the curves are labeled dmax and Bmax.]

Theorem [Parekh, Gallager ’93]: If flows are leaky-bucket constrained, and routers use WFQ, then end-to-end delay guarantees are possible.

For no packet loss, $B \ge \sigma$.

If $R(f_1) \ge \rho$ then $d(t) \le \sigma / R(f_1)$.
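The single-node bound can be checked with a short calculation. The sketch below works in the fluid (bit-by-bit) model and assumes flow 1 is $(\sigma, \rho)$ constrained and receives at least rate $R(f_1) \ge \rho$ whenever it is backlogged. The symbol $s$, the start of the current busy period of flow 1, is introduced here for illustration only.

```latex
% Arrivals in (s, t] are at most \sigma + \rho (t - s).
% During a busy period starting at s, departures are at least R(f_1)(t - s).
\begin{align*}
B(t) &= \bigl[A_1(t) - A_1(s)\bigr] - \bigl[D_1(t) - D_1(s)\bigr] \\
     &\le \sigma + \rho\,(t - s) - R(f_1)\,(t - s) \\
     &\le \sigma \qquad \text{since } R(f_1) \ge \rho, \\
d(t) &\le \frac{B_{\max}}{R(f_1)} \le \frac{\sigma}{R(f_1)}.
\end{align*}
```

For instance, with $\sigma = 1000$ bytes and $R(f_1) = 100$ bytes/s, the backlog never exceeds 1000 bytes and the delay through this node is at most 10 s.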