Block Design Review: Queue Manager and Scheduler
Amy M. Freestone, Sailesh Kumar



2 - Amy M. Freestone, Sailesh Kumar - 04/18/23

Overview

QM/Scheduler
» Function:
– Enqueue and dequeue from queues
– Scheduling algorithm (5 ports, N queues per port, WDRR across queues)
– Drop policy
– RR port scheduling, rate controlled
» Memory accesses:
– SRAM:
    Q-Array reads and writes
    Scheduling data structure reads and writes
    QLength data structure reads and writes
    Queue weight, discard threshold, and port rate reads
    Buffer descriptor reads (to retrieve the packet length)

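[Figure: pipeline block diagram with blocks Phy Int Rx, Key Extract, Lookup, Hdr Format, QM/Schd, Switch Tx and SWITCH, annotated with the QM/Schd message formats. Fields shown: V (valid bit, 1b), Rsv (3b), Port (4b), Buffer Handle (24b), Rsv (4b), QID (20b), Frame Length (16b), Stats Index (16b), Buffer Handle (32b).]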


Data Structures

[Figure: data structures and high-level cache architecture]

Message fields shown: V (valid bit, 1b), Rsv (3b), Port (4b), Buffer Handle (24b), Rsv (4b), QID (20b), Frame Length (16b), Stats Index (16b), Buffer Handle (32b)

High-level cache architecture
» CAM (16 entries): queue id (20b), shown alongside the Head Valid, Tail Valid and Qlen Valid flags
» Local memory (16 entries): queue length, discard threshold, weight quantum
» SRAM Q-array (16 entries): queue head/tail/count

SRAM
» Q params (per queue): queue length, discard threshold, weight quantum
» Q descriptor (per queue): head, tail, count
» Buffer descriptor: LW0-1 xxx; LW2 Pkt_Size (16b), xxx; LW3-7 xxx

Threads: one Enqueuer, five Dequeuers


Interface

Scratch Ring Interface
» For both ingress and egress

Threads used: 7
» Thread 0: Free list maintenance and initialization
» Threads 1-5: Dequeue for ports 0-4
» Thread 6: Enqueue for all 5 ports

Threads are synchronized after each round
» A round enqueues up to 5 packets
» A round dequeues up to 5 packets, one for each port

[Figure: pipeline block diagram and QM/Schd message formats, repeated from the Overview slide.]


Thread Synchronization

Signal sequence per round (n.X denotes signal X of thread n):
» Thread 0 (free buffer): wait 0.1, ..., 0.5; free list work; set 1.A, ..., 6.A
» Threads 1-5 (dequeuers for ports 0-4): thread n does wait n.A; dequeue; wait n.B; write_old_tail(); set 0.n
» Thread 6 (enqueuer): wait 6.A; enqueue; set 1.B, ..., 5.B

During initialization, signals 1.A, ..., 6.A are set.

Note that in the enqueue thread, signal A is not used; it is implemented using a register which is set by thread 0 and reset by the enqueuer.
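The handshake above can be modelled on a host machine to check the ordering it enforces. The following is only a sketch: it uses Python threads and semaphores in place of microengine contexts and signals, models the enqueuer's register-based signal A as an ordinary semaphore, and the names `run_rounds`, `sig_a`, `sig_b` and `sig_0` are illustrative, not taken from the source code.

```python
import threading

NUM_PORTS = 5  # threads 1..5 dequeue for ports 0..4, thread 6 enqueues


def run_rounds(num_rounds):
    """Run the signal handshake for num_rounds rounds and return the event log."""
    log = []
    log_lock = threading.Lock()

    # One semaphore per signal. The A signals start set, as at initialization.
    sig_a = {t: threading.Semaphore(1) for t in range(1, NUM_PORTS + 2)}  # 1.A..6.A
    sig_b = {t: threading.Semaphore(0) for t in range(1, NUM_PORTS + 1)}  # 1.B..5.B
    sig_0 = {t: threading.Semaphore(0) for t in range(1, NUM_PORTS + 1)}  # 0.1..0.5

    def record(*event):
        with log_lock:
            log.append(event)

    def free_list_thread():  # thread 0
        for rnd in range(num_rounds):
            for t in sig_0:
                sig_0[t].acquire()          # wait 0.1 .. 0.5
            record("maintain_fl", rnd)
            for t in sig_a:
                sig_a[t].release()          # set 1.A .. 6.A

    def dequeuer(t):  # threads 1..5
        for rnd in range(num_rounds):
            sig_a[t].acquire()              # wait t.A
            record("dequeue", t - 1, rnd)
            sig_b[t].acquire()              # wait t.B
            record("write_old_tail", t - 1, rnd)
            sig_0[t].release()              # set 0.t

    def enqueuer():  # thread 6
        for rnd in range(num_rounds):
            sig_a[NUM_PORTS + 1].acquire()  # wait 6.A
            record("enqueue", rnd)
            for t in sig_b:
                sig_b[t].release()          # set 1.B .. 5.B

    threads = [threading.Thread(target=free_list_thread),
               threading.Thread(target=enqueuer)]
    threads += [threading.Thread(target=dequeuer, args=(t,))
                for t in range(1, NUM_PORTS + 1)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return log
```

In this model, each port's write_old_tail() in a round always follows that round's enqueue, and no dequeuer starts the next round until thread 0 has seen all five dequeuers finish.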


Resource Usage

Local memory: 1512 bytes
» #define PAR_CACHE_LM_BASE 0x0
» #define PORT_DATA_LM_BASE 0x100
» #define BBUF_FL_LM_BASE 0x1a8
» #define BBUF_LM_BASE 0x1fc
» #define FL_LM_BASE 0x598

SRAM
» Queue descriptors (16B per queue)
» Queue parameters (16B per queue)
» Port rates (4B per port)
» Free lists
» Batch buffers

Enqueue:
» 15 signals, 16 RD xfer, 10 WR xfer

Dequeue:
» 9 signals; QM uses 4 RD xfer, 1 WR xfer. SCH uses more xfers


Local Memory Map (JDD, 4/1/08)

Region start addresses (end addresses given where the slide shows them):
» 0x000: PAR Cache
» 0x100: Port Data (ends at 0x1A7)
» 0x1A8: Batch Buf FL (ends at 0x1FB)
» 0x1FC: Batch Buffers, 21 * 44 bytes (ends at 0x597)
» 0x598: Free List, >= 40 * 4 bytes
» 0x680: Port Rate Control Data (residualResult written here)
» 0x690: Unallocated (ends at 0x9FF)

Port Data Structure:
» 0: Old Tail LM
» 1: Old Tail SRAM
» 2: head SRAM
» 3: tail SRAM
» 4: tail offset (first empty slot)
» 5: nexthead LM
» 6: LM (head|tail)
» 7: unused
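As a quick consistency check on the map, the region sizes can be derived from the start addresses. The sketch below assumes each region runs up to the start of the next one and that local memory ends at 0x9FF as drawn; the names `REGIONS` and `region_sizes` are illustrative.

```python
# Region start addresses as listed on the slide (byte addresses in local memory).
REGIONS = [
    ("PAR cache", 0x000),
    ("Port data", 0x100),
    ("Batch buffer free list", 0x1A8),
    ("Batch buffers", 0x1FC),
    ("Free list", 0x598),
    ("Port rate control data", 0x680),
    ("Unallocated", 0x690),
]
LM_END = 0x9FF  # last byte address shown on the map


def region_sizes():
    """Return {name: size_in_bytes}, taking each region to end where the next begins."""
    sizes = {}
    for i, (name, start) in enumerate(REGIONS):
        end = REGIONS[i + 1][1] if i + 1 < len(REGIONS) else LM_END + 1
        sizes[name] = end - start
    return sizes


if __name__ == "__main__":
    for name, size in region_sizes().items():
        print(f"{name:24s} {size:5d} bytes")
```

Under these assumptions the batch buffer region comes out at 924 bytes, which matches 21 * 44, the batch buffer free list at 84 bytes (21 four-byte entries), and the free list at 232 bytes, which satisfies the ">= 40 * 4 bytes" annotation.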


Data Consistency Precautions

Only one thread (dequeue or enqueue) reads in the queue parameters of a queue
» Flags are used to ensure that while thread x is reading in the Q params
– thread y doesn't read them
– thread y also waits until thread x has stored the data it read into the cache
» Flags are stored in local memory
– Three flags are used (head valid, tail valid, and Q param valid)
– Head valid implies the dequeue thread has cached the Q descriptor
– Tail valid implies the enqueue thread has cached the Q descriptor
– Both valid means both head and tail are cached

Before a thread swaps out
» Move relevant register contents (flags, queue length) into local memory

After a thread resumes
» Move relevant local memory data back into registers

Cache contents are refreshed every 4k iterations
Port rates held in registers are refreshed every 4k iterations


Initialization

Thread 0 initializes all shared data structures ???
» CAM and Q-array (cam_clear and Q-array empty)
» Memory controller variables
– Set SRAM Channel CSR to ignore cellcount and eop bit in the buffer handle
» Local memory
– Queue parameter cache (all zeroes)
– Scheduling data structures (set by scheduler)
» SRAM
– Queue parameters (length, weight quantum, discard threshold)
– Queue descriptors (all zeroes)
– Port rates (as per token bucket)
– Free list (set by free list macro)
– Scheduling data structure (set by scheduler)


Enqueue Thread

Operates in batch mode (5 packets at a time)
» Read 5 requests from the scratch ring
» Check CAM for the 5 queue ids read
» If miss
– Evict LRU entry (write back queue params and descr)
– Read queue params from SRAM into cache
– Read queue descriptor into Q-array
– Update CAM
» Check for discard
– If discard, call dl_drop_buf
» If admit
– Send enqueue command to Q-array
– Check if queue was already active
    If not, call add_queue_to_tail
– Update the queue length in cache
– Write back queue length (in future may want to do less often)
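The enqueue steps above can be sketched as a small host-side model. This is not the microcode: the class `EnqueueModel`, the dictionary layout, the drop rule (discard when the new length would exceed the threshold) and the activity test (queue length was non-zero before the enqueue) are all assumptions made for illustration. The lists `sched_tail` and `dropped` merely stand in for add_queue_to_tail and dl_drop_buf.

```python
from collections import OrderedDict

CACHE_ENTRIES = 16  # CAM / Q-array size from the Data Structures slide
BATCH_SIZE = 5


class EnqueueModel:
    def __init__(self, sram_params):
        # sram_params: {qid: {"length": int, "threshold": int, "quantum": int}}
        self.sram_params = sram_params
        self.cache = OrderedDict()  # qid -> cached params, LRU entry first
        self.sched_tail = []        # stands in for add_queue_to_tail()
        self.dropped = []           # stands in for dl_drop_buf()

    def _lookup(self, qid):
        if qid in self.cache:                     # CAM hit
            self.cache.move_to_end(qid)
            return self.cache[qid]
        if len(self.cache) >= CACHE_ENTRIES:      # CAM miss: evict the LRU entry
            old_qid, old = self.cache.popitem(last=False)
            self.sram_params[old_qid] = old       # write back evicted params
        params = dict(self.sram_params[qid])      # read params from SRAM
        self.cache[qid] = params
        return params

    def enqueue_batch(self, requests):
        """requests: up to 5 (qid, buffer_handle, frame_length) tuples."""
        for qid, handle, length in requests[:BATCH_SIZE]:
            q = self._lookup(qid)
            if q["length"] + length > q["threshold"]:  # assumed drop rule
                self.dropped.append(handle)
                continue
            was_active = q["length"] > 0               # assumed activity test
            q["length"] += length
            if not was_active:
                self.sched_tail.append(qid)
            # Write back the queue length after every enqueue, as the slide does.
            self.sram_params[qid]["length"] = q["length"]
```

With a discard threshold of 0, every non-empty packet is dropped in this model, which mirrors the threshold test described on the validation slide.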


Dequeue Thread (per port)

One thread handles one port
» Done for the round if port rate $$tx_q_flow_control is set, or the port is inactive (port_active macro), or the tokens are exhausted
» If the current batch is done, call the get_head macro
» If the batch buffer is non-empty, consider the first queue_id
– Check CAM for the queue_id
– If miss
    Evict LRU entry (write back queue params and descr)
    Read queue params from SRAM into cache
    Read queue descriptor into Q-array and update CAM
– If hit, or after data is ready
    Send dequeue command to Q-array
    Call dl_sink_1ME_SCR_1words
– Read the pkt_length from the buffer descriptor
– Update queue length (and write back) and the credit
    If credit <= 0 and queue_length > 0 then add_queue_to_tail
    If queue_length <= 0 OR credit <= 0 then incr. batch_index
    If batch_index = 5 OR queue_id = 0 then call advance_head
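The three closing conditions are the WDRR bookkeeping, and a short sketch may make them easier to follow. The function names, the use of bytes for both queue length and credit, and the choice to re-add a queue with its remaining deficit plus one weight quantum are assumptions for illustration; the slides do not state how the credit passed to add_queue_to_tail is computed.

```python
BATCH_SIZE = 5


def dequeue_step(queue_length, credit, pkt_length, quantum, batch_index):
    """Apply the post-dequeue updates listed on the slide for one packet.

    Returns (queue_length, credit, batch_index, requeue_credit), where
    requeue_credit is None unless the queue is re-added to the tail.
    """
    queue_length -= pkt_length
    credit -= pkt_length
    requeue_credit = None
    if credit <= 0 and queue_length > 0:
        # Assumption: the queue re-enters the schedule with deficit + quantum.
        requeue_credit = credit + quantum
    if queue_length <= 0 or credit <= 0:
        batch_index += 1  # move on to the next queue in the batch buffer
    return queue_length, credit, batch_index, requeue_credit


def batch_done(batch_index, queue_id):
    """True when advance_head should be called for this port."""
    return batch_index == BATCH_SIZE or queue_id == 0
```

For example, a queue holding 3000 bytes with 500 bytes of credit that sends a 1500-byte packet ends with a credit of -1000, so in this sketch it is re-added to the tail with 500 and the thread moves to the next batch entry.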


Enqueue Thread

[Flowchart: enqueue thread, with instruction counts]
» Read 15 words from scratch: 28 inst.
» For 5 q_ids, check CAM hit; if miss, write back LRU and read queue param/descriptor: 40/31 inst. per Q, 202/157 inst. total
» Admit?
– No: dl_drop_buf()
– Yes: enqueue / update Q params
» Active?
– No: add_queue_to_tail() (x instr, x = 18-49); SCH reads
» Write back the queue length
» + 6 inst. for signals
» Loop around

Per packet
» 41 inst. if discard
» If admit: 62 + add_q_2_tail
» Total: 205 / 310 + 5x

For all 5 requests
» Worst case: 545 + 5x
» All discard: 395
» All accept/hit: 500 + 5x

Memory accesses annotated on the chart
» 2x5 writes (1 and 3 words)
» 2x5 reads (3 and 2 words)
» 2x5 writes (1 and 1 word)
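The batch totals can be reproduced from the per-step counts. The sketch below is an interpretation, not something the slide spells out: it reads "40/31" as the per-queue CAM cost for a miss and a hit, takes the extra 2 instructions from the 202/157 totals as a fixed cost, charges every admitted packet an add_queue_to_tail() call of x instructions, and adds a fixed remainder of 5 instructions because that is the value that makes the quoted totals add up (the slide separately lists "+ 6 inst. for signals"). All names are illustrative.

```python
BATCH = 5
READ_SCRATCH = 28                         # read 15 words from scratch
CAM_MISS, CAM_HIT, CAM_FIXED = 40, 31, 2  # 5*40+2 = 202, 5*31+2 = 157
DISCARD, ADMIT = 41, 62                   # per-packet cost; admit excludes add_queue_to_tail
REMAINDER = 5                             # assumption: chosen to match the slide's totals


def enqueue_batch_instructions(misses, admits, x):
    """Instruction estimate for one batch of 5 enqueue requests.

    misses: CAM misses in the batch (0..5)
    admits: admitted packets, each assumed to call add_queue_to_tail (0..5)
    x:      add_queue_to_tail() cost, 18..49 per the slide
    """
    cam = CAM_FIXED + CAM_MISS * misses + CAM_HIT * (BATCH - misses)
    packets = admits * (ADMIT + x) + (BATCH - admits) * DISCARD
    return READ_SCRATCH + cam + packets + REMAINDER
```

With x = 0 this gives 545 for five misses and five admits, 395 for five hits that are all discarded, and 500 for five hits that are all admitted, matching "545 + 5x", "395" and "500 + 5x".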


Dequeue Thread (per port)

[Flowchart: dequeue thread, with instruction counts]
» Rate_control: 27 inst.
» If curr_queue = 0, get_head(): 27 inst.
» Check CAM, evict, load: 32/44 inst.
» Update cache, dequeue: 34 inst.
» Send tx_msg, read pkt_len: 24+ inst.
» Update credit/q_len, Wr q_len: 13 inst.
» add_queue_to_tail() and advance_head() as needed
» Write_old_tail and loop around

Totals
» Worst case: 320
» Best: 170
» Adv_head: 35-63 inst.
» Add_queue_to_tail: 18-49 inst.
» Overheads: 13 inst.

Memory accesses annotated on the chart
» 1 read (once / 16K cycles)
» 2 writes (1 and 3 words), 2 reads (3 and 2 words)
» 1 read
» 1 read
» 1 write


Dequeue Rate Control (Updated by JDD)

Token bucket
» The unit of port_rate is bytes per 4096 clocks (ME clock/16 MHz).
» curr_time is the count of 16-clock ticks (ME clock/16 MHz).
» last_time is the time when the last packet was sent.
» IF PORT IS INACTIVE THEN tokens = 4095
» ELSE IF (tokens = 4095)
– SEND PACKET
– last_time := curr_time
– tokens = tokens - pkt_length
» ELSE
– result = ((curr_time - last_time) x port_rate) + residualResult // 16 x 16 multiply
– residualResult = (result << 22) >> 22 // save bits shifted out to add back in next time
– tokens = min [ 4095, tokens + (result >> 10) ]
– IF (tokens > 0)
    SEND PACKET
    last_time := curr_time
    tokens = tokens - pkt_length

Port rates
» Must be specified in the LSB 16 bits
» 1 unit = 683 Kbps
» Max port rate = 64K = 44.8 Gbps

Port rate word: Reserved (16b) | Port rate (16b)
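A host-side sketch of this token bucket follows. It is one reading of the pseudocode, with several assumptions: the refill is committed only when a packet is actually sent (so that elapsed time is not counted twice while last_time is unchanged), the elapsed time is masked to 16 bits to mirror the 16 x 16 multiply, and the class and function names are illustrative. The helper `unit_rate_bps` assumes one rate unit is one byte per 16 x 1024 = 16384 ME clocks, which is what the 16-clock timestamp and the `>> 10` shift imply; at 1.4 GHz that reproduces the 683 Kbps and 44.8 Gbps figures.

```python
MAX_TOKENS = 4095
FRAC_BITS = 10
FRAC_MASK = (1 << FRAC_BITS) - 1


class PortRateControl:
    def __init__(self, port_rate):
        self.port_rate = port_rate & 0xFFFF  # rate lives in the low 16 bits
        self.tokens = MAX_TOKENS
        self.last_time = 0                   # in 16-clock ticks
        self.residual = 0                    # residualResult

    def try_send(self, curr_time, pkt_length, port_active=True):
        """Return True if a packet of pkt_length bytes may be sent at curr_time."""
        if not port_active:
            self.tokens = MAX_TOKENS
            return False
        if self.tokens != MAX_TOKENS:
            elapsed = (curr_time - self.last_time) & 0xFFFF
            result = elapsed * self.port_rate + self.residual
            refilled = min(MAX_TOKENS, self.tokens + (result >> FRAC_BITS))
            if refilled <= 0:
                return False                 # assumption: nothing is committed
            self.tokens = refilled
            # Low 10 bits, the same value as (result << 22) >> 22 on 32 bits.
            self.residual = result & FRAC_MASK
        self.last_time = curr_time
        self.tokens -= pkt_length
        return True


def unit_rate_bps(me_clock_hz=1.4e9, clocks_per_unit=16 << FRAC_BITS):
    """Bits per second represented by port_rate = 1 (assumed unit, see above)."""
    return 8 * me_clock_hz / clocks_per_unit
```

With `port_rate=1024` the bucket earns one byte per tick, so after three back-to-back 1500-byte packets at tick 0 the balance is -405 and the next packet is held until more than 405 ticks have elapsed.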


Performance Analysis

Dequeue thread runs much longer than the enqueue thread
» Dequeue
– 1273 cycles in case of a cache miss and add_queue_to_tail() and advance_head()
– 867 cycles in case of cache hit and no scheduler calls
» Enqueue
– 876 cycles in case of all 5 cache misses
– 342 cycles in case of a single enqueue and cache hit

Dequeue takes more time due to memory accesses
» Read Queue_param: 110 cycles
» Dequeue: 120 cycles
» Read pkt_len: 110 cycles

There are a few idle cycles at present
» Can be removed by giving higher priority to dequeue threads


File locations (in …/IPv4_MR/)

Code
» src/qm/PL/common_macros.uc
» src/qm/PL/dequeue.uc
» src/qm/PL/enqueue.uc
» src/qm/PL/fl_macros.uc
» src/qm/PL/qm.h
» src/qm/PL/qm.uc
» src/qm/PL/sched_macros.h

Includes
» ../dispatch_loop/dl_source_WU.uc
– dl_buf_drop() and dl_sink_1ME_SCR_1words() functions
» Also uses local memory read and write macros (localmem.uc)


Queue Manager Validation

Tested
» Threshold length discards (set length at 0, and tested whether packets are enqueued)
» Enqueue
– Single port, single queue active
– Multiple ports/queues active
– Cache hit/miss (not all scenarios are tested)
» Dequeue
– Rate control partially tested (set the port rate at 0, and see if packets are dequeued)
– Partial fairness test (set quantum at 0, and see if packets are dequeued)
– Multiple active ports/queues

Both queue managers enabled
» There is one bug concerning Q-array contention


Cycle Budget
» 76B packet
» 1.4 GHz clock rate
» 1.4 Gcycle/sec
» 5 Gbps => 170 cycles per packet
– Dequeue worst-case = 320 inst. (best case 170 inst.)
– Enqueue worst-case = 545 + 5x inst. for 5 packets
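The 170-cycle figure follows from the packet size, clock rate and line rate. The short calculation below assumes a line rate of 5 Gbps, which is the value that reproduces the slide's number; the function name is illustrative.

```python
def cycles_per_packet(pkt_bytes=76, line_rate_bps=5e9, clock_hz=1.4e9):
    """Clock cycles available per packet at the given line rate."""
    packet_time_s = pkt_bytes * 8 / line_rate_bps  # time on the wire per packet
    return packet_time_s * clock_hz


if __name__ == "__main__":
    print(f"{cycles_per_packet():.2f} cycles per packet")
```

A 76-byte packet is 608 bits, which takes 121.6 ns at 5 Gbps, or about 170 cycles at 1.4 GHz. That is the per-packet budget against which the instruction counts above are compared.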


Scheduling Structure Overview

[Figure: scheduling structure]
» Each port (Port 0 ... Port 4) has Head, Next Head and Tail pointers into a chain of batch buffers
» A batch buffer holds entries Queue 0 / Credits 0 ... Queue 4 / Credits 4 and an SRAM next pointer
» Part of each chain is labelled "Batch Buffers in SRAM"
» Free List (for SRAM batch buffers): stack in local memory, stack in SRAM
» Batch Buffer Free List (for LM batch buffers): stack in local memory


Scheduling Structure Interface

Scheduling structure macros contained in \src\qm\PL\sched_macros.uc
» add_queue_to_tail(queue, credits, port)
» get_head(port, head_ptr)
» advance_head(port, sig_a, sig_b)
» port_active(port, label)
» write_old_tail(port, sig_a, sig_b)

Free list macro contained in \src\qm\PL\fl_macros.uc
» maintain_fl()