Block Design Review:
Queue Manager and Scheduler
Amy M. Freestone, Sailesh Kumar
2 - Amy M. Freestone, Sailesh Kumar - 04/18/23
Overview
QM/Scheduler
» Function:
– Enqueue and Dequeue from queues
– Scheduling algorithm (5 ports, N queues per port, WDRR across queues)
– Drop Policy
– RR port scheduling, rate controlled
» Memory Accesses:
– SRAM:
  Q-Array Reads and Writes
  Scheduling Data Structure Reads and Writes
  QLength Data Structure Reads and Writes
  Queue weight, discard threshold, and port rates Reads
  Retrieve Packet Length from Buffer Descriptor Reads
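[Figure: pipeline block diagram with blocks Phy Int Rx, Key Extract, Lookup, Hdr Format, QM/Schd, Switch Tx, and SWITCH.]
[Figure: three-word message fields: Buffer Handle (32b); Rsv (4b), Port (4b), QID (20b), Rsv (4b); Frame Length (16b), Stats Index (16b).]
[Figure: one-word message fields: V (1b), Rsv (3b), Port (4b), Buffer Handle (24b). V: Valid Bit.]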
Data Structures
[Figure: high-level cache architecture. The message formats from the Overview slide are repeated.]
» CAM (16 entries): QID (20b) with Head Valid, Tail Valid, and Qlen Valid flags
» Local memory (16 entries): Queue length, Discard threshold, Weight quantum
» SRAM Q-array (16 entries): Queue head/tail/count
» SRAM, Q params (per queue): Queue length, Discard threshold, Weight quantum
» SRAM, Q Descrpt. (per queue): Head, Tail, Count
» SRAM, Buf. Descrpt.: LW0-1 xxx; LW2 Pkt_Size (16b), xxx; LW3-7 xxx
» Threads shown: one Enqueuer, five Dequeuers
Interface
Scratch Ring Interface
» For both ingress and egress
Threads used: 7
» Thread 0: Free list maintenance and initialization
» Thread 1-5: Dequeue for port 0-4
» Thread 6: Enqueue for all 5 ports
Threads are synchronized after each round
» A round enqueues up to 5 packets
» Dequeues up to 5 packets, one for each port
[Figure: pipeline block diagram and message formats, repeated from the Overview slide.]
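The message formats in the figure can be illustrated with a small C sketch. The slides give field widths only, so the bit positions used here (fields packed MSB first in the order Rsv, Port, QID, Rsv and V, Rsv, Port, Buffer Handle) are assumptions, as are the names enq_request_t, decode_enq_request, and encode_tx_msg.

```c
#include <stdint.h>

/* Hypothetical decoded view of one three-word enqueue request. */
typedef struct {
    uint32_t buffer_handle; /* 32b */
    uint8_t  port;          /* 4b  */
    uint32_t qid;           /* 20b */
    uint16_t frame_length;  /* 16b */
    uint16_t stats_index;   /* 16b */
} enq_request_t;

/* Decode three ring words. Assumed layout (not confirmed by the slides):
 *   w[0] = Buffer Handle(32)
 *   w[1] = Rsv(4) | Port(4) | QID(20) | Rsv(4), MSB first
 *   w[2] = Frame Length(16) | Stats Index(16) */
enq_request_t decode_enq_request(const uint32_t w[3])
{
    enq_request_t r;
    r.buffer_handle = w[0];
    r.port          = (uint8_t)((w[1] >> 24) & 0xF);
    r.qid           = (w[1] >> 4) & 0xFFFFF;
    r.frame_length  = (uint16_t)(w[2] >> 16);
    r.stats_index   = (uint16_t)(w[2] & 0xFFFF);
    return r;
}

/* Build the one-word message with the valid bit set. Assumed layout:
 *   V(1) | Rsv(3) | Port(4) | Buffer Handle(24), MSB first */
uint32_t encode_tx_msg(uint8_t port, uint32_t buffer_handle)
{
    return (1u << 31)
         | ((uint32_t)(port & 0xF) << 24)
         | (buffer_handle & 0xFFFFFFu);
}
```

Five such requests occupy 15 ring words, which matches the 15 words the enqueue thread reads per round.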
Thread Synchronization
[Figure: per-thread signal sequence for one round]
» Thread 0 (Free buffer): wait 0.1, ..., 0.5; ...; set 1.A, ..., 6.A
» Thread n, n = 1-5 (Dequeuer port n-1): wait n.A; dequeue; ...; wait n.B; write_old_tail(); set 0.n
» Thread 6 (Enqueuer): wait 6.A; ...; enqueue; set 1.B, ..., 5.B
During initialization, set signals 1.A, ..., 6.A
Note that in the enqueue thread, signal A is not used; it is implemented using a register which is set by thread 0 and reset by the enqueuer
Resource Usage
Local memory: 1512 bytes
» #define PAR_CACHE_LM_BASE 0x0
» #define PORT_DATA_LM_BASE 0x100
» #define BBUF_FL_LM_BASE 0x1a8
» #define BBUF_LM_BASE 0x1fc
» #define FL_LM_BASE 0x598
SRAM
» Queue descriptors (16B per queue)
» Queue parameters (16B per queue)
» Port rates (4B per port)
» Free lists
» Batch buffers
Enqueue:
» 15 signals, 16 RD xfer, 10 WR xfer
Dequeue:
» 9 signals, QM uses 4 RD xfer, 1 WR xfer. SCH uses more xfers
Local Memory Map (JDD, 4/1/08)
[Figure: local memory map]
» 0x000: PAR Cache
» 0x100 - 0x1A7: Port Data
» 0x1A8 - 0x1FB: Batch Buf FL
» 0x1FC - 0x597: Batch Buffers (21 * 44 Bytes)
» 0x598: Free List (>= 40 * 4 Bytes)
» 0x680: Port Rate Control Data
» 0x690: Unallocated
» 0x9FF: end of the map
» residualResult written here (annotation in the figure)
Port Data Structure:
» 0: Old Tail LM
» 1: Old Tail SRAM
» 2: head SRAM
» 3: tail SRAM
» 4: tail offset (first empty slot)
» 5: nexthead LM
» 6: LM (head|tail)
» 7: unused
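The region sizes in the map can be checked with a short C sketch. The five base addresses come from the Resource Usage slide; PORT_RATE_LM_BASE, the two size constants, and the helper functions are hypothetical names introduced here for illustration.

```c
#include <stdint.h>

/* Base addresses as listed on the Resource Usage slide. */
#define PAR_CACHE_LM_BASE 0x000
#define PORT_DATA_LM_BASE 0x100
#define BBUF_FL_LM_BASE   0x1a8
#define BBUF_LM_BASE      0x1fc
#define FL_LM_BASE        0x598

/* Hypothetical name for the 0x680 region shown in the map. */
#define PORT_RATE_LM_BASE 0x680

/* Sizes taken from the map: 21 batch buffers of 44 bytes each. */
#define NUM_LM_BATCH_BUFFERS 21
#define BATCH_BUFFER_BYTES   44

/* Size in bytes of a region that runs from base up to the next base. */
uint32_t region_size(uint32_t base, uint32_t next_base)
{
    return next_base - base;
}

/* Local memory address of the i-th batch buffer, assuming the
 * buffers are packed back to back starting at BBUF_LM_BASE. */
uint32_t batch_buffer_addr(uint32_t i)
{
    return BBUF_LM_BASE + i * BATCH_BUFFER_BYTES;
}
```

Under these assumptions the batch buffer region is exactly 21 * 44 = 924 bytes, the batch buffer free list region is 84 bytes (one 4-byte entry per batch buffer), and the free list region has room for 58 four-byte entries before 0x680.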
Data Consistency Precautions
Only one thread (dequeue or enqueue) reads in the queue parameters of a queue
» Flags are used to ensure that when thread x is reading in the Q param
– thread y doesn’t read them
– Also, thread y waits until thread x stores the data read into cache
» Flags are stored in local memory
– Three flags are used (head valid, tail valid, and Q param valid)
– Head valid implies dequeue thread has cached the Q descriptor
– Tail valid implies enqueue thread has cached the Q descriptor
– Both valid means both head and tail are cached
Before a thread swaps out
» Move relevant register contents (flags, queue length) into the local memory
After a thread resumes
» Move relevant local memory data back to registers
Cache contents are refreshed after every 4k iterations
Port rates in registers are refreshed every 4k iterations
Initialization
Thread 0 initializes all shared data structures ???
» CAM and Q-array (cam_clear and Q-array empty)
» Memory controller variables
– Set SRAM Channel CSR to ignore cellcount and eop bit in the buffer handle
» Local memory
– Queue parameter cache (all zeroes)
– Scheduling data structures (set by scheduler)
» SRAM
– Queue parameters (length, weight quantum, discard threshold)
– Queue descriptors (all zeroes)
– Port rates (as per token bucket)
– Free list (set by free list macro)
– Scheduling data structure (set by scheduler)
Enqueue Thread
Operates in batch mode (5 packets at a time)
» Read 5 requests from the scratch ring
» Check CAM for the 5 queue ids read
» If miss
– Evict LRU entry (write back queue params and descr)
– Read queue params from SRAM into cache
– Read queue descriptor into Q-array
– Update CAM
» Check for discard
– If discard, call dl_drop_buf
» If admit
– Send enqueue command to Q-array
– Check if queue was already active
  If not, call add_queue_to_tail
– Update the queue length in cache
– Write back queue length (in future may want to do less often)
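The admit/discard step for one request can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the microcode itself: it assumes queue length and discard threshold are both in bytes, that a packet is discarded when it would push the length past the threshold, and that a queue counts as active when its length was nonzero before the enqueue. The names q_params_t, enq_result_t, and enqueue_decide are hypothetical.

```c
#include <stdint.h>

/* Cached per-queue parameters, as listed on the Data Structures slide. */
typedef struct {
    uint32_t length;            /* current queue length (assumed bytes) */
    uint32_t discard_threshold; /* assumed bytes */
    uint32_t weight_quantum;
} q_params_t;

typedef enum {
    ENQ_DISCARD,        /* caller would invoke dl_drop_buf            */
    ENQ_ADMIT,          /* queue was already active                   */
    ENQ_ADMIT_ACTIVATE  /* caller would invoke add_queue_to_tail      */
} enq_result_t;

/* Decide what to do with one packet and update the cached length.
 * The comparison rule is an assumption; the slides only say
 * "check for discard". */
enq_result_t enqueue_decide(q_params_t *q, uint32_t frame_length)
{
    if (q->length + frame_length > q->discard_threshold) {
        return ENQ_DISCARD;
    }
    int was_active = (q->length > 0);
    q->length += frame_length;
    return was_active ? ENQ_ADMIT : ENQ_ADMIT_ACTIVATE;
}
```

With a threshold of 0, every packet is discarded, which is the condition used in the threshold test on the validation slide.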
Dequeue Thread (per port)
One thread handles one port
» Done for the round if port rate $$tx_q_flow_control is set or port is inactive (port_active macro) or tokens are over
» If current batch is done, call get_head macro
» If batch buffer is non-empty then consider the first queue_id
– Check CAM for the queue_id
– If miss
  Evict LRU entry (write back queue params and descr)
  Read queue params from SRAM into cache
  Read queue descriptor into Q-array and Update CAM
– If Hit or after data is ready
  Send dequeue command to Q-array
  Call dl_sink_1ME_SCR_1words
– Read the pkt_length from buffer descriptor
– Update queue length (and write back) and the credit
  If credit <= 0 and queue_length > 0 then add_queue_to_tail
  If queue_length <= 0 OR credit <= 0 then incr. batch_index
  If batch_index = 5 OR queue_id = 0 then call advance_head
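The bookkeeping after a dequeue can be sketched in C. The two conditions are taken from the slide; the struct names and the rule that a re-queued queue receives its leftover credit plus one weight quantum are assumptions made for illustration (the usual deficit round robin convention), since the slide does not say how the credits argument of add_queue_to_tail is computed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue state held by the dequeue thread. */
typedef struct {
    int32_t queue_length;   /* assumed bytes */
    int32_t credit;         /* assumed bytes */
    int32_t weight_quantum; /* assumed bytes */
} deq_state_t;

/* Hypothetical summary of what the thread should do next. */
typedef struct {
    bool    requeue;        /* call add_queue_to_tail            */
    bool    next_slot;      /* increment batch_index             */
    int32_t requeue_credit; /* credit passed along when requeued */
} deq_action_t;

/* Apply one dequeued packet of pkt_length bytes and evaluate the
 * two conditions listed on the slide. */
deq_action_t after_dequeue(deq_state_t *q, int32_t pkt_length)
{
    deq_action_t a;
    q->queue_length -= pkt_length;
    q->credit       -= pkt_length;

    a.requeue   = (q->credit <= 0 && q->queue_length > 0);
    a.next_slot = (q->queue_length <= 0 || q->credit <= 0);
    /* Assumption: carry the deficit forward and add one quantum. */
    a.requeue_credit = a.requeue ? q->credit + q->weight_quantum : 0;
    return a;
}
```

The caller would then invoke advance_head once batch_index reaches 5 or the next queue_id is 0.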
Enqueue Thread
[Figure: enqueue flowchart with instruction counts]
» Read 15 words from scratch: 28 inst.
» For 5 q_ids, check CAM hit; if miss, write back LRU and read queue param/descriptor: 40/31 inst. per Q, 202/157 inst. total
» Admit?
– If discard: dl_drop_buf()
– If admit: enqueue / update Q params; Active? If not, add_queue_to_tail() (x instr); write back the queue length
– Per packet: 41 if discard; if admit: 62 + add_q_2_tail
– Total: 205 / 310+5x
» + 6 inst. for signals
» Loop around
For all 5 requests:
» Worst case: 545+5x
» All discard: 395
» All accept/hit: 500+5x
» x = 18-49
Memory accesses shown: 2x5 Writes (1, 3 words); 2x5 Reads (3, 2 words); 2x5 Writes (1, 1 word); SCH reads
Dequeue Thread (per port)
[Figure: dequeue flowchart with instruction counts]
» Rate_control: 27 inst.
» If curr_queue = 0, get_head(): 27 inst.
» Check CAM, evict, load: 32/44 inst.
» Update cache, dequeue: 34 inst.
» Send tx_msg, read pkt_len: 24+ inst.
» Update credit/q_len, Wr q_len: 13 inst.
» add_queue_to_tail(), advance_head()
– Adv_head: 35-63 inst
– Add_queu..: 18-49 inst
– Overheads: 13 inst
» Write_old_tail and loop around
Worst case: 320
Best: 170
Memory accesses shown: 1 Read (once / 16K cycles); 2 Writes (1, 3 words); 2 Reads (3, 2 words); 1 Read; 1 Read; 1 Write
Dequeue Rate Control (Updated by JDD)
Token bucket
» The unit of port_rate is bytes per 4096 clocks (ME clock/16 MHz).
» curr_time counts in units of 16 clocks (ME clock/16 MHz).
» last_time is the time when the last packet was sent.
» IF PORT IS INACTIVE THEN tokens = 4095
» ELSE IF (tokens = 4095)
– SEND PACKET
– last_time := curr_time
– tokens = tokens – pkt_length
» ELSE
– result = ((curr_time – last_time) x port_rate) + residualResult // 16 x 16 multiply
– residualResult = (result << 22) >> 22 // save bits shifted out to add back in next time
– tokens = min [ 4095, tokens + (result >> 10) ]
– IF (tokens > 0)
  SEND PACKET
  last_time := curr_time
  tokens = tokens – pkt_length
Port rates
» Must be specified in LSB 16-bits
» 1 unit = 683 Kbps
» Max port rate = 64K = 44.8 Gbps
Port rate word: Reserved (16b) | Port rate (16b)
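The token bucket steps above can be sketched in C. This is a sketch under stated assumptions rather than the microcode: curr_time and last_time are modeled as 16-bit tick counters (suggested by the 16 x 16 multiply), port_rc_t and rate_control are hypothetical names, and the sketch advances last_time whenever tokens are credited so that elapsed time is never counted twice, whereas the slide shows last_time updated only on send.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOKEN_MAX     4095
#define RESIDUAL_BITS 10    /* matches the ">> 10" on the slide */

/* Hypothetical per-port rate control state. */
typedef struct {
    int32_t  tokens;     /* may go negative after a send          */
    uint16_t last_time;  /* in 16-clock ticks (assumed 16-bit)    */
    uint32_t residual;   /* low 10 bits carried to the next call  */
    uint16_t port_rate;  /* LSB 16 bits of the port rate word     */
} port_rc_t;

/* Returns true if a packet of pkt_length bytes is sent now. */
bool rate_control(port_rc_t *p, bool port_active,
                  uint16_t curr_time, int32_t pkt_length)
{
    if (!port_active) {
        p->tokens = TOKEN_MAX;
        return false;
    }
    if (p->tokens == TOKEN_MAX) {
        p->last_time = curr_time;
        p->tokens -= pkt_length;
        return true;
    }
    /* 16 x 16 multiply; the sum still fits in 32 bits. */
    uint16_t elapsed = (uint16_t)(curr_time - p->last_time);
    uint32_t result  = (uint32_t)elapsed * p->port_rate + p->residual;

    /* Keep the bits that the shift below would discard. */
    p->residual = result & ((1u << RESIDUAL_BITS) - 1u);

    int32_t refilled = p->tokens + (int32_t)(result >> RESIDUAL_BITS);
    p->tokens = (refilled > TOKEN_MAX) ? TOKEN_MAX : refilled;

    /* Assumption: advance last_time on every credit, not only on send. */
    p->last_time = curr_time;

    if (p->tokens > 0) {
        p->tokens -= pkt_length;
        return true;
    }
    return false;
}
```

With the >> 10 scaling, one rate unit works out to one byte per 1024 ticks of 16 clocks. At a 1.4 GHz ME clock that is 8 x 1.4e9 / 16384, or about 683 Kbps, and 65535 units come to about 44.8 Gbps, which matches the port rate figures on the slide.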
Performance Analysis
Dequeue thread runs much longer than the enqueue thread
» Dequeue
– 1273 cycles in case of a cache miss and add_queue_to_tail() and advance_head()
– 867 cycles in case of cache hit and no scheduler calls
» Enqueue
– 876 cycles in case of all 5 cache misses
– 342 cycles in case of a single enqueue and cache hit
Dequeue takes more time due to memory accesses
» Read Queue_param: 110 cycles
» Dequeue: 120 cycles
» Read pkt_len: 110 cycles
There are a few idle cycles at present
» Can be removed by giving higher priority to dequeue threads
File locations (in …/IPv4_MR/)
Code
» src/qm/PL/common_macros.uc
» src/qm/PL/dequeue.uc
» src/qm/PL/enqueue.uc
» src/qm/PL/fl_macros.uc
» src/qm/PL/qm.h
» src/qm/PL/qm.uc
» src/qm/PL/sched_macros.h
Includes
» ../dispatch_loop/dl_source_WU.uc
– dl_buf_drop() and dl_sink_1ME_SCR_1words() functions
» Also uses local memory read and write macros (localmem.uc)
Queue Manager Validation
Tested
» Threshold length discards (set length at 0, and tested if packets are enqueued)
» Enqueue
– Single port, single queue active
– Multiple ports/queues active
– Cache hit/miss (not all scenarios are tested)
» Dequeue
– Rate control partially tested (set the port rate at 0, and see if packets are dequeued)
– Partial fairness test (set quantum at 0, and see if packets are dequeued)
– Multiple active ports/queues
Both queue managers enabled
» There is one bug concerning the Q-array contention
Cycle Budget
» 76B packet
» 1.4 GHz clock rate
» 1.4 Gcycle/sec
» 5 Gbps => 170 cycles per packet
– Dequeue worst-case = 320 inst. (best case 170 inst.)
– Enqueue worst-case = 545 + 5x inst. for 5 packets
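The budget arithmetic can be reproduced with a few lines of C. The function names are hypothetical; the inputs (76-byte packets, 1.4 GHz clock, 5 Gbps line rate, and the 545 + 5x enqueue formula with x between 18 and 49) are the figures from the slides.

```c
#include <stdint.h>

/* Cycles available per packet at a given line rate:
 * clock_hz * packet_bits / line_rate_bps, rounded down. */
uint64_t cycles_per_packet(uint64_t clock_hz, uint32_t pkt_bytes,
                           uint64_t line_rate_bps)
{
    return (clock_hz * pkt_bytes * 8u) / line_rate_bps;
}

/* Enqueue worst case for a batch of 5 packets, where x is the
 * add_queue_to_tail() instruction count (18 to 49 per the slides). */
uint32_t enqueue_worst_case_inst(uint32_t x)
{
    return 545u + 5u * x;
}
```

A 76-byte packet is 608 bits, so at 5 Gbps one packet arrives every 121.6 ns, which is about 170 cycles at 1.4 GHz. The enqueue worst case for a batch of five ranges from 635 to 790 instructions as x goes from 18 to 49.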
Scheduling Structure Overview
[Figure: scheduling structure]
» Per port (Port 0 ... Port 4): Head, Next Head, and Tail pointers to batch buffers
» Batch buffer contents: Queue 0 ... Queue 4, Credits 0 ... Credits 4, SRAM Next Pointer
» Batch buffers are also held in SRAM
» Free List (for SRAM Batch Buffers): stack in local memory, stack in SRAM
» Batch Buffer Free List (for LM Batch Buffers): stack in local memory
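A batch buffer as drawn in the figure can be modeled in C. Five queue ids, five credit values, and one next pointer at 4 bytes each give 44 bytes, which agrees with the 21 * 44 bytes reserved in the local memory map. The field order, the signedness of the credits, and the names batch_buffer_t, batch_buffer_clear, and batch_buffer_append are assumptions; a queue id of 0 is treated as an empty slot, as the dequeue slide suggests.

```c
#include <stdint.h>
#include <string.h>

#define SLOTS_PER_BATCH 5

/* Hypothetical layout: the figure lists the fields but not their order. */
typedef struct {
    uint32_t queue[SLOTS_PER_BATCH];   /* queue ids, 0 = empty slot */
    int32_t  credits[SLOTS_PER_BATCH]; /* per-slot credits          */
    uint32_t sram_next;                /* SRAM next pointer         */
} batch_buffer_t;

_Static_assert(sizeof(batch_buffer_t) == 44,
               "batch buffer should occupy 44 bytes");

/* Mark every slot empty and clear the next pointer. */
void batch_buffer_clear(batch_buffer_t *bb)
{
    memset(bb, 0, sizeof *bb);
}

/* Store one (queue, credits) pair at *tail_offset, the first empty
 * slot, and advance it. Returns 1 on success, 0 if the buffer is full. */
int batch_buffer_append(batch_buffer_t *bb, uint32_t *tail_offset,
                        uint32_t queue, int32_t credits)
{
    if (*tail_offset >= SLOTS_PER_BATCH) {
        return 0;
    }
    bb->queue[*tail_offset]   = queue;
    bb->credits[*tail_offset] = credits;
    (*tail_offset)++;
    return 1;
}
```

When the append fails, a real implementation would take a new batch buffer from a free list and link it in; that step is omitted here.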
Scheduling Structure Interface
Scheduling structure macros contained in \src\qm\PL\sched_macros.uc
» add_queue_to_tail(queue, credits, port)
» get_head(port, head_ptr)
» advance_head(port, sig_a, sig_b)
» port_active(port, label)
» write_old_tail(port, sig_a, sig_b)
Free list macro contained in \src\qm\PL\fl_macros.uc
» maintain_fl()