Lecture 11, Rocket Testing CS250, UC Berkeley, Fall 2011
CS250 VLSI Systems Design
Lecture 11: Patterns for Communication Links, Rocket µArchitecture, Testing
John Wawrzynek, Krste Asanovic, with John Lazzaro and Brian Zimmer (TA)
UC Berkeley, Fall 2011
Interconnect Design Patterns
Implementing Communication Queues
A queue can be implemented as a centralized FIFO with a single control FSM if both ends are close to each other and directly connected.
In large designs, there may be several cycles of communication latency from one end to the other. This introduces delay both in forward data propagation and in reverse flow control.
Control is then split into send and receive portions. A credit-based flow control scheme is often used to tell the sender how many units of data it can send before overflowing the receiver's buffer.
End-End Credit-Based Flow Control
For a one-way latency of N cycles, the receiver needs 2*N buffers to sustain full bandwidth
– It takes at least 2N cycles before the sender can be informed whether the first unit sent was consumed by the receiver
If the receive buffer fills up and stalls communication, it takes N cycles for the first credit to flow back to the sender and restart the flow, then another N cycles for the next value to arrive from the sender
– Meanwhile, the receiver can work from the 2*N buffered values
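The 2*N rule can be checked with a small cycle-level model. The sketch below is only an illustration under stated assumptions: the function name `simulate_credit_link`, the one-unit-per-cycle link, and a receiver that drains one unit every cycle are all hypothetical choices, not part of the lecture. The sender starts with one credit per receiver buffer, and both data and credits take `latency` cycles to cross the link.

```python
from collections import deque


def simulate_credit_link(latency, buffers, cycles=4000):
    """Return sustained throughput (units per cycle) of a credit-based link."""
    data_wire = deque([0] * latency)    # units in flight, sender -> receiver
    credit_wire = deque([0] * latency)  # credits in flight, receiver -> sender
    credits = buffers                   # one credit per receiver buffer entry
    occupancy = 0                       # units held in the receiver buffer
    consumed = 0
    warmup = 4 * latency                # skip the start-up transient

    for cycle in range(warmup + cycles):
        occupancy += data_wire.popleft()
        credits += credit_wire.popleft()
        assert occupancy <= buffers, "receiver buffer overflowed"

        # Receiver drains one unit per cycle and returns a credit for it.
        if occupancy > 0:
            occupancy -= 1
            credit_wire.append(1)
            if cycle >= warmup:
                consumed += 1
        else:
            credit_wire.append(0)

        # Sender transmits only while it holds a credit.
        if credits > 0:
            credits -= 1
            data_wire.append(1)
        else:
            data_wire.append(0)

    return consumed / cycles


if __name__ == "__main__":
    for buffers in (2, 4, 8):
        print(buffers, simulate_credit_link(latency=4, buffers=buffers))
```

In this model, with N = 4 the link only reaches one unit per cycle once the receiver has 2*N = 8 buffers; with fewer buffers the sender runs out of credits and idles while waiting for the round trip.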
Distributed Flow Control
An alternative to end-end control is distributed flow control (chain of FIFOs)
Requires less storage, as communication flops reused as buffers, but needs more distributed control circuitry
– Lots of small buffers also less efficient than single larger buffer
Sometimes not possible to insert logic into communication path
– e.g., wave-pipelined multi-cycle wiring path, or photonic link
Network Patterns
Connects multiple units using shared resources
Bus: low-cost, ordered
Crossbar: high-performance
Multi-stage network: trades cost against performance
Buses
Buses were popular board-level option for implementing communication as they saved pins and wires
Less attractive on-chip as wires are plentiful and buses are slow and cumbersome with central control
Often used on-chip when shrinking existing legacy system design onto single chip
Newer designs moving to either dedicated point-point unit communications or an on-chip network
On-Chip Network
On-chip network multiplexes long-range wires to reduce cost
Routers use distributed flow control to transmit packets
Units usually need end-end credit flow control in addition because intermediate buffering in network is shared by all units
Rocket µArchitecture
UCB Rocket: An In-Order RISC-V Decoupled µArchitecture
A family of µarchitectures supporting hardware floating-point, demand-paged virtual memory
In-order single or dual-issue, decoupled floating-point unit, precise traps
32-bit or 64-bit implementations
From 5-stage to ~9-stage pipelines
Designed to be close to commercial cores
Single-issue version will be made available in time for class projects
George Stephenson’s Rocket
“The Rocket was the most advanced steam engine of its day. It was built for the Rainhill Trials held by the Liverpool & Manchester Railway in 1829 to choose the best and most competent design. It set the standard for a hundred and fifty years of steam locomotive power. Though the Rocket was not the first steam locomotive, its claim to fame is that it was the first to bring together several innovations to produce the most advanced locomotive of its day, and the template for most steam locomotives since.” [Wikipedia]
[AllyJane, LensFlare]
A Simple Core?
[Figure: block diagram of the Rocket pipeline, showing the Fetch, Decode, Execute, Memory, and Commit stages of the integer core, the decoupled floating-point unit with its command and data queues, the I$ and D$ with their TLBs, the page table walker (PTW), HTIF queues, and the TileLink memory interface.]
Rocket Pipeline Structure
Four major phases of execution
Instruction fetch: get instruction bits from I-cache
Decode, including operand fetch and issue: read register file, determine interlocks and bypass control
Execution: perform instruction
Commit: if no traps or interrupts, write architectural state
Each phase can contain multiple pipeline stages, but there is approximately one stage each in the initial design.
Rocket 5-Stage Pipeline Structure
Integer pipeline: P, F, D, X, M, C
P: Generate next PC
F: Fetch instruction
D: Decode, operand fetch, issue
X: Execute integer ALU
M: Data cache
C: Commit
Floating-point pipeline: FD, FX1, FX2, FX3, FW
FD: FP decode, operand fetch, issue
FX1-FX3: FP execute stages
FW: FP register write
P is a pseudo-stage, as its contents are spread over many stages
FPU decoupling queue is placed at the commit point in the pipeline
CS250, UC Berkeley, Fall 2011Lecture 11, Rocket Testing
PC Generation
Next PC can come from a number of sources:
PC+4 if sequential fetch (predicted not-taken)
Predicted branch address (if predicted taken)
Resolved branch address (if mispredicted)
Replay PC (if pipeline flush/replay from either X or M stage)
Trap handler address (on trap/interrupt)
Restore PC (at end of trap/interrupt handler)
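The next-PC choice behaves like a priority multiplexer over these sources. The sketch below is a hedged illustration only: the slide lists the sources but not their priority, so the ordering used here (trap first, then handler return, replay, mispredict, prediction, and finally PC+4) is an assumption, as are the function and parameter names.

```python
def next_pc(pc, *, trap=False, handler_pc=0, eret=False, restore_pc=0,
            replay=False, replay_pc=0, mispredict=False, branch_addr=0,
            predict_taken=False, predict_addr=0):
    """Select the next PC; the priority order is assumed, not from the slides."""
    if trap:            # trap or interrupt: jump to the handler
        return handler_pc
    if eret:            # end of handler: restore the saved PC
        return restore_pc
    if replay:          # pipeline flush/replay from the X or M stage
        return replay_pc
    if mispredict:      # branch resolved against the prediction
        return branch_addr
    if predict_taken:   # predicted-taken branch
        return predict_addr
    return pc + 4       # sequential fetch


if __name__ == "__main__":
    print(hex(next_pc(0x100)))
    print(hex(next_pc(0x100, predict_taken=True, predict_addr=0x200)))
```

Events from older instructions (later pipeline stages) are checked first here, so that a redirect caused by an older instruction overrides a prediction made for a younger one.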
Implementing Precise Traps
Handle traps in program order at end of memory stage (the commit point)
A synchronous trap can be generated in any stage; it is held in Error PC (EPC) and Cause registers that are shifted down the pipeline
EPC always holds the PC of the instruction in that stage
Asynchronous interrupts are handled in the memory stage
Trap/interrupt flushes the pipe and resets the PC to the handler address
RISC-V floating-point ISA is designed to have no traps (only exception flags), so the commit point is before FPU decode
Fetch Stage
Predict next PC from current PC using BTB - fed back to P stage
Fetch instructions from cache into instruction queue
Translate virtual address PC into physical address PC for I-cache physical tag check, check for illegal PC -> signal trap
I-cache miss goes to memory system, I-stream prefetcher fetches sequential blocks ahead of miss
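As a rough illustration of how a BTB turns the current PC into a next-PC prediction, here is a minimal sketch. It assumes a direct-mapped table with full-PC tags and 4-byte instructions; the class name, the table size, and the organization are hypothetical, since the slides do not describe Rocket's actual BTB structure.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: maps a branch PC to its last taken target."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = {}  # index -> (tag PC, predicted target)

    def _index(self, pc):
        # Drop the two low bits (4-byte instructions), then wrap to table size.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return the predicted next PC, falling back to PC+4 on a miss."""
        entry = self.table.get(self._index(pc))
        if entry is not None and entry[0] == pc:
            return entry[1]
        return pc + 4

    def update(self, pc, target):
        """Record a taken branch once it has been resolved."""
        self.table[self._index(pc)] = (pc, target)


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(hex(btb.predict(0x100)))   # miss: sequential
    btb.update(0x100, 0x40)
    print(hex(btb.predict(0x100)))   # hit: predicted target
```

A miss simply predicts sequential fetch, which corresponds to the "predicted not-taken" case.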
Decode Stage
Decode instructions from queue, check for illegal ops -> signal trap
Fetch register operands and sign-extend immediate
Check for unavailable source operands using scoreboard (busy bit per register), stall decode if not available
Set busy bit for long latency operations
Calculate bypass control and mux bypass operands into ALU inputs
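The scoreboard check described above can be sketched as one busy bit per register. This is a simplified model under stated assumptions: the class and method names are hypothetical, only source operands are checked (as on the slide), and register x0 is treated as never busy because it is hardwired to zero in RISC-V.

```python
class Scoreboard:
    """One busy bit per register, set for long-latency results in flight."""

    def __init__(self, num_regs=32):
        self.busy = [False] * num_regs

    def must_stall(self, sources):
        """Decode stalls if any source register still has a pending result."""
        return any(self.busy[r] for r in sources)

    def set_busy(self, dest):
        """Mark the destination of a long-latency operation (e.g. divide)."""
        if dest != 0:  # x0 is hardwired to zero, so it never waits on a result
            self.busy[dest] = True

    def clear(self, dest):
        """Called when the result is written back."""
        self.busy[dest] = False


if __name__ == "__main__":
    sb = Scoreboard()
    sb.set_busy(5)                  # e.g. a divide writing x5
    print(sb.must_stall([5, 6]))    # dependent instruction must wait
    sb.clear(5)
    print(sb.must_stall([5, 6]))
```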
Integer Execute Stage
Most integer instructions complete in one cycle and can be bypassed to next instruction
Integer multiply takes a few cycles, overlapped with the memory stage
Integer divide takes many cycles - so sets busy bit on destination register
Branches resolved in ALU - mispredict detected by comparing target address with EPC in following instruction (was correct path taken?)
ALU calculates load+store addresses, integer store data placed in store data queue (ISDQ)
Memory Stage
Virtual load/store address translated and checked -> illegal address trap
Store addresses are always queued in order in the SAQ to wait for data in the ISDQ or FSDQ (from the FPU). Stores go ahead when both address and data are available.
Loads can bypass stores if no conflict, but replayed if conflict with address in SAQ.
Non-blocking cache supports multiple outstanding primary and secondary misses.
Flushes pipe and injects handler PC if any traps or interrupts.
End of memory stage is commit point - FPU operations enqueued if no traps.
FPU load instruction enqueues command to read FPU load data queue
FPU store instruction enqueues command to send FP register value to FSDQ
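The load/store conflict check against the SAQ can be sketched as follows. This is an illustrative model only: the class name, the method names, and the 8-byte comparison granularity are assumptions, since the slides do not say at what granularity Rocket compares addresses.

```python
from collections import deque


class StoreAddressQueue:
    """In-order queue of store addresses still waiting for their data."""

    def __init__(self):
        self.pending = deque()

    def enqueue_store(self, addr):
        self.pending.append(addr)

    def retire_oldest(self):
        """Oldest store leaves once its data has arrived and it is written."""
        return self.pending.popleft()

    def load_conflicts(self, addr, granule=8):
        """True if a load overlaps a pending store at the assumed granularity."""
        return any(a // granule == addr // granule for a in self.pending)


if __name__ == "__main__":
    saq = StoreAddressQueue()
    saq.enqueue_store(0x1000)
    print(saq.load_conflicts(0x1004))  # same 8-byte granule: replay the load
    print(saq.load_conflicts(0x2000))  # no conflict: load may bypass the store
```

A load that conflicts would be replayed until the matching store has left the queue; a load with no conflict may go ahead of the waiting stores.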
Commit Stage
Architectural registers written with final values
Busy bits on scoreboard cleared as results arrive
Data cache finishes aligning and sign-extending small width values. Rocket only bypasses 32-bit and 64-bit values from end of memory stage; other sizes of load operands are bypassed from end of commit stage.
FPU begins decoding instructions from FPU queue.
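The alignment and sign-extension step for narrow loads can be illustrated with a short function. This is a sketch under assumptions: the function name is hypothetical, accesses are taken to be naturally aligned within a 64-bit word, and byte order is little-endian as in RISC-V.

```python
def extract_load(data64, addr, size, signed):
    """Pick `size` bytes at `addr` out of a 64-bit word and extend the result."""
    shift = (addr % 8) * 8              # byte offset within the 64-bit word
    bits = size * 8
    value = (data64 >> shift) & ((1 << bits) - 1)
    if signed and (value >> (bits - 1)) & 1:
        value -= 1 << bits              # sign-extend: top bit set means negative
    return value


if __name__ == "__main__":
    word = 0x1122334455667788
    print(extract_load(word, 0, 1, signed=True))    # byte 0x88 as signed
    print(extract_load(word, 0, 1, signed=False))   # byte 0x88 as unsigned
    print(hex(extract_load(word, 2, 2, signed=True)))
```

The shift and mask correspond to alignment, and the final subtraction corresponds to sign extension; this extra work is why narrow loads are a natural candidate for a later bypass point than full-width ones.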
FPU built around a fused multiply-add unit (2008 revision of IEEE 754 FP standard) with full hardware support for all cases including subnormals.
Regfile holds values in an internal recoded format with an extra bit to simplify handling of subnormals. Have to convert on load/store and move to/from integer.
Design Verification
Verification large part of NRE cost
Too expensive to respin part
– prototype cost in $Ms
– lost time-to-market in $10Ms
2-3X engineer time on verification versus design
Only getting worse over time as chips get larger and more complex
Types of Checks
Design verification: “Does RTL design implement the functional specification?”
Tool/implementation checking: “Does design layout match RTL design?”
Physical design checking: “Does design work across all process corners, obey all the electrical design rules (antenna rules, electromigration, ...), is power/clock/reset distribution OK, does design meet design-for-X rules (X = test, manufacturing, reliability, ...)?”
Manufacturing testing: “Does a fabricated chip implement the design to specification?”
Design Verification Greatest Challenge
Tool/implementation checking mostly automated using static formal verification checks (though finding and fixing an error can be labor-intensive)
Same for EDRC rules and other physical design checks
Manufacturing tests can be automatically generated from RTL if scan chains used for all state elements (automatic test pattern generation - ATPG)
Source of Bugs in RTL Design
Specification incorrect
Designers built an implementation faithful to the specification, but the specification was wrong.
Specification misread
Designers built an implementation faithful to their reading of the specification, but they misunderstood the specification.
Incorrect RTL design
The RTL design does not do what the designer wanted it to.
Incorrect RTL coding
The RTL design was correct in the designer’s head, but the RTL code doesn’t match that RTL design.
Avoiding Incorrect Specification
Build an executable version of the specification, which should be a simple functional model of the intended design
For RISC-V cores, we have a C++ instruction set interpreter, requiring only a few lines of code for each instruction.
Exercise executable specification inside system-level test harness with representative workload
For RISC-V, we have built a test harness that can run programs on the simulator. Classic test for processors was booting Unix on the functional model.
SystemC is common in industry for this level of modeling, where the entire system is modeled sufficiently to run the whole software stack. FPGA emulation is popular to accelerate the model.
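To give a feel for what "a few lines of code for each instruction" looks like in an executable specification, here is a toy interpreter. It is not the course's C++ interpreter and does not decode real RISC-V encodings: the tuple instruction format, the three-instruction subset, and the instruction-granular branch offsets are all simplifying assumptions made for this sketch.

```python
MASK64 = (1 << 64) - 1


def run(program, max_steps=1000):
    """Interpret a list of (op, a, b, c) tuples and return the register file."""
    regs = [0] * 32
    pc = 0                                    # index into `program`
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            break                             # ran off the end: stop
        op, a, b, c = program[pc]
        next_pc = pc + 1
        if op == "addi":                      # regs[a] = regs[b] + immediate c
            regs[a] = (regs[b] + c) & MASK64
        elif op == "add":                     # regs[a] = regs[b] + regs[c]
            regs[a] = (regs[b] + regs[c]) & MASK64
        elif op == "bne":                     # branch by c instructions if a != b
            if regs[a] != regs[b]:
                next_pc = pc + c
        else:
            raise ValueError(f"illegal instruction: {op}")
        regs[0] = 0                           # x0 always reads as zero
        pc = next_pc
    return regs


if __name__ == "__main__":
    sum_to_five = [
        ("addi", 1, 0, 5),    # x1 = 5 (loop counter)
        ("addi", 2, 0, 0),    # x2 = 0 (running sum)
        ("add", 2, 2, 1),     # x2 += x1
        ("addi", 1, 1, -1),   # x1 -= 1
        ("bne", 1, 0, -2),    # loop back to the add while x1 != 0
    ]
    print(run(sum_to_five)[2])
```

Each instruction costs two or three lines, which keeps the model simple enough to review directly against the written specification.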
Avoiding Misread Specification
Have executable specification as “golden model”
Have different designers write executable specification and system test code to catch misread specification when building golden model
If errors found, don’t just fix model, also rewrite specification to make it less ambiguous or more readable.
Perform extensive directed and random testing to compare RTL design with golden model
Catching Bad RTL Design or Coding
Perform extensive directed and random testing to compare RTL design with golden model
A modern processor design team will perform many billions of cycles of RTL simulation using tens of thousands of cores prior to tapeout
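Random testing against a golden model can be sketched on a much smaller design than a processor. In the hypothetical example below, a bit-level ripple-carry adder stands in for the RTL under test, a one-line arithmetic function is the golden model, and `buggy_add` contains a deliberately injected bug (the carry between the two halves is dropped). All names and the choice of an adder are assumptions made for illustration.

```python
import random


def golden_add(a, b, width):
    """Golden model: the simplest possible statement of the intended behavior."""
    return (a + b) & ((1 << width) - 1)


def ripple_add(a, b, width):
    """Bit-level ripple-carry adder standing in for the design under test."""
    result, carry = 0, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result


def buggy_add(a, b, width):
    """Injected bug: the carry from the low half into the high half is lost."""
    half = width // 2
    low_mask = (1 << half) - 1
    low = ((a & low_mask) + (b & low_mask)) & low_mask
    high = ((a >> half) + (b >> half)) & ((1 << (width - half)) - 1)
    return (high << half) | low


def first_mismatch(dut, golden, width=8, trials=2000, seed=0):
    """Return the first random input pair where DUT and golden model disagree."""
    rng = random.Random(seed)   # fixed seed so a failure can be reproduced
    for _ in range(trials):
        a, b = rng.getrandbits(width), rng.getrandbits(width)
        if dut(a, b, width) != golden(a, b, width):
            return a, b
    return None


if __name__ == "__main__":
    print(first_mismatch(ripple_add, golden_add))
    print(first_mismatch(buggy_add, golden_add))
```

Fixing the seed makes any mismatch reproducible, so the failing input pair can be turned into a directed test once the bug is understood.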
When are you done?
But did you find all bugs, or reach limits of your test coverage?
[Figure: bug rate (bugs found per minute of testing) plotted against time.]
Test Coverage
Did every bit toggle?
Was every value on every bus?
Was every state machine transition taken?
Could your tests observe this happening?
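The first of these questions can be approximated with a simple measurement over a simulation trace. The sketch below is a weaker stand-in for real toggle coverage: it only checks that each bit of a signal was observed at both 0 and 1, not that both transitions occurred, and it says nothing about observability. The function name and the trace format (a list of sampled integer values) are assumptions.

```python
def toggle_coverage(samples, width):
    """Fraction of bits in a `width`-bit signal seen at both 0 and 1."""
    mask = (1 << width) - 1
    seen_one = 0
    seen_zero = 0
    for value in samples:
        seen_one |= value & mask     # bits observed high at least once
        seen_zero |= ~value & mask   # bits observed low at least once
    both = seen_one & seen_zero
    return bin(both).count("1") / width


if __name__ == "__main__":
    print(toggle_coverage([0b0000, 0b0101], width=4))
    print(toggle_coverage([0b0000, 0b1111], width=4))
```

A result below 1.0 points at bits the tests never exercised, which is a cheap hint about where directed tests are still missing.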
Unit Testing
Divide and Conquer
Tradeoff between cost of defining unit boundary and improved test visibility and coverage
Typical granularity of test units in processor:
Floating-point functional units
Caches
Integer core
Whole processor