EE 215B
Overview
• Reading
  – Papers
• Overview
  – An extreme of “SRAM” design is the register file. Register files are small SRAMs that are used heavily by the datapath. They serve as very local storage that is fast to access, and they often have multiple ports for simultaneous access by a number of functional units/ALUs.
  – These design parameters lead to very different cell designs and performance targets. This set of notes reviews the basic concepts and shows an example of such a design.
Outline
• Architecture
  – What is a register file
  – 2 basic approaches
• Design Example
What Is a Register File
• Fastest memory block available to the microprocessor.
• Stores intermediate results of microprocessor units such as the ALU and MMU.
• Processor performance depends directly on its access speed.
Architecture: Multi-ported Design
• At least 1 write port and 2 read ports
  – Accommodate a single ALU with 2-operand instructions.
  – r3 <= r2 + r1
• Superscalar designs
  – Multiple functional units access the register file.
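The port count on the multi-ported design slide can be illustrated with a small behavioral model. This is only a sketch: the class name `RegisterFile`, the 32-entry size, and the 64-bit width are assumptions for illustration, not details from the slides. Each call to `cycle` uses two read ports and at most one write port, which is what a single ALU needs for r3 <= r2 + r1.

```python
class RegisterFile:
    """Behavioral sketch of a register file with 2 read ports and 1 write port."""

    def __init__(self, num_regs=32, width=64):
        # num_regs and width are illustrative assumptions.
        self.mask = (1 << width) - 1
        self.regs = [0] * num_regs

    def cycle(self, read_a, read_b, write_addr=None, write_data=0):
        """One access: both read ports sample first, then the write port updates."""
        a = self.regs[read_a]
        b = self.regs[read_b]
        if write_addr is not None:
            self.regs[write_addr] = write_data & self.mask
        return a, b


rf = RegisterFile()
rf.cycle(0, 0, write_addr=1, write_data=5)    # r1 = 5
rf.cycle(0, 0, write_addr=2, write_data=7)    # r2 = 7
a, b = rf.cycle(2, 1)                         # two read ports fetch r2 and r1
rf.cycle(0, 0, write_addr=3, write_data=a + b)  # one write port stores r3
r3 = rf.regs[3]
```

A superscalar design would need this set of ports once per functional unit, which is why port counts grow so quickly.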
Example: 3-ported Cell
• Separate read/write bitlines
  – Single-port reads
  – Dual-port write
• Enable different design constraints
  – Cell sizing
  – Different pre-charge of the read-port
Architecture: Multi-banking
• Multi-porting has a large cost in peripheral circuits.
  – Replicate memory into many banks
• Homogeneous: even division into a number of banks.
  – Faster access to each bank.
  – Smaller register size
  – More MUXing circuitry
Heterogeneous Multi-banking
• Dividing the ports and registers unevenly among the banks.
  – Smaller bank for the critical data
  – Bigger bank for the non-critical data
• Prediction of critical data based on an algorithm similar to cache prediction.
Outline
• Architecture
• Design Example
  – Itanium register file
Itanium 2 Integer Register File
• 6 ALUs share 144 x 65-bit, 22-ported general registers
• 128 GRs + 16 kernel registers aliased to R16-31
• 64 datapath bits plus parity
• 12 read ports and 10 write ports (8 active, 2 inactive)
• Active and inactive writes can occur simultaneously
• Datapath bypassing on write ports between multi-media (MMU) and integer execution units (IEU)
[Figure: integer register file layout, 1.37 mm x 1.00 mm, with MMU and IEU labels. (FetzerISSCC05)]
Integer RF Structure
[Figure: integer register file floorplan. Block labels: Decode, Data Array, Parity State Machine, Address Repeater, Global Precharger, Bitline Repeater, Address Driver. (FetzerISSCC05)]
Floating Point Register File
• 128 x 82-bit, 18-ported general registers
• 8 read ports
  – 6 MAC data ports, 2 store data ports
• 10 write ports (6 active, 4 inactive)
  – 2 MAC result ports, 4 load data ports
[Figure: floating point register file layout, 1.11 mm x 1.14 mm, with two MAC labels. (FetzerISSCC05)]
Floating Point RF Structure
[Figure: floating point register file floorplan. Block labels: Decode, Data Array, Parity State Machine, Address Repeater, Bitline Repeater/Global Precharger, Address Driver. (FetzerISSCC05)]
Register File Timing
[Figure: timing diagram over CK Phase 1 and CK Phase 2. READ labels: Read Addr Decode, Read Local Bitline Evaluate, Read Global Bitline Evaluate, Read Local Precharge, Read Global Precharge. WRITE labels: Write Addr Decode, Write Bitline Pre-discharge, Write Bit Line, Register Write. Also labeled: Data Bypass. (FetzerISSCC05)]
Write Following Reads
• Reading a register that is being written occurs very often
• Itanium solution
  – Each register file access contains a READ followed by a WRITE.
  – No contention; the READ result can be used a half-cycle early.
• Another common solution
  – Write bypass:
    • WRITE while READ results in a slow read since the cell is being flipped.
    • Bypass the READ with the WRITE information at the multiplexer.
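The write-bypass alternative from the Write Following Reads slide can be sketched behaviorally. The function name `read_with_bypass` and its arguments are hypothetical; the sketch models only the multiplexer decision, not the circuit timing, and it is not the Itanium read-then-write scheme.

```python
def read_with_bypass(regs, read_addr, write_addr=None, write_data=None):
    """Value seen by a read port when a write may target the same register.

    If the read and write addresses match, the output multiplexer selects the
    incoming write data instead of waiting for the cell to finish flipping.
    """
    if write_addr is not None and write_addr == read_addr:
        return write_data      # bypass path: forward the write data
    return regs[read_addr]     # normal path: value read from the array


regs = [0, 11, 22, 33]
same_register = read_with_bypass(regs, 2, write_addr=2, write_data=99)
other_register = read_with_bypass(regs, 1, write_addr=2, write_data=99)
no_write = read_with_bypass(regs, 3)
```

The cost of this approach is an address comparator and an extra multiplexer input per read port.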
Register File Decode
• Wordline (en) is pulsed
  – PCK2X pulses each phase
  – Read followed by write
• WRITEH is generated for the accessed register
[Figure: decode circuit for one read/write port with self-timed pulse width control. Signal labels: address, sel[9:0], sel[i], highb, lowb, matchb, PCK2, NCK, en, writeen, WRITEH, timer_enable, wordline. (FetzerISSCC05)]
Storage Cell
• One storage node for each thread
• Storage node
  – Tristated by writel to assist NFET-only pass-gate writes.
  – writel drain-connected PFETs provide extra pull-up during a thread switch and make writes easier.
[Figure: storage cell schematic showing the storage nodes and thread selection. Signal labels: b0, nb0, b1, nb1, thread, writel, WRITEH, ida, idb. (FetzerISSCC05)]
Register File READ/WRITE (1)
• Buffered read
  – Isolate the cell from the read BL
• Additional buffering from write
  – Isolate stored data from read access.
  – Improve the write timing
[Figure: read/write circuit schematic. Signal labels: wordline[9:0], writel, writei, write, read, write bitline, read bitline, active data, inactive data.]
Register File READ/WRITE (2)
• Port sharing
  – Active thread READ shares wordlines with inactive WRITE
  – Reduce the number of total ports
[Figure: read/write circuit schematic. Signal labels: wordline[9:0], writel, writei, write, read, write bitline, read bitline, active data, inactive data.]
Register File READ/WRITE (3)
• Wordline conditioned by writel
  – writel high enables the read
  – writel low enables the pull-up for the write.
[Figure: read/write circuit schematic. Signal labels: wordline[9:0], writel, writei, write, read, write bitline, read bitline, active data, inactive data.]
Register File Organization
• 8 banks
  – 16 registers per bank
• 8 cells per bitline
  – 2 bitlines merge at the sense-amplifier
  – Small number of cells
• Logic gate as the sense amplifier
• Pre-charged and evaluates low (high-skew)
• 200ps access time!
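The numbers on the organization slide fit together as 8 banks x 16 registers = 128 entries, with each bank split into 2 local bitlines of 8 cells. A minimal sketch of that arithmetic follows; the mapping from register number to bank, local bitline, and cell is an assumption chosen for illustration, since the slides do not give the actual address assignment.

```python
BANKS = 8               # from the slide
REGS_PER_BANK = 16      # from the slide
CELLS_PER_BITLINE = 8   # from the slide; 2 local bitlines per bank follow

def locate(reg):
    """Map a register number to (bank, local_bitline, cell).

    The linear mapping used here is a hypothetical choice, not the documented one.
    """
    if not 0 <= reg < BANKS * REGS_PER_BANK:
        raise ValueError("register number out of range")
    bank, offset = divmod(reg, REGS_PER_BANK)
    local_bitline, cell = divmod(offset, CELLS_PER_BITLINE)
    return bank, local_bitline, cell


total_entries = BANKS * REGS_PER_BANK
first = locate(0)
middle = locate(24)
last = locate(127)
```

Keeping only 8 cells on each local bitline limits its capacitance, which is consistent with using a plain logic gate rather than a differential sense amplifier.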
Register File Read Path
[Figure: read path schematic, from the pulldown in the bitcell to the global bitline circuit. Labels: CK, PRECK, reg0 ... reg7, read0 ... read7, local0, local1, LG0 ... LG8, global, read.]
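The read path can be approximated with a small logic-level model of precharged bitlines. Several details here are assumptions, not taken from the slides: the merge gate is modeled as a NAND of the two local bitlines, a stored 1 is taken to discharge its local bitline, and the function name `read_global_bitline` is hypothetical. Timing and the PRECK precharge phase are not modeled.

```python
def read_global_bitline(cells_local0, cells_local1, select=None):
    """Evaluate one bank's read path after precharge.

    select is (local_bitline, cell) for the asserted read wordline, or None.
    Returns the global bitline level: 1 if it stays precharged, 0 if pulled low.
    """
    local = [1, 1]                      # both local bitlines precharged high
    if select is not None:
        bitline, cell = select
        stored = (cells_local0, cells_local1)[bitline][cell]
        if stored:
            local[bitline] = 0          # bitcell pulldown discharges the local bitline
    merge = 0 if (local[0] and local[1]) else 1   # assumed NAND merge gate
    return 0 if merge else 1            # merge output drives the global pulldown


bank_lo = [0, 1, 0, 0, 0, 0, 0, 0]
bank_hi = [0, 0, 0, 0, 0, 0, 0, 1]
reads_one = read_global_bitline(bank_lo, bank_hi, select=(0, 1))
reads_zero = read_global_bitline(bank_lo, bank_hi, select=(0, 0))
idle = read_global_bitline(bank_lo, bank_hi)
```

Under these assumptions the global bitline evaluates low only when the selected cell holds a 1, so the read is single-ended.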
READ Simulation
• Just over 200ps from CK to global bitline evaluate
  – PCK2X pulses twice per cycle
  – Matchb is the wordline enable signal.
• Local read/write signals generated from each wordline
[Figure: simulated read waveforms. Traces: PCK2X, Matchb, Wordline, Read, Local BL, Global BL.]
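WRITE Simulation
[Figure: simulated write waveforms for writing a “1” and writing a “0”. Traces: wordline, write, WRITEH, b0, nb0. Shown alongside the storage cell schematic with labels thread, writel, WRITEH, ida, idb, write bitline, write, wordline, and an output to the read port and parity.]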
Floating Nodes During Write
• The storage node in the inactive thread floats low during writes to the active thread.
• At low frequency data could be lost, so a timer is implemented on WRITEH to end the writes early.
• NCK rises and nr1 slowly drops. If the NCK phase is long enough, enable drops low, ending the write.
[Figure: RF storage node schematic (labels: b0, nb0, writel, WRITEH) and timer circuit built from slow, long-L devices (labels: NCK, nr1, enable, treadchanged).]
Switching Threads
• The READ/WRITE I/O ports look like large capacitors, so there is a significant amount of charge sharing
• WRITEH is held at GND when thread/thread_b change values
[Figure: storage cell schematic. Signal labels: b0, nb0, b1, nb1, thread, writel, WRITEH, ida, idb.]
Switching Threads Simulation
[Figure: simulated waveforms during a thread switch. Traces: thread, ida, idb, b0, nb0, b1, with the annotation "Needed or b1 would fail!". Shown alongside the storage cell schematic with labels b0, nb0, b1, nb1, thread, writel, WRITEH, ida, idb.]
Parity
• Parity ripples through 32 stages in three clock cycles after a write (41 stages in four cycles in FPU)
• The two-bit parity computation is 6.5 FETs per bit out of 109.5 (<6.0%)
[Figure: two-bit parity stage schematic and its functional representation, with inputs bit i-1, bit i, and parityin and output parityout. Some FETs are shared with read buffering. Signal labels: d0i, d0b, d1i, d1b, midp, outpb, parityin, parityout.]
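The stage count and overhead figure on the Parity slide can be checked with a short sketch. It assumes each stage folds two data bits into the running parity with XOR, which is the usual definition of parity but is not spelled out on the slide; the function names and the sample data word are hypothetical. With 64 datapath bits and two bits per stage, the ripple is 32 stages long, matching the slide.

```python
def parity_stage(parity_in, bit_lo, bit_hi):
    """One two-bit stage: fold two data bits into the running parity (assumed XOR)."""
    return parity_in ^ bit_lo ^ bit_hi

def ripple_parity(bits, seed=0):
    """Ripple parity through two-bit stages; returns (parity, number_of_stages)."""
    if len(bits) % 2:
        raise ValueError("expected an even number of bits")
    parity = seed
    stages = 0
    for i in range(0, len(bits), 2):
        parity = parity_stage(parity, bits[i], bits[i + 1])
        stages += 1
    return parity, stages


value = 0xDEADBEEF0BADF00D                       # arbitrary 64-bit sample word
word = [(value >> i) & 1 for i in range(64)]
parity, stages = ripple_parity(word)

# Overhead quoted on the slide: 6.5 of 109.5 FETs per bit.
overhead = 6.5 / 109.5
print(f"{stages} stages, parity {parity}, overhead {overhead:.2%}")
```

The same arithmetic gives 41 stages for an 82-bit floating point entry.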
Parity State Machine
• The parity state machine is below the data array and gets the same inputs (wordlines/write/parity_in) as a bitcell
• Parity is continuously computed and checked
  – Register file outputs parity error.
  – Scan can observe a parity error before the register is read
• ParityError is read with a duplicate of a register read circuit
[Figure: parity state machine for Register N, with an XOR computation tree over b0, b1, b2, ... b81. Signal labels: thread, en, write, parity, ParityComp, StoredParity, ParitySeed, ParityError, ThreadChanged.]
Register File Comparison
Design            Montecito Integer   Montecito FP    McKinley Integer (ISSCC 2002)
Technology        0.09μm              0.09μm          0.18μm
Write Ports       10                  10              8
Read Ports        12                  8               12
Registers         144 x 65bit         128 x 82bit     128 x 65bit
Transistors       1.43M               1.30M           832K
Parity SM Area    0.098mm²            0.083mm²        NA
Array Area        0.930mm²            0.935mm²        1.67mm²
Decoder Area      0.330mm²            0.220mm²        0.39mm²
Global Overhead   0.012mm²            0.052mm²        0.13mm²
Total Size        1.37mm²             1.29mm²         2.2mm²
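As a consistency check on the comparison table, the component areas can be summed and compared with the reported totals. The sketch below uses only values from the table; the dictionary layout and the function name `summarize` are illustrative choices. McKinley has no parity state machine entry because the table lists it as NA.

```python
AREAS_MM2 = {
    "Montecito Integer": {"parity_sm": 0.098, "array": 0.930, "decoder": 0.330, "global": 0.012},
    "Montecito FP": {"parity_sm": 0.083, "array": 0.935, "decoder": 0.220, "global": 0.052},
    "McKinley Integer": {"array": 1.67, "decoder": 0.39, "global": 0.13},
}
REPORTED_TOTAL_MM2 = {
    "Montecito Integer": 1.37,
    "Montecito FP": 1.29,
    "McKinley Integer": 2.2,
}

def summarize(design):
    """Return (sum of component areas, fraction of that sum taken by the array)."""
    parts = AREAS_MM2[design]
    total = sum(parts.values())
    return total, parts["array"] / total


for design in AREAS_MM2:
    total, array_share = summarize(design)
    print(f"{design}: components sum to {total:.3f} mm^2 "
          f"(reported {REPORTED_TOTAL_MM2[design]}), array share {array_share:.0%}")
```

The sums agree with the reported totals to within rounding, and in each design the data array accounts for most of the area.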
Summary
• Register files are critical functional units similar to ALUs.
  – Determine the cycle-time of a processor
• Highly constrained memory design
  – Small number of entries
  – Large number of ports
  – Highly partitioned (tradeoff of #ports per cell versus many cells).
• Cell design is highly specialized.
  – Single-ended reads
  – Buffered reads
  – Multi-threading
• Sense-amplifiers are often digital logic gates
• Parity protection is increasingly critical for reliability.