EE 215B

Design Example: Register Files

C.K. Ken Yang, UCLA

yangck@ucla.edu
Courtesy of BA, MAH

Overview

• Reading
– Papers

• Overview
– An extreme of “SRAM” design is the register file. Register files are small SRAMs that are used heavily by the datapath. They serve as very local storage that is fast to access. They often have multiple ports for simultaneous access by a number of functional units/ALUs.
– These design parameters lead to very different cell designs and performance targets. This set of notes reviews the basic concepts and shows an example of such a design.

Outline

• Architecture
– What is a register file
– 2 basic approaches

• Design Example


What Is a Register File

• Fastest memory block available to the microprocessor.
• Stores intermediate results of the microprocessor units such as ALU & MMU
• Processor performance depends directly on its access speed.


Architecture: Multi-ported Design

• At least 1 write port and 2 read ports
– Accommodate a single ALU with 2-operand instructions.
– r3 <= r2 + r1
• Superscalar designs
– Multiple functional units access the register file.
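The minimum port count above can be illustrated with a small behavioral sketch. The class name, the 32-entry size, and the read-then-write ordering inside one access are assumptions made for illustration only; they are not taken from the slides.

```python
class RegisterFile:
    """Behavioral model of a register file with 2 read ports and 1 write port."""

    def __init__(self, num_regs=32):  # 32 entries is an arbitrary, hypothetical size
        self.regs = [0] * num_regs

    def cycle(self, read_addr_a, read_addr_b, write_addr=None, write_data=None):
        """One access: both read ports sample first, then the write port updates."""
        a = self.regs[read_addr_a]
        b = self.regs[read_addr_b]
        if write_addr is not None:
            self.regs[write_addr] = write_data
        return a, b


rf = RegisterFile()
rf.cycle(0, 0, write_addr=1, write_data=5)       # r1 <= 5
rf.cycle(0, 0, write_addr=2, write_data=7)       # r2 <= 7
a, b = rf.cycle(2, 1)                            # read r2 and r1 on the two read ports
rf.cycle(0, 0, write_addr=3, write_data=a + b)   # r3 <= r2 + r1 on the write port
result = rf.regs[3]
```

A superscalar design would need one such set of ports per functional unit, which is why the port count grows quickly.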


Example: 3-ported Cell

• Separate read/write bitlines
– Single-port reads
– Dual-port write
• Enable different design constraints
– Cell sizing
– Different pre-charge of the read-port


Architecture: Multi-banking

• Multi-porting has a large cost in peripheral circuits.
– Replicate memory into many banks
• Homogeneous – even division into a number of banks.
– Faster access to each bank.
– Smaller register size
– More MUXing circuitry


Heterogeneous Multi-banking

• Dividing the ports and registers unevenly among the banks.
– Smaller bank for the critical data
– Bigger bank for the non-critical data
• Prediction of critical data based on an algorithm similar to cache prediction.


Outline

• Architecture

• Design Example
– Itanium register file


Itanium 2 Integer Register File

• 6 ALUs share 144 x 65 bit 22 ported general registers
• 128 GRs + 16 Kernel Registers aliased to R16-31
• 64 data path bits plus parity
• 12 read ports and 10 write ports
– 8 active, 2 inactive
• Active and inactive writes can occur simultaneously
• Datapath bypassing on write ports between multi-media (MMU) and integer execution units (IEU)

[Figure: integer register file layout, 1.37 mm x 1.00 mm, with MMU and IEU labels. Source: Fetzer, ISSCC05]

Integer RF Structure

[Figure: integer register file floorplan. Labeled blocks: Decode, Data Array, Parity State Machine, Address Repeater, Global Precharger, Bitline Repeater, Address Driver. Source: Fetzer, ISSCC05]

Floating Point Register File

• 128 x 82 bit 18 ported general registers
• 8 read ports
– 6 MAC data ports, 2 store data ports
• 10 write ports, 6 active 4 inactive
– 2 MAC result ports, 4 load data ports

[Figure: floating point register file layout, 1.11 mm x 1.14 mm, with two MAC labels. Source: Fetzer, ISSCC05]

Floating Point RF Structure

[Figure: floating point register file floorplan. Labeled blocks: Decode, Data Array, Parity State Machine, Address Repeater, Bitline Repeater/Global Precharger, Address Driver. Source: Fetzer, ISSCC05]

Register File Timing

[Figure: timing diagram over CK Phase 1 and CK Phase 2. READ row: Read Addr Decode, Read Local Bitline Evaluate, Read Global Bitline Evaluate, Read Local Precharge, Read Global Precharge. WRITE row: Write Addr Decode, Write Bitline Pre-discharge, Write Bit Line, Data Bypass, Register Write. Source: Fetzer, ISSCC05]

Write Following Reads

• Reading a register that is being written occurs very often
• Itanium solution
– Each register file access contains a READ followed by a WRITE.
– No contention; the READ result can be used a half-cycle early.
• Another common solution
– Write bypass:
    • WRITE while READ results in a slow read since the cell is being flipped.
    • Bypass the READ with the WRITE information at the multiplexer.
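The write-bypass alternative can be sketched behaviorally as an address compare that steers the output multiplexer. The function name and the example values are hypothetical; this models only the selection logic, not the circuit timing.

```python
def read_with_bypass(regs, read_addr, write_addr=None, write_data=None):
    """Return the read-port value, forwarding in-flight write data on an address match."""
    if write_addr is not None and write_addr == read_addr:
        # Same register is being written this cycle: take the write data at the
        # mux instead of waiting for the cell to flip.
        return write_data
    return regs[read_addr]


regs = [0, 11, 22, 33]
plain = read_with_bypass(regs, 2)                                  # no write in flight
forwarded = read_with_bypass(regs, 2, write_addr=2, write_data=99)  # match: bypass
other = read_with_bypass(regs, 1, write_addr=2, write_data=99)      # no match: normal read
```

The cost of this approach is one address comparator and one extra mux input per read port/write port pair.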


Register File Decode

• Wordline (en) is pulsed
– PCK2X pulses each phase
– Read followed by write
• WriteH is generated for the accessed register

[Figure: decoder schematic with self-timed pulse width control. Labeled signals: address, PCK2, highb, lowb, sel[i] (one read/write port), sel[9:0], NCK, WRITEH, write, en, timer_enable, matchb, wordline. Source: Fetzer, ISSCC05]

Storage Cell

• One storage node for each thread
• Storage node
– Tristated by writel to assist NFET-only pass gate writes.
– writel drain connected PFETs provide extra pull-up during a thread switch and make write easier.

[Figure: storage cell schematic showing the storage nodes and thread selection. Labeled signals: b0, nb0, b1, nb1, writel, WRITEH, thread, ida, idb. Source: Fetzer, ISSCC05]

Register File READ/WRITE (1)

• Buffered read
– Isolate the cell from the read BL
• Additional buffering from write
– Isolate stored data from read access.
– Improve the write timing

[Figure: read/write port schematic. Labeled signals: wordline[9:0], writel, writei, write, read, write bitline, read bitline, activedata, inactivedata]

Register File READ/WRITE (2)

• Port sharing
– Active thread READ shares wordlines with inactive WRITE
– Reduce the number of total ports

[Figure: read/write circuit schematic. Labeled signals: wordline[9:0], writel, writei, write, read, write bitline, read bitline, activedata, inactivedata]

Register File READ/WRITE (3)

• Wordline conditioned by writel
– Writel high enables the read
– Writel low enables the pull-up for the write.

[Figure: read/write circuit schematic. Labeled signals: wordline[9:0], writel, writei, write, read, write bitline, read bitline, activedata, inactivedata]

Register File Organization

• 8 banks
– 16 registers per bank
• 8 cells per bitline
– 2 bitlines merge at the sense-amplifier
– Small number of cells
• Logic gate as the sense amplifiers
• Pre-charged and evaluates low (high-skew)
• 200ps access time!
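The partitioning above implies 8 x 16 = 128 registers, with each bank split over 2 local bitlines of 8 cells. A minimal sketch of one possible index decomposition follows; the linear register-to-bank mapping and the function name are assumptions, and the 16 aliased kernel registers are not modeled.

```python
BANKS = 8
REGS_PER_BANK = 16
CELLS_PER_BITLINE = 8
BITLINES_PER_BANK = REGS_PER_BANK // CELLS_PER_BITLINE  # 2 local bitlines merge per bank


def locate(reg):
    """Split a register index into (bank, local bitline, cell) under a linear mapping."""
    bank, offset = divmod(reg, REGS_PER_BANK)
    local_bitline, cell = divmod(offset, CELLS_PER_BITLINE)
    return bank, local_bitline, cell


total_registers = BANKS * REGS_PER_BANK
example = locate(29)  # register 29 under the assumed mapping
```

Keeping only 8 cells on each local bitline is what allows a plain logic gate to act as the sense amplifier.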


Register File Read Path

[Figure: read path schematic, from the pulldown in the bitcell to the global bitline circuit. Labeled signals: CK, PRECK, read, read0 ... read7, reg0 ... reg7, local0, local1, LG0 ... LG8, global]

READ Simulation

• Just over 200ps from CK to global bitline evaluate
– PCK2X pulses twice per cycle
– Matchb is the wordline enable signal.
• Local read/write signals generated from each wordline

[Figure: READ simulation waveforms: Local BL, Global BL, PCK2X, Wordline, Read, Matchb]

WRITE Simulation

[Figure: WRITE simulation waveforms for writing a “1” and writing a “0” (write, WRITEH, wordline, writel, b0, nb0), alongside the storage cell schematic with signals thread, ida, idb, write bitline, and an output to the read port and parity]

Floating Nodes During Write

• The storage node in the inactive thread floats low during writes to the active thread.
• At low frequency, data could be lost, so a timer is implemented on WRITEH to end the writes early.
• NCK rises and nr1 slowly drops. If the NCK phase is long enough, enable drops low, ending the write.

[Figure: RF storage node (signals: b0, nb0, writel, WRITEH) and timer circuit built from slow, long-L devices (signals: NCK, nr1, enable, threadchanged)]
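The effect of the WRITEH timer can be sketched as a cap on the write pulse width. The function name and the picosecond values are hypothetical; the slides give no numbers for the timer delay.

```python
def write_pulse_width(nck_phase_ps, timer_delay_ps):
    """Width of the write pulse: the NCK phase, cut short once the timer fires."""
    return min(nck_phase_ps, timer_delay_ps)


TIMER_DELAY_PS = 2000  # hypothetical delay of the slow, long-L timer devices

at_speed = write_pulse_width(300, TIMER_DELAY_PS)       # timer never fires
slow_clock = write_pulse_width(50000, TIMER_DELAY_PS)   # timer ends the write early
```

At normal clock rates the timer is invisible; it only bounds how long the inactive thread's node is left floating when the clock is slowed.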

Switching Threads

• The READ/WRITE I/O ports look like large caps and there is a significant amount of charge sharing

• WRITEH is held at GND when thread/thread_b change values

[Figure: two-thread storage cell schematic. Labeled signals: b0, nb0, b1, nb1, writel, WRITEH, thread, ida, idb]

Switching Threads Simulation

[Figure: thread switch simulation waveforms (thread, ida, idb, b0, nb0, b1) with the annotation “Needed or b1 would fail!”, alongside the two-thread storage cell schematic]

Parity

• Parity ripples through 32 stages in three clock cycles after a write (41 stages in four cycles in FPU)

• The two bit parity computation is 6.5 FETs per bit out of 109.5 (<6.0%)
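The stage counts above follow from folding two data bits into the parity chain per stage: 64 bits give 32 stages and 82 bits give 41. A minimal functional sketch, assuming a simple XOR chain with a hypothetical function name (it says nothing about the transistor-level circuit):

```python
def ripple_parity(bits, parity_in=0):
    """Ripple parity through stages that each fold in two data bits."""
    if len(bits) % 2 != 0:
        raise ValueError("expected an even number of bits")
    parity = parity_in
    stages = 0
    for i in range(0, len(bits), 2):
        parity ^= bits[i] ^ bits[i + 1]  # one stage: parityout = parityin ^ bit[i] ^ bit[i+1]
        stages += 1
    return parity, stages


word = [1, 0, 1, 1] + [0] * 60                   # 64 data bits with three ones
int_parity, int_stages = ripple_parity(word)
fp_parity, fp_stages = ripple_parity([0] * 82)   # 82-bit floating point entry
fet_overhead = 6.5 / 109.5                       # parity FETs per bit over total FETs per bit
```

The overhead ratio evaluates to just under 6%, matching the figure quoted on the slide.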

[Figure: parity circuit and its functional representation (inputs biti-1, biti, parityin; output parityout). Internal nodes: d0i, d0b, d1i, d1b, midp, outpb. Some FETs are shared with read buffering.]

Parity State Machine

• The parity state machine is below the data array and gets the same inputs (wordlines/write/parity_in) as a bitcell

• Parity is continuously computed and checked
– Register file outputs parity error.
– Scan can observe a parity error before the register is read

• ParityError is read with a duplicate of a register read circuit

[Figure: parity state machine. An XOR computation tree over b0, b1, b2 ... b81 of Register N produces ParityComp. Other labeled signals: thread, en, parity, write, ParitySeed, StoredParity, ThreadChanged, ParityError]

Register File Comparison

Design            Montecito Integer   Montecito FP   McKinley Integer (ISSCC 2002)
Technology        0.09 μm             0.09 μm        0.18 μm
Write Ports       10                  10             8
Read Ports        12                  8              12
Registers         144 x 65 bit        128 x 82 bit   128 x 65 bit
Transistors       1.43M               1.30M          832K
Parity SM Area    0.098 mm²           0.083 mm²      NA
Array Area        0.930 mm²           0.935 mm²      1.67 mm²
Decoder Area      0.330 mm²           0.220 mm²      0.39 mm²
Global Overhead   0.012 mm²           0.052 mm²      0.13 mm²
Total Size        1.37 mm²            1.29 mm²       2.2 mm²
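As a quick consistency check on the comparison, the component areas of the two Montecito designs can be summed and the parity state machine share computed. The dictionary layout and key names are illustrative; the numbers are copied from the comparison.

```python
areas_mm2 = {
    "Montecito Integer": {"parity_sm": 0.098, "array": 0.930, "decoder": 0.330, "global": 0.012},
    "Montecito FP": {"parity_sm": 0.083, "array": 0.935, "decoder": 0.220, "global": 0.052},
}

# Sum of the listed components, rounded to the precision used in the comparison.
totals = {name: round(sum(parts.values()), 3) for name, parts in areas_mm2.items()}

# Fraction of each register file's area spent on the parity state machine.
parity_share = {
    name: parts["parity_sm"] / sum(parts.values()) for name, parts in areas_mm2.items()
}
```

The sums reproduce the listed totals (1.37 mm² and 1.29 mm²), and the parity state machine comes to roughly 6 to 7% of each register file.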


Summary

• Register files are critical functional units similar to ALUs.
– Determine the cycle-time of a processor
• Highly constrained memory design
– Small number of entries
– Large number of ports
– Highly partitioned (tradeoff of #ports per cell versus many cells).
• Cell design is highly specialized.
– Single-ended reads
– Buffered reads
– Multi-threading
• Sense-amplifiers are often digital logic gates
• Parity protection is increasingly critical for reliability.
