
Lecture 11: Memory Data Flow Techniques


Page 1: Lecture 11: Memory Data Flow Techniques


Lecture 11: Memory Data Flow Techniques

Load/store buffer design, memory-level parallelism, consistency model, memory disambiguation

Page 2: Lecture 11: Memory Data Flow Techniques

Load/Store Execution Steps

Load: LW R2, 0(R1)
  1. Generate virtual address; may wait on base register
  2. Translate virtual address into physical address
  3. Read data cache

Store: SW R2, 0(R1)
  1. Generate virtual address; may wait on base register and data register
  2. Translate virtual address into physical address
  3. Write data cache

Unlike register accesses, memory addresses are not known prior to execution
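For concreteness, a minimal C sketch of the two sequences; the helpers translate, dcache_read, and dcache_write are illustrative stand-ins, not a real interface:

    #include <stdint.h>

    static uint64_t regfile[32];           /* architectural registers */
    static uint64_t memory[1u << 20];      /* toy backing store       */

    /* Illustrative stand-ins for the TLB and D-cache ports. */
    static uint64_t translate(uint64_t va) { return va; /* identity map */ }
    static uint64_t dcache_read(uint64_t pa) { return memory[pa % (1u << 20)]; }
    static void dcache_write(uint64_t pa, uint64_t d) { memory[pa % (1u << 20)] = d; }

    /* LW rd, offset(base): the address depends on a register, so unlike
     * a register operand it is not known until execution. */
    void exec_load(int rd, int base, int16_t offset) {
        uint64_t va = regfile[base] + (uint64_t)(int64_t)offset; /* 1. generate VA */
        uint64_t pa = translate(va);                             /* 2. VA -> PA    */
        regfile[rd] = dcache_read(pa);                           /* 3. read cache  */
    }

    /* SW rs, offset(base): also waits on the data register rs. */
    void exec_store(int rs, int base, int16_t offset) {
        uint64_t va = regfile[base] + (uint64_t)(int64_t)offset;
        uint64_t pa = translate(va);
        dcache_write(pa, regfile[rs]);                           /* 3. write cache */
    }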

Page 3: Lecture 11: Memory Data Flow Techniques

Load/store Buffer in Tomasulo

Support memory-level parallelism

Loads wait in the load buffer until their addresses are ready; memory reads are then processed

Stores wait in the store buffer until their addresses and data are ready; memory writes wait further until the stores are committed
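A minimal sketch of the two buffer entries this implies; the field names are assumptions, not Tomasulo's notation:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     addr_ready;   /* base register arrived, address computed */
        uint64_t addr;
        int      dest_tag;     /* where the loaded value is delivered     */
    } LoadBufEntry;            /* a load may read memory once addr_ready  */

    typedef struct {
        bool     addr_ready;
        bool     data_ready;   /* store-data register arrived             */
        bool     committed;    /* ROB has committed the store             */
        uint64_t addr;
        uint64_t data;
    } StoreBufEntry;           /* memory write waits for all three flags  */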

[Figure: Fetch Unit → Decode → Rename, with a Reorder Buffer; reservation stations (RS) feed FU1 and FU2; a load buffer (L-buf) and store buffer (S-buf) sit between the register file (Regfile) and data memory (DM); IM = instruction memory]

Page 4: Lecture 11: Memory Data Flow Techniques

Load/store Unit with Centralized RS

The centralized RS includes part of the load/store buffer function of the Tomasulo design

Loads and stores wait in the RS until they are ready

[Figure: the same pipeline with a centralized RS feeding FU1, FU2, a store unit (S-unit), and a load unit (L-unit); the store unit sends addr and data to a store buffer in front of the cache, while the load unit sends addr to the cache directly]

Page 5: Lecture 11: Memory Data Flow Techniques

Memory-level Parallelism

for (i = 0; i < 100; i++)
    A[i] = A[i]*2;

Loop: L.S  F2, 0(R1)
      MULT F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop

F4 stores 2.0

[Figure: LW1/SW1, LW2/SW2, LW3/SW3 from successive iterations overlapping in time]

Significant improvement over sequential reads/writes
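The overlap is legal because each iteration touches a distinct word A+4i, so a later iteration's load has no RAW dependence on an earlier iteration's store; an illustrative C check of that condition (not from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    /* May the load of iteration j (j > i) start before the store of
     * iteration i has reached memory? Yes iff the addresses differ. */
    bool load_may_bypass_store(uint64_t A, int i, int j) {
        uint64_t store_addr = A + 4u * (uint64_t)i;   /* SW of iteration i */
        uint64_t load_addr  = A + 4u * (uint64_t)j;   /* LW of iteration j */
        return load_addr != store_addr;               /* distinct words    */
    }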

Page 6: Lecture 11: Memory Data Flow Techniques

Memory Consistency

Memory contents must be the same as under sequential execution

Must respect RAW, WAW, and WAR dependences

Practical implementations:
  1. Reads may proceed out of order
  2. Writes proceed to memory in program order
  3. Reads may bypass earlier writes only if their addresses are different
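As an illustration only (not hardware from the slides), the three rules can be written as a legality check on an issue order:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     is_write;
        uint64_t addr;
    } MemOp;

    /* ops[] is in program order; issue[k] is the program index of the
     * k-th op to reach memory. Checks the three rules above. */
    bool schedule_is_legal(const MemOp *ops, const int *issue, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                int first  = issue[i];               /* issued earlier     */
                int second = issue[j];               /* issued later       */
                if (first < second) continue;        /* program order kept */
                /* a program-younger op was issued before an older one */
                if (ops[first].is_write)
                    return false;                    /* rule 2 violated    */
                if (ops[second].is_write &&
                    ops[second].addr == ops[first].addr)
                    return false;                    /* rule 3: RAW bypassed */
                /* otherwise rule 1: reads may pass reads and
                 * different-address writes */
            }
        }
        return true;
    }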

Page 7: Lecture 11: Memory Data Flow Techniques


Store Stages in Dynamic Execution

1. Wait in RS until base address and store data are available (ready)

2. Move to store unit for address calculation and address translation

3. Move to store buffer (finished)

4. Wait for ROB commit (completed)

5. Write to data cache (retired)

Stores always retire in order, preserving WAW and WAR dependences

Source: Shen and Lipasti, page 197
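The five stages read naturally as a small state machine; a hedged C sketch with invented state names:

    #include <stdbool.h>

    /* Store lifecycle from the list above. */
    typedef enum {
        ST_IN_RS,       /* 1. waiting for base address and store data */
        ST_EXECUTING,   /* 2. address calculation and translation     */
        ST_FINISHED,    /* 3. sitting in the store buffer             */
        ST_COMPLETED,   /* 4. committed by the ROB                    */
        ST_RETIRED      /* 5. written to the data cache               */
    } StoreStage;

    StoreStage advance(StoreStage s, bool operands_ready,
                       bool rob_commit, bool cache_port_free) {
        switch (s) {
        case ST_IN_RS:     return operands_ready  ? ST_EXECUTING : s;
        case ST_EXECUTING: return ST_FINISHED;
        case ST_FINISHED:  return rob_commit      ? ST_COMPLETED : s;
        case ST_COMPLETED: return cache_port_free ? ST_RETIRED   : s;
        default:           return s;   /* ST_RETIRED is terminal */
        }
    }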

[Figure: stores flow from the RS through the store unit into a store buffer with finished and completed sections, then retire to the D-cache; the load unit accesses the D-cache directly]

Page 8: Lecture 11: Memory Data Flow Techniques

Load Bypassing and Memory Disambiguation

To exploit memory-level parallelism, loads have to bypass earlier stores, but this may violate RAW dependences

Dynamic memory disambiguation: detect memory dependences at run time by comparing a load's address with every older store's address
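A minimal sketch of that associative compare against a program-ordered store buffer; the structure and names are assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid;
        bool     addr_known;   /* address already computed? */
        uint64_t addr;
    } OlderStore;

    /* Compare a load's address against every older store. Returns
     * true if the load may safely bypass all of them. */
    bool may_bypass(uint64_t load_addr, const OlderStore *sb, int n) {
        for (int i = 0; i < n; i++) {
            if (!sb[i].valid) continue;
            if (!sb[i].addr_known)          /* unknown: conservatively wait */
                return false;
            if (sb[i].addr == load_addr)    /* RAW dependence detected      */
                return false;
        }
        return true;
    }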

Page 9: Lecture 11: Memory Data Flow Techniques


Load Bypassing Implementation

Load-unit steps: 1. address calculation; 2. address translation; 3. if no match, update the destination register

The load address is checked by associative search against the store buffer for a matching address

Assume in-order execution of loads/stores

[Figure: store unit and load unit side by side; the store unit's addr/data entries fill the store buffer in front of the D-cache, and the load's address is matched against the buffered store addresses]
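Putting the three steps together, a hedged sketch of the load path; the helper stubs stand in for the TLB, store buffer, cache, and ROB:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative stubs, not a real interface. */
    static uint64_t translate(uint64_t va)          { return va; }
    static bool     store_buffer_match(uint64_t pa) { (void)pa; return false; }
    static uint64_t dcache_read(uint64_t pa)        { (void)pa; return 0; }
    static void     write_dest_reg(int tag, uint64_t v) { (void)tag; (void)v; }

    /* Returns true if the load bypassed older stores and finished. */
    bool try_load_bypass(uint64_t base, int16_t off, int rob_tag) {
        uint64_t va = base + (uint64_t)(int64_t)off;  /* 1. address calc.   */
        uint64_t pa = translate(va);                  /* 2. address trans.  */
        if (store_buffer_match(pa))                   /* associative search */
            return false;                             /* match: must wait   */
        write_dest_reg(rob_tag, dcache_read(pa));     /* 3. update dest reg */
        return true;
    }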

Page 10: Lecture 11: Memory Data Flow Techniques

Load Forwarding

Load forwarding: if a load address matches an older store address, the store's data can be forwarded to the load

If a match is found, forward the related data to the destination register (in the ROB)

Multiple matches may exist; the last one (the youngest older store) wins

[Figure: as before, but on a store-buffer match the buffered data is forwarded to the destination register instead of reading the D-cache; loads/stores still execute in order]
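A sketch of the forwarding search; scanning older stores from oldest to youngest makes the last match win (the layout is an assumption):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid;
        uint64_t addr, data;
    } SQEntry;

    /* Search older stores (index 0 = oldest) for the load's address. */
    bool forward_from_stores(const SQEntry *sq, int n_older,
                             uint64_t load_addr, uint64_t *out) {
        bool hit = false;
        for (int i = 0; i < n_older; i++) {
            if (sq[i].valid && sq[i].addr == load_addr) {
                *out = sq[i].data;   /* later match overwrites earlier one */
                hit  = true;
            }
        }
        return hit;   /* if false, the load reads the D-cache instead */
    }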

Page 11: Lecture 11: Memory Data Flow Techniques

In-order Issue Limitation

Any store in the RS may block all following loads

When is F2 of SW available? When is the next L.S ready? (Assume reasonable FU latency and pipeline length.)

for (i = 0; i < 100; i++)
    A[i] = A[i]/2;

Loop: L.S  F2, 0(R1)
      DIV  F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop

Because DIV is long-latency, the SW's data arrives late; with in-order issue the next iteration's independent L.S stalls behind it

Page 12: Lecture 11: Memory Data Flow Techniques


Speculative Load Execution

[Figure: loads now execute out of order; finished loads are recorded with their addresses in a finished-load buffer, and each completing store's address is matched against it at completion; if a match is found, the pipeline is flushed]

Forwarding does not always work, since some store addresses may still be unknown when a load is ready

No match: predict that the load has no RAW dependence on older stores and let it proceed

Flush the pipeline at commit if the prediction was wrong
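A sketch of the check performed when a store completes; the finished-load buffer layout is an assumption:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid;
        uint64_t addr;
    } FinishedLoad;

    /* At store completion, check the finished-load buffer: a younger
     * load that already executed with the same address read stale
     * data, so the pipeline must be flushed (at commit). */
    bool store_raw_violation(uint64_t store_addr,
                             const FinishedLoad *flb, int n) {
        for (int i = 0; i < n; i++)
            if (flb[i].valid && flb[i].addr == store_addr)
                return true;     /* match: misspeculated load      */
        return false;            /* no match: speculation was safe */
    }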

Page 13: Lecture 11: Memory Data Flow Techniques

Alpha 21264 Pipeline

[Figure: Alpha 21264 pipeline diagram]

Page 14: Lecture 11: Memory Data Flow Techniques

Alpha 21264 Load/Store Queues

[Figure: an int issue queue feeds two Int ALUs and two Addr ALUs over two 80-entry Int RF copies; an fp issue queue feeds two FP ALUs over a 72-entry FP RF; addresses pass through the D-TLB to the load queue (L-Q), store queue (S-Q), and AF, in front of a dual D-Cache]

32-entry load queue, 32-entry store queue

Page 15: Lecture 11: Memory Data Flow Techniques

Load Bypassing, Forwarding, and RAW Detection

When a load executes, its address is matched against older store queue (SQ) entries; if match: forwarding

When a store executes, its address is matched against younger load queue (LQ) entries; if match: mark a store-load trap to flush the pipeline (at commit)

At ROB commit: Load: WAIT if the LQ head is not completed, then move the LQ head. Store: mark the SQ head as completed, then move the SQ head

[Figure: LQ and SQ beside the issue queues (IQ) and D-cache, with the forwarding and trap-marking match paths]
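A hedged sketch of the commit rule for the queue heads, using the 32-entry sizes from the previous slide; allocation and retirement to the D-cache are omitted:

    #include <stdbool.h>

    enum { QSIZE = 32 };        /* 32-entry LQ and SQ, as on the 21264 */

    typedef struct {
        bool completed[QSIZE];
        int  head;
    } Queue;

    static Queue lq, sq;

    /* Commit-time handling of the queue heads; returns true if the
     * memory op at the ROB head may commit this cycle. */
    bool commit_mem_op(bool is_load) {
        if (is_load) {
            if (!lq.completed[lq.head])      /* load not done yet: WAIT    */
                return false;
            lq.completed[lq.head] = false;   /* recycle the entry          */
            lq.head = (lq.head + 1) % QSIZE; /* move the LQ head           */
        } else {
            sq.completed[sq.head] = true;    /* store completes at commit; */
            sq.head = (sq.head + 1) % QSIZE; /* a separate retire pointer
                                                (omitted) later drains it
                                                to the D-cache             */
        }
        return true;
    }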

Page 16: Lecture 11: Memory Data Flow Techniques

Speculative Memory Disambiguation

A 1024-entry, 1-bit stWait table, indexed by the fetch PC, steers each renamed load into the int issue queue

• When a load is trapped at commit, set the stWait bit in the table, indexed by the load's PC
• When the load is fetched, read its stWait bit from the table
• If the bit is set, the load waits in the issue queue until older stores have issued
• The stWait table is cleared periodically
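A minimal sketch of the stWait mechanism; indexing by low PC bits is my assumption:

    #include <stdbool.h>
    #include <stdint.h>

    enum { TBL = 1024 };
    static bool stwait[TBL];     /* 1024 x 1-bit stWait table */

    static unsigned idx(uint64_t pc) { return (unsigned)((pc >> 2) % TBL); }

    /* Load trapped at commit: remember its PC. */
    void on_store_load_trap(uint64_t load_pc) { stwait[idx(load_pc)] = true; }

    /* At fetch: if set, the load waits in the issue queue until all
     * older stores have issued. */
    bool load_must_wait(uint64_t load_pc) { return stwait[idx(load_pc)]; }

    /* Cleared periodically so stale predictions decay. */
    void stwait_clear(void) {
        for (int i = 0; i < TBL; i++) stwait[i] = false;
    }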


Page 17: Lecture 11: Memory Data Flow Techniques


Architectural Memory States

Memory request: search the hierarchy from top to bottom

Hierarchy, top to bottom: completed LQ/SQ entries, L1-Cache, L2-Cache, L3-Cache (optional), Memory, Disk/Tape, etc.

[Figure: completed entries of the LQ/SQ sit on top of the cache hierarchy; the caches, memory, and disk below hold the committed (architectural) states]
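A sketch of that top-to-bottom search; the per-level lookup callbacks are purely illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    /* One lookup function per level; returns true on hit. Levels might
     * be: completed SQ entries, L1, L2, optional L3, memory. */
    typedef bool (*Level)(uint64_t addr, uint64_t *data);

    /* Search the hierarchy from top to bottom; the first hit wins. */
    bool mem_request(uint64_t addr, uint64_t *data,
                     const Level *levels, int n_levels) {
        for (int i = 0; i < n_levels; i++)
            if (levels[i](addr, data))      /* hit at this level          */
                return true;
        return false;                       /* miss everywhere: go to disk */
    }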

Page 18: Lecture 11: Memory Data Flow Techniques

Summary of Superscalar Execution

Instruction flow techniques: branch prediction, branch target prediction, and instruction prefetch

Register data flow techniques: register renaming, instruction scheduling, in-order commit, misprediction recovery

Memory data flow techniques: load/store units, memory consistency

Source: Shen & Lipasti