
Lecture 11: Memory Data Flow Techniques


Page 1: Lecture 11: Memory Data Flow Techniques


Lecture 11: Memory Data Flow Techniques

Load/store buffer design, memory-level parallelism, consistency model, memory disambiguation

Page 2: Lecture 11: Memory Data Flow Techniques

Load/Store Execution Steps

Load: LW R2, 0(R1)
  1. Generate virtual address; may wait on base register
  2. Translate virtual address into physical address
  3. Read data cache

Store: SW R2, 0(R1)
  1. Generate virtual address; may wait on base register and data register
  2. Translate virtual address into physical address
  3. Write data cache

Unlike register accesses, memory addresses are not known prior to execution
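For concreteness, a minimal C sketch of the two sequences; the helpers translate, dcache_read, and dcache_write are illustrative stand-ins, not a real interface:

    #include <stdint.h>

    static uint64_t regfile[32];           /* architectural registers */
    static uint64_t memory[1u << 20];      /* toy backing store       */

    /* Illustrative stand-ins for the TLB and D-cache ports. */
    static uint64_t translate(uint64_t va) { return va; /* identity map */ }
    static uint64_t dcache_read(uint64_t pa) { return memory[pa % (1u << 20)]; }
    static void dcache_write(uint64_t pa, uint64_t d) { memory[pa % (1u << 20)] = d; }

    /* LW rd, offset(base): the address depends on a register, so unlike
     * a register operand it is not known until execution. */
    void exec_load(int rd, int base, int16_t offset) {
        uint64_t va = regfile[base] + (uint64_t)(int64_t)offset; /* 1. generate VA */
        uint64_t pa = translate(va);                             /* 2. VA -> PA    */
        regfile[rd] = dcache_read(pa);                           /* 3. read cache  */
    }

    /* SW rs, offset(base): also waits on the data register rs. */
    void exec_store(int rs, int base, int16_t offset) {
        uint64_t va = regfile[base] + (uint64_t)(int64_t)offset;
        uint64_t pa = translate(va);
        dcache_write(pa, regfile[rs]);                           /* 3. write cache */
    }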

Page 3: Lecture 11: Memory Data Flow Techniques

Load/store Buffer in Tomasulo

Support memory-level parallelism

Loads wait in the load buffer until their addresses are ready; memory reads are then processed

Stores wait in the store buffer until their addresses and data are ready; memory writes wait further until the stores are committed
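A minimal sketch of the two buffer entries this implies; the field names are assumptions, not Tomasulo's notation:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     addr_ready;   /* base register arrived, address computed */
        uint64_t addr;
        int      dest_tag;     /* where the loaded value is delivered     */
    } LoadBufEntry;            /* a load may read memory once addr_ready  */

    typedef struct {
        bool     addr_ready;
        bool     data_ready;   /* store-data register arrived             */
        bool     committed;    /* ROB has committed the store             */
        uint64_t addr;
        uint64_t data;
    } StoreBufEntry;           /* memory write waits for all three flags  */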

[Figure: Fetch Unit → Decode → Rename, with a Reorder Buffer; reservation stations (RS) feed FU1 and FU2; a load buffer (L-buf) and store buffer (S-buf) sit between the register file (Regfile) and data memory (DM); IM = instruction memory]

Page 4: Lecture 11: Memory Data Flow Techniques

Load/store Unit with Centralized RS

The centralized RS includes part of the load/store buffer function of the Tomasulo design

Loads and stores wait in the RS until they are ready

[Figure: the same pipeline with a centralized RS feeding FU1, FU2, a store unit (S-unit), and a load unit (L-unit); the store unit sends addr and data to a store buffer in front of the cache, while the load unit sends addr to the cache directly]

Page 5: Lecture 11: Memory Data Flow Techniques

Memory-level Parallelism

for (i = 0; i < 100; i++)
    A[i] = A[i]*2;

Loop: L.S  F2, 0(R1)
      MULT F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop

F4 stores 2.0

[Figure: LW1/SW1, LW2/SW2, LW3/SW3 from successive iterations overlapping in time]

Significant improvement over sequential reads/writes
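The overlap is legal because each iteration touches a distinct word A+4i, so a later iteration's load has no RAW dependence on an earlier iteration's store; an illustrative C check of that condition (not from the slides):

    #include <stdbool.h>
    #include <stdint.h>

    /* May the load of iteration j (j > i) start before the store of
     * iteration i has reached memory? Yes iff the addresses differ. */
    bool load_may_bypass_store(uint64_t A, int i, int j) {
        uint64_t store_addr = A + 4u * (uint64_t)i;   /* SW of iteration i */
        uint64_t load_addr  = A + 4u * (uint64_t)j;   /* LW of iteration j */
        return load_addr != store_addr;               /* distinct words    */
    }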

Page 6: Lecture 11: Memory Data Flow Techniques

Memory Consistency

Memory contents must be the same as under sequential execution

Must respect RAW, WAW, and WAR dependences

Practical implementations:
  1. Reads may proceed out of order
  2. Writes proceed to memory in program order
  3. Reads may bypass earlier writes only if their addresses are different
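As an illustration only (not hardware from the slides), the three rules can be written as a legality check on an issue order:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     is_write;
        uint64_t addr;
    } MemOp;

    /* ops[] is in program order; issue[k] is the program index of the
     * k-th op to reach memory. Checks the three rules above. */
    bool schedule_is_legal(const MemOp *ops, const int *issue, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                int first  = issue[i];               /* issued earlier     */
                int second = issue[j];               /* issued later       */
                if (first < second) continue;        /* program order kept */
                /* a program-younger op was issued before an older one */
                if (ops[first].is_write)
                    return false;                    /* rule 2 violated    */
                if (ops[second].is_write &&
                    ops[second].addr == ops[first].addr)
                    return false;                    /* rule 3: RAW bypassed */
                /* otherwise rule 1: reads may pass reads and
                 * different-address writes */
            }
        }
        return true;
    }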

Page 7: Lecture 11: Memory Data Flow Techniques


Store Stages in Dynamic Execution

1. Wait in RS until base address and store data are available (ready)

2. Move to store unit for address calculation and address translation

3. Move to store buffer (finished)

4. Wait for ROB commit (completed)

5. Write to data cache (retired)

Stores always retire in order, preserving WAW and WAR dependences

Source: Shen and Lipasti, page 197
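The five stages read naturally as a small state machine; a hedged C sketch with invented state names:

    #include <stdbool.h>

    /* Store lifecycle from the list above. */
    typedef enum {
        ST_IN_RS,       /* 1. waiting for base address and store data */
        ST_EXECUTING,   /* 2. address calculation and translation     */
        ST_FINISHED,    /* 3. sitting in the store buffer             */
        ST_COMPLETED,   /* 4. committed by the ROB                    */
        ST_RETIRED      /* 5. written to the data cache               */
    } StoreStage;

    StoreStage advance(StoreStage s, bool operands_ready,
                       bool rob_commit, bool cache_port_free) {
        switch (s) {
        case ST_IN_RS:     return operands_ready  ? ST_EXECUTING : s;
        case ST_EXECUTING: return ST_FINISHED;
        case ST_FINISHED:  return rob_commit      ? ST_COMPLETED : s;
        case ST_COMPLETED: return cache_port_free ? ST_RETIRED   : s;
        default:           return s;   /* ST_RETIRED is terminal */
        }
    }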

[Figure: stores flow from the RS through the store unit into a store buffer with finished and completed sections, then retire to the D-cache; the load unit accesses the D-cache directly]

Page 8: Lecture 11: Memory Data Flow Techniques

Load Bypassing and Memory Disambiguation

To exploit memory-level parallelism, loads have to bypass earlier stores, but this may violate RAW dependences

Dynamic memory disambiguation: detect memory dependences at run time by comparing a load's address with every older store's address
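A minimal sketch of that associative compare against a program-ordered store buffer; the structure and names are assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid;
        bool     addr_known;   /* address already computed? */
        uint64_t addr;
    } OlderStore;

    /* Compare a load's address against every older store. Returns
     * true if the load may safely bypass all of them. */
    bool may_bypass(uint64_t load_addr, const OlderStore *sb, int n) {
        for (int i = 0; i < n; i++) {
            if (!sb[i].valid) continue;
            if (!sb[i].addr_known)          /* unknown: conservatively wait */
                return false;
            if (sb[i].addr == load_addr)    /* RAW dependence detected      */
                return false;
        }
        return true;
    }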

Page 9: Lecture 11: Memory Data Flow Techniques


Load Bypassing Implementation

Load-unit steps: 1. address calculation; 2. address translation; 3. if no match, update the destination register

The load address is checked by associative search against the store buffer for a matching address

Assume in-order execution of loads/stores

[Figure: store unit and load unit side by side; the store unit's addr/data entries fill the store buffer in front of the D-cache, and the load's address is matched against the buffered store addresses]
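Putting the three steps together, a hedged sketch of the load path; the helper stubs stand in for the TLB, store buffer, cache, and ROB:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative stubs, not a real interface. */
    static uint64_t translate(uint64_t va)          { return va; }
    static bool     store_buffer_match(uint64_t pa) { (void)pa; return false; }
    static uint64_t dcache_read(uint64_t pa)        { (void)pa; return 0; }
    static void     write_dest_reg(int tag, uint64_t v) { (void)tag; (void)v; }

    /* Returns true if the load bypassed older stores and finished. */
    bool try_load_bypass(uint64_t base, int16_t off, int rob_tag) {
        uint64_t va = base + (uint64_t)(int64_t)off;  /* 1. address calc.   */
        uint64_t pa = translate(va);                  /* 2. address trans.  */
        if (store_buffer_match(pa))                   /* associative search */
            return false;                             /* match: must wait   */
        write_dest_reg(rob_tag, dcache_read(pa));     /* 3. update dest reg */
        return true;
    }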

Page 10: Lecture 11: Memory Data Flow Techniques

Load Forwarding

Load forwarding: if a load address matches an older store address, the store's data can be forwarded to the load

If a match is found, forward the related data to the destination register (in the ROB)

Multiple matches may exist; the last one (the youngest older store) wins

[Figure: as before, but on a store-buffer match the buffered data is forwarded to the destination register instead of reading the D-cache; loads/stores still execute in order]
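A sketch of the forwarding search; scanning older stores from oldest to youngest makes the last match win (the layout is an assumption):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid;
        uint64_t addr, data;
    } SQEntry;

    /* Search older stores (index 0 = oldest) for the load's address. */
    bool forward_from_stores(const SQEntry *sq, int n_older,
                             uint64_t load_addr, uint64_t *out) {
        bool hit = false;
        for (int i = 0; i < n_older; i++) {
            if (sq[i].valid && sq[i].addr == load_addr) {
                *out = sq[i].data;   /* later match overwrites earlier one */
                hit  = true;
            }
        }
        return hit;   /* if false, the load reads the D-cache instead */
    }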

Page 11: Lecture 11: Memory Data Flow Techniques

In-order Issue Limitation

Any store in the RS may block all following loads

When is F2 of SW available? When is the next L.S ready? (Assume reasonable FU latency and pipeline length.)

for (i = 0; i < 100; i++)
    A[i] = A[i]/2;

Loop: L.S  F2, 0(R1)
      DIV  F2, F2, F4
      SW   F2, 0(R1)
      ADD  R1, R1, 4
      BNE  R1, R3, Loop

Because DIV is long-latency, the SW's data arrives late; with in-order issue the next iteration's independent L.S stalls behind it

Page 12: Lecture 11: Memory Data Flow Techniques


Speculative Load Execution

[Figure: loads now execute out of order; finished loads are recorded with their addresses in a finished-load buffer, and each completing store's address is matched against it at completion; if a match is found, the pipeline is flushed]

Forwarding does not always work, since some store addresses may still be unknown when a load is ready

No match: predict that the load has no RAW dependence on older stores and let it proceed

Flush the pipeline at commit if the prediction was wrong
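A sketch of the check performed when a store completes; the finished-load buffer layout is an assumption:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid;
        uint64_t addr;
    } FinishedLoad;

    /* At store completion, check the finished-load buffer: a younger
     * load that already executed with the same address read stale
     * data, so the pipeline must be flushed (at commit). */
    bool store_raw_violation(uint64_t store_addr,
                             const FinishedLoad *flb, int n) {
        for (int i = 0; i < n; i++)
            if (flb[i].valid && flb[i].addr == store_addr)
                return true;     /* match: misspeculated load      */
        return false;            /* no match: speculation was safe */
    }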

Page 13: Lecture 11: Memory Data Flow Techniques

Alpha 21264 Pipeline

[Figure: Alpha 21264 pipeline diagram]

Page 14: Lecture 11: Memory Data Flow Techniques

Alpha 21264 Load/Store Queues

[Figure: an int issue queue feeds two Int ALUs and two Addr ALUs over two 80-entry Int RF copies; an fp issue queue feeds two FP ALUs over a 72-entry FP RF; addresses pass through the D-TLB to the load queue (L-Q), store queue (S-Q), and AF, in front of a dual D-Cache]

32-entry load queue, 32-entry store queue

Page 15: Lecture 11: Memory Data Flow Techniques

Load Bypassing, Forwarding, and RAW Detection

When a load executes, its address is matched against older store queue (SQ) entries; if match: forwarding

When a store executes, its address is matched against younger load queue (LQ) entries; if match: mark a store-load trap to flush the pipeline (at commit)

At ROB commit: Load: WAIT if the LQ head is not completed, then move the LQ head. Store: mark the SQ head as completed, then move the SQ head

[Figure: LQ and SQ beside the issue queues (IQ) and D-cache, with the forwarding and trap-marking match paths]
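A hedged sketch of the commit rule for the queue heads, using the 32-entry sizes from the previous slide; allocation and retirement to the D-cache are omitted:

    #include <stdbool.h>

    enum { QSIZE = 32 };        /* 32-entry LQ and SQ, as on the 21264 */

    typedef struct {
        bool completed[QSIZE];
        int  head;
    } Queue;

    static Queue lq, sq;

    /* Commit-time handling of the queue heads; returns true if the
     * memory op at the ROB head may commit this cycle. */
    bool commit_mem_op(bool is_load) {
        if (is_load) {
            if (!lq.completed[lq.head])      /* load not done yet: WAIT    */
                return false;
            lq.completed[lq.head] = false;   /* recycle the entry          */
            lq.head = (lq.head + 1) % QSIZE; /* move the LQ head           */
        } else {
            sq.completed[sq.head] = true;    /* store completes at commit; */
            sq.head = (sq.head + 1) % QSIZE; /* a separate retire pointer
                                                (omitted) later drains it
                                                to the D-cache             */
        }
        return true;
    }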

Page 16: Lecture 11: Memory Data Flow Techniques

Speculative Memory Disambiguation

A 1024-entry, 1-bit stWait table, indexed by the fetch PC, steers each renamed load into the int issue queue

• When a load is trapped at commit, set the stWait bit in the table, indexed by the load's PC
• When the load is fetched, read its stWait bit from the table
• If the bit is set, the load waits in the issue queue until older stores have issued
• The stWait table is cleared periodically
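A minimal sketch of the stWait mechanism; indexing by low PC bits is my assumption:

    #include <stdbool.h>
    #include <stdint.h>

    enum { TBL = 1024 };
    static bool stwait[TBL];     /* 1024 x 1-bit stWait table */

    static unsigned idx(uint64_t pc) { return (unsigned)((pc >> 2) % TBL); }

    /* Load trapped at commit: remember its PC. */
    void on_store_load_trap(uint64_t load_pc) { stwait[idx(load_pc)] = true; }

    /* At fetch: if set, the load waits in the issue queue until all
     * older stores have issued. */
    bool load_must_wait(uint64_t load_pc) { return stwait[idx(load_pc)]; }

    /* Cleared periodically so stale predictions decay. */
    void stwait_clear(void) {
        for (int i = 0; i < TBL; i++) stwait[i] = false;
    }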


Page 17: Lecture 11: Memory Data Flow Techniques


Architectural Memory States

Memory request: search the hierarchy from top to bottom

Hierarchy, top to bottom: completed LQ/SQ entries, L1-Cache, L2-Cache, L3-Cache (optional), Memory, Disk/Tape, etc.

[Figure: completed entries of the LQ/SQ sit on top of the cache hierarchy; the caches, memory, and disk below hold the committed (architectural) states]
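A sketch of that top-to-bottom search; the per-level lookup callbacks are purely illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    /* One lookup function per level; returns true on hit. Levels might
     * be: completed SQ entries, L1, L2, optional L3, memory. */
    typedef bool (*Level)(uint64_t addr, uint64_t *data);

    /* Search the hierarchy from top to bottom; the first hit wins. */
    bool mem_request(uint64_t addr, uint64_t *data,
                     const Level *levels, int n_levels) {
        for (int i = 0; i < n_levels; i++)
            if (levels[i](addr, data))      /* hit at this level          */
                return true;
        return false;                       /* miss everywhere: go to disk */
    }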

Page 18: Lecture 11: Memory Data Flow Techniques

Summary of Superscalar Execution

Instruction flow techniques: branch prediction, branch target prediction, and instruction prefetch

Register data flow techniques: register renaming, instruction scheduling, in-order commit, misprediction recovery

Memory data flow techniques: load/store units, memory consistency

Source: Shen & Lipasti