
Page 1: 1 Reducing DRAM Latencies with an Integrated Memory Hierarchy Design Authors Wei-fen Lin and Steven K. Reinhardt, University of Michigan Doug Burger, University


Reducing DRAM Latencies with an Integrated Memory Hierarchy Design

Authors

Wei-fen Lin and Steven K. Reinhardt, University of Michigan

Doug Burger, University of Texas at Austin

Presentation

by

Pravin Dalale

Page 2:

OUTLINE

• Motivation

• Main idea in the paper
  - Analysis
  - Main idea

• Prefetch engine
  - Insertion policy
  - Prefetch scheduling

• Results

• Conclusion

Page 3:

Motivation

Memory density and capacity have grown along with CPU power and complexity, but memory speed has not kept pace.

[Figure: relative performance (log scale, 1 to 10^6) vs. calendar year, 1980–2010; the Processor curve climbs far faster than the Memory curve]

Page 4:

Solutions

• Multithreading

• Multiple levels of caches

• Prefetching

Page 5:

OUTLINE

• Motivation

• Main idea in the paper
  - Analysis
  - Main idea

• Prefetch engine
  - Insertion policy
  - Prefetch scheduling

• Results

• Conclusion

Page 6:

Analysis (1)

• IPCReal – instructions per cycle with the real memory system

• IPCPerfectL2 – instructions per cycle with the real L1 cache but a perfect L2 cache

• IPCPerfectMem – instructions per cycle with both L1 and L2 caches perfect (an ideal memory system)

Page 7:

Analysis (2)

• Fraction of performance lost due to imperfect L1 and L2 =

(IPCPerfectMem – IPCReal) / IPCPerfectMem

• Fraction of performance lost due to imperfect L2 =

(IPCPerfectL2 – IPCReal) / IPCPerfectL2
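As a quick sanity check, the two fractions can be computed for hypothetical IPC values (the numbers below are illustrative only, not measurements from the paper):

```python
def lost_fraction(ipc_ideal, ipc_real):
    """Fraction of performance lost relative to an idealized configuration."""
    return (ipc_ideal - ipc_real) / ipc_ideal

ipc_real = 0.8         # real L1, L2, and DRAM
ipc_perfect_l2 = 1.6   # real L1, perfect L2
ipc_perfect_mem = 2.0  # perfect L1 and L2

print(lost_fraction(ipc_perfect_mem, ipc_real))  # lost to imperfect L1 and L2: 0.6
print(lost_fraction(ipc_perfect_l2, ipc_real))   # lost to imperfect L2 alone: 0.5
```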

Page 8:

Analysis (3)

Simulated system:
- 1.6 GHz out-of-order core
- 64 KB L1 cache
- 1 MB L2 cache
- Direct Rambus memory system with four 1.6 GB/s channels

The 26 SPEC benchmarks were run on this system to obtain IPCReal, IPCPerfectL2, and IPCPerfectMem.

Page 9:

Analysis (4)

The L2 stall fraction is 80% for the mcf benchmark.

The average stall fraction caused by L2 misses is 57% across the 26 SPEC CPU2000 benchmarks.

Page 10:

Main idea

• The paper describes a technique to reduce L2 miss latencies

• It introduces a prefetch engine that prefetches data into the L2 cache upon an L2 demand miss

Page 11:

OUTLINE

• Motivation

• Main idea in the paper
  - Analysis
  - Main idea

• Prefetch Engine
  - Insertion policy
  - Prefetch scheduling

• Results

• Conclusion

Page 12:

Prefetch Engine

[Figure: block diagram of the prefetch engine; its three numbered components are described on the next slide]

Page 13:

Prefetch Engine


1. The prefetch queue maintains a list of n region entries not present in the L2 cache.

2. The prefetch prioritizer uses the bank state and the region age to determine which prefetch to issue next.

3. The access prioritizer selects a prefetch only when there are no pending demand misses.
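A behavioral sketch of these three components (the class name, method names, and the bank-idle predicate are assumptions for illustration; the paper describes a hardware structure, not this interface):

```python
from collections import deque

class PrefetchEngine:
    def __init__(self, max_regions=4):
        # (1) Prefetch queue: region entries not in L2, oldest first.
        self.queue = deque(maxlen=max_regions)  # oldest entry falls off when full

    def note_demand_miss(self, region):
        """On an L2 demand miss, remember the surrounding region for prefetching."""
        if region not in self.queue:
            self.queue.append(region)

    def next_prefetch(self, bank_is_idle):
        """(2) Prefetch prioritizer: oldest region whose DRAM bank is idle."""
        for region in self.queue:          # queue order == region age
            if bank_is_idle(region):
                self.queue.remove(region)
                return region
        return None

    def schedule(self, demand_miss, bank_is_idle):
        """(3) Access prioritizer: demand misses always win; prefetch otherwise."""
        if demand_miss is not None:
            return ("demand", demand_miss)
        pf = self.next_prefetch(bank_is_idle)
        return ("prefetch", pf) if pf else None
```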

Page 14:

Insertion policy (1)

The prefetched block may be loaded into L2 with one of four priorities:

1. most-recently-used (MRU)

2. second-most-recently-used (SMRU)

3. second-least-recently-used (SLRU)

4. least-recently-used (LRU)

Page 15:

Insertion policy (2)

Benchmarks were divided into two classes:

1. High (above 20%) prefetch-accuracy benchmarks

2. Low (below 20%) prefetch-accuracy benchmarks

All benchmarks were tested under each of the four insertion policies.

The LRU insertion policy gives the best results in both categories.
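A minimal sketch of what the four policies mean for one 4-way set, modeled as a list ordered MRU to LRU (the position mapping MRU=0, SMRU=1, SLRU=2, LRU=3 is an assumption consistent with the policy names):

```python
# 4-way set modeled as a list ordered from MRU (index 0) to LRU (last index).
def insert_prefetch(cache_set, block, position, ways=4):
    """Insert a prefetched block at the given recency position,
    evicting the current LRU block if the set is full."""
    if len(cache_set) >= ways:
        cache_set.pop()  # evict the LRU block (last element)
    cache_set.insert(min(position, len(cache_set)), block)
    return cache_set

s = ["a", "b", "c", "d"]             # MRU ... LRU
insert_prefetch(s, "p", position=3)  # LRU insertion: "d" evicted
print(s)  # ['a', 'b', 'c', 'p'] -- "p" is first in line for replacement
```

Inserting at the LRU position means an inaccurate prefetch is itself the first candidate for eviction, which is consistent with LRU insertion winning even for low-accuracy benchmarks.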

Page 16:

Prefetch Scheduling

• Simple, aggressive prefetching can consume a large amount of bandwidth and cause channel contention

• This channel contention can be avoided by scheduling prefetch accesses only when the Rambus channels are idle
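The rule can be illustrated on a toy single-channel timeline (one access per cycle; real Direct Rambus timing is much more complex, so the cycle model here is purely illustrative):

```python
# Toy single-channel timeline: demand accesses own their cycles; prefetches
# are issued only in cycles the channel would otherwise leave idle.
def run(demand_cycles, n_prefetches, total_cycles):
    issued = []
    remaining = n_prefetches
    for t in range(total_cycles):
        if t in demand_cycles:
            issued.append((t, "demand"))    # demand traffic always wins
        elif remaining > 0:
            issued.append((t, "prefetch"))  # idle cycle: steal it for a prefetch
            remaining -= 1
    return issued

print(run(demand_cycles={0, 3}, n_prefetches=2, total_cycles=5))
# [(0, 'demand'), (1, 'prefetch'), (2, 'prefetch'), (3, 'demand')]
```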

Page 17:

OUTLINE

• Motivation

• Main idea in the paper
  - Analysis
  - Main idea

• Prefetch engine
  - Insertion policy
  - Prefetch scheduling

• Results

• Conclusion

Page 18:

Results (1/3) - Overall performance improvement

The performance with prefetching is very close to that of a perfect L2 cache

Page 19:

Results (2/3) - Sensitivity of prefetch scheme to DRAM latencies

•The base DRDRAM has a 40 ns latency and an 800 MHz data transfer rate

•If the latency is increased to 50 ns, the mean performance of the prefetch scheme drops by less than 1% relative to the base system

•If the latency is reduced to 34 ns, the mean performance of the prefetch scheme is again reduced, by less than 2%

Page 20:

Results (3/3) - Interaction with software prefetching

•When the prefetch scheme is coupled with software prefetching, none of the benchmarks improves significantly (at most 2%)

•Thus the proposed hardware prefetch scheme subsumes the benefits of software prefetching

Page 21:

OUTLINE

• Motivation

• Main idea in the paper
  - Analysis
  - Main idea

• Prefetch engine
  - Insertion policy
  - Prefetch scheduling

• Results

• Conclusion

Page 22:

Conclusions

•The authors proposed and evaluated a prefetch architecture integrated with the on-chip L2 cache

•The architecture aggressively prefetches large regions of data into L2 on demand misses

•By scheduling these prefetches only during idle cycles and inserting them into the cache with low replacement priority, a significant improvement is obtained in 10 of the 26 SPEC benchmarks

Page 23:

QUESTIONS