1/11/00, HPCA-6
2K papers on caches by Y2K: Do we need more?
Jean-Loup Baer, Dept. of Computer Science & Engineering, University of Washington


Page 1: 2K papers on caches by Y2K: Do we need more?

1/11/00 HPCA-6 1

2K papers on caches by Y2K: Do we need more?

Jean-Loup Baer

Dept. of Computer Science & Engineering

University of Washington

Page 2: 2K papers on caches by Y2K: Do we need more?


A little bit of history

• The Y0K problem

Page 3: 2K papers on caches by Y2K: Do we need more?


A little bit of history

• The Y0K problem
• The Y1K problem

Page 4: 2K papers on caches by Y2K: Do we need more?


A little bit of history

• The Y0K problem
• The Y1K problem

– For the French version: who was King of France in the year 1000?

Page 5: 2K papers on caches by Y2K: Do we need more?


Outline

• More history
• Anthology
• Challenges
• Conclusion

Page 6: 2K papers on caches by Y2K: Do we need more?


More history

• Caches introduced (commercially) more than 30 years ago in the IBM 360/85
– already a processor-memory gap

• Oblivious to the ISA
– caches were organization, not architecture

• Sector caches
– to minimize tag area

• Single level; off-chip

Page 7: 2K papers on caches by Y2K: Do we need more?


Terminology

• One of the original designers (Gibson) had first coined the name muffer

• When papers were submitted, the authors (Conti, Gibson, Liptay, Pitkovsky) used the term high-speed buffer

• The editor-in-chief of the IBM Systems Journal (R.L. Johnson) suggested a sexier name, namely cache, after consulting a thesaurus

Page 8: 2K papers on caches by Y2K: Do we need more?


Today

• Caches are ubiquitous
– On-chip, off-chip
– But also disk caches, web caches, trace caches, etc.

• Multilevel cache hierarchy
– With inclusion or exclusion

• Many different organizations
– direct-mapped, set-associative, skewed-associative, sector, decoupled sector, etc.
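For the simpler organizations above, the key difference is how an address is decomposed into tag, set index, and byte offset. A minimal sketch, with purely illustrative line size and set count (a direct-mapped cache is just the one-line-per-set case):

```python
def split_address(addr, line_size=32, num_sets=256):
    """Split a byte address into (tag, set index, offset).

    line_size and num_sets are illustrative powers of two, not any
    particular machine's geometry.
    """
    offset = addr % line_size                 # byte within the cache line
    index = (addr // line_size) % num_sets    # which set the line maps to
    tag = addr // (line_size * num_sets)      # disambiguates lines in a set
    return tag, index, offset
```

Two addresses that differ by exactly `line_size * num_sets` map to the same set with different tags, which is what makes conflict misses possible in a direct-mapped cache.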

Page 9: 2K papers on caches by Y2K: Do we need more?


Today (cont’d)

• Cache exposed to the ISA
– Prefetch, Fence, Purge, etc.

• Cache exposed to the compiler
– Code and data placement

• Cache exposed to the O.S.
– Page coloring

• Many different write policies
– copy-back, write-through, fetch-on-write, write-around, write-allocate, etc.

Page 10: 2K papers on caches by Y2K: Do we need more?


Today (cont’d)

• Numerous cache assists, for example:
– For storage: write buffers, victim caches, temporal/spatial caches

– For overlap: lock-up free caches

– For latency reduction: prefetch

– For better cache utilization: bypass mechanisms, dynamic line sizes

– etc ...

Page 11: 2K papers on caches by Y2K: Do we need more?


Caches and Parallelism

• Cache coherence
– Directory schemes

– Snoopy protocols

• Synchronization
– Test-and-test-and-set

– load linked -- store conditional

• Models of memory consistency
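The test-and-test-and-set idea above can be sketched in a few lines. Python has no user-level atomic instructions, so this simulation emulates the atomic exchange with a `threading.Lock`; the class and method names are illustrative, not a real runtime's API:

```python
import threading

class TTSSpinLock:
    """Illustrative test-and-test-and-set spinlock (a sketch).

    The point of the outer 'test' loop: spin on an ordinary read, which a
    real processor satisfies from its own cache, and only attempt the
    coherence-traffic-generating atomic exchange once the lock looks free.
    """
    def __init__(self):
        self.flag = 0
        self._atomic = threading.Lock()   # stands in for a hardware atomic

    def _test_and_set(self):
        # Emulate atomically exchanging 1 into flag, returning the old value.
        with self._atomic:
            old, self.flag = self.flag, 1
            return old

    def acquire(self):
        while True:
            while self.flag == 1:          # test: local spin, no bus traffic
                pass
            if self._test_and_set() == 0:  # test-and-set: one atomic attempt
                return

    def release(self):
        self.flag = 0                      # ordinary write frees the lock
```

In real hardware the outer read loop is what keeps a contended lock from saturating the bus with atomic operations; load-linked/store-conditional achieves the same effect without a dedicated atomic-exchange instruction.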

Page 12: 2K papers on caches by Y2K: Do we need more?


When were the 2K papers written?

• A few facts:
– 1980 textbook: < 10 pages on caches (2%)

– 1996 textbook: > 120 pages on caches (20%)

• Smith survey (1982)
– About 40 references on caches

• Uhlig and Mudge survey on trace-driven simulation (1997)
– About 25 references specific to cache performance only

– Many more on tools for performance etc.

Page 13: 2K papers on caches by Y2K: Do we need more?


Cache research vs. time

[Chart: % of ISCA papers dealing principally with caches, by year; y-axis 0-35%. Annotations mark the first session on caches and the year with the largest number of cache papers (14).]

Page 14: 2K papers on caches by Y2K: Do we need more?


Outline

• More history
• Anthology
• Challenges
• Conclusion

Page 15: 2K papers on caches by Y2K: Do we need more?


Some key papers - Cache Organization

• Conti (Computer 1969): direct-mapped (cf. “slave memory” and “tags” in Wilkes 1965), set-associativity
• Bell et al. (IEEE TC 1974): cache design for small machines (advocated unified caches; pipelining nullified that)
• Hill (Computer 1988): the case for direct-mapped caches (technology has made the case obsolete)
• Smith (Computing Surveys 1982): virtual vs. physical addressing (first cogent discussion)

Page 16: 2K papers on caches by Y2K: Do we need more?


Some key papers - Qualitative Properties

• Smith (Computing Surveys 1982): Spatial and temporal locality

• Hill (Ph.D. 1987): The three C’s
• Baer and Wang (ISCA 1988): Multi-level inclusion

Page 17: 2K papers on caches by Y2K: Do we need more?


Some key papers - Cache Evaluation Methodology

• Belady (IBM Systems J. 1966): MIN and OPT
• Mattson et al. (IBM Systems J. 1970): The “stack” property
• Trace collection:

– Hardware: Clark (ACM TOCS 1983)

– Microcode: Agarwal, Sites and Horowitz (ISCA 1986): ATUM

– Software: M. Smith (1991): Pixie

– Very long traces: Borg, Kessler and Wall (ISCA 1990)

Page 18: 2K papers on caches by Y2K: Do we need more?


Some key papers - Cache Performance

• Kaplan and Winder (Computer 1973): 8 to 16K caches with block sizes of 64 to 128 bytes and set-associativity 2 or 4 will yield hit ratios of over 95%

• Strecker (ISCA 1976): Design of the PDP-11/70 -- 2KB, 2-way set-associative, 4-byte (2-word) block size

• Smith (Computing Surveys 1982): Most comprehensive study of the time: prefetching, replacement, associativity, line size, etc.

• Przybylski et al. (ISCA 1988): Comprehensive study 6 years later

• Woo et al. (ISCA 1995): Splash-2

Page 19: 2K papers on caches by Y2K: Do we need more?


Some key papers - Cache Assists

• IBM ??: Write buffers
• Gindele (IBM TD Bull 1977): OBL prefetch (OBL coined by Smith?)
• Kroft (ISCA 1981): Lock-up free caches
• Jouppi (ISCA 1990): Victim caches; stream buffers
• Pettis and Hansen (PLDI 1990): Code placement

Page 20: 2K papers on caches by Y2K: Do we need more?


Some key papers - Cache Coherence

• Censier and Feautrier (IEEE TC 1978): Directory scheme

• Goodman (ISCA 1983): The first snoopy protocol
• Archibald and Baer (TOCS 1986): Snoopy terminology
• Dubois, Scheurich and Briggs (ISCA 1986): Memory consistency

Page 21: 2K papers on caches by Y2K: Do we need more?


Outline

• More history
• Anthology
• Challenges
• Conclusion

Page 22: 2K papers on caches by Y2K: Do we need more?


Caches are great. Yes … but

• Caches are poorly utilized
– Lots of dead lines (only 20% efficiency - Burger et al. 1995)

– Squandering of memory bandwidth

• The “memory wall”
– At the limit, it will take longer to load a program on-chip than to execute it (Wulf and McKee 1995)

Page 23: 2K papers on caches by Y2K: Do we need more?


Solution Paradigms

• Revolution
• Evolution
• Enhancements

Page 24: 2K papers on caches by Y2K: Do we need more?


Revolution

Page 25: 2K papers on caches by Y2K: Do we need more?


Evolution (processor in memory; application specific)

• IRAM (Patterson et al. 1997)
– Vector processor; data stream apps; low power
• FlexRAM (Torrellas et al. 1999)
– Memory chip = simple multiprocessor + superscalar + banks of DRAM; memory-intensive apps
• Active Pages (Chong et al. 1998)
– Co-processor paradigm; reconfigurable logic in memory; apps such as scatter-gather
• FBRAM (Deering et al. 1994)
– Graphics in memory

Page 26: 2K papers on caches by Y2K: Do we need more?


Enhancements

• Hardware and software cache assists
– Examples: “hardware tables”; most common case resolved in hardware, less common case in software
• Use real estate on-chip to provide intelligence for managing the on-chip and off-chip hierarchy
– Examples: memory controller, prefetch engines for L2 on the processor chip

Page 27: 2K papers on caches by Y2K: Do we need more?


General Approach

• Identify a cache parameter/enhancement whose tuning will lead to better performance

• Assess potential margin of improvement
• Propose and design an assist
• Measure efficiency of the scheme

Page 28: 2K papers on caches by Y2K: Do we need more?


Identify a cache parameter/enhancement

• The creative part!
• Our current projects

– Dynamic line sizes

– Modified LRU policies using detection of temporal locality

– Prefetching in L2

Page 29: 2K papers on caches by Y2K: Do we need more?


Assess potential margin of improvement

• Metrics?
– Miss rate; bandwidth; average memory access time

– Weighted combination of some of the above

– Execution time

• Compare to optimal (off-line) algorithm
– “Easy” for replacement algorithms
– “OK” for some other metrics (e.g., cost of a cache miss depending on line size; oracle for prefetching)

– Hard for execution time
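The average-memory-access-time metric listed above composes recursively: each level's miss penalty is the AMAT of the next level down. A minimal sketch; every latency and miss rate here is purely illustrative:

```python
def amat(hit_time_l1, miss_rate_l1, hit_time_l2, miss_rate_l2, mem_latency):
    """Average memory access time (in cycles) for a two-level hierarchy.

    Recursive form: AMAT(L1) = hit_time(L1) + miss_rate(L1) * AMAT(L2),
    where AMAT(L2) folds in the DRAM latency. Parameters are illustrative.
    """
    l2_amat = hit_time_l2 + miss_rate_l2 * mem_latency
    return hit_time_l1 + miss_rate_l1 * l2_amat

# e.g. a 1-cycle L1 missing 5%, a 10-cycle L2 missing 20%, 100-cycle DRAM:
# amat(1, 0.05, 10, 0.20, 100) = 1 + 0.05 * (10 + 0.20 * 100) = 2.5 cycles
```

This is also why the metric is “relaxed”: it averages away bursts and overlap, which is exactly what makes it unreliable close to the processor.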

Page 30: 2K papers on caches by Y2K: Do we need more?


Measure efficiency of the scheme

• Same problem: metrics?
• The further from the processor, the more “relaxed” the metric
– For L1-L2, you need to see impact on execution speed
– For L2-DRAM, you can get away with average memory access time

Page 31: 2K papers on caches by Y2K: Do we need more?


Anatomy of a Predictor

[Diagram: Execution → Event selection → Prediction index → Prediction mechanism → Feedback, plus a Recovery? stage]

Page 32: 2K papers on caches by Y2K: Do we need more?


Anatomy of a Cache Predictor

[Same diagram, without the Recovery stage]

Page 33: 2K papers on caches by Y2K: Do we need more?


Anatomy of a Cache Predictor

[Diagram: Event selection becomes a Prediction trigger - a load/store or a cache miss]

Page 34: 2K papers on caches by Y2K: Do we need more?


Anatomy of a Cache Predictor

[Diagram: Prediction index - PC; EA; global/local history]

Page 35: 2K papers on caches by Y2K: Do we need more?


Anatomy of a Cache Predictor

[Diagram: predictor structures - one-level table, two-level tables, associative buffers, specialized caches]

Page 36: 2K papers on caches by Y2K: Do we need more?


Anatomy of a Cache Predictor

[Diagram: prediction mechanisms - counters, stride predictors, finite-context, Markov predictors]
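Of the mechanisms listed, a stride predictor is the easiest to sketch. This is a minimal PC-indexed version; the table layout and names are illustrative, and real hardware adds confidence counters and a bounded, associative table:

```python
class StridePredictor:
    """Minimal PC-indexed stride predictor (an illustrative sketch).

    Per PC, remember the last address and last stride; once the same
    stride is seen twice in a row, predict next = last_address + stride.
    """
    def __init__(self):
        self.table = {}  # pc -> (last_addr, stride, confident)

    def observe(self, pc, addr):
        """Record an access; return the predicted next address, or None."""
        last_addr, stride, confident = self.table.get(pc, (None, 0, False))
        if last_addr is None:
            self.table[pc] = (addr, 0, False)   # first sighting of this PC
            return None
        new_stride = addr - last_addr
        confident = (new_stride == stride)      # stride repeated?
        self.table[pc] = (addr, new_stride, confident)
        return addr + new_stride if confident else None
```

A load walking an array of 4-byte elements becomes predictable after its third access, which is exactly the behavior a stream buffer or L2 prefetch engine wants to exploit.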

Page 37: 2K papers on caches by Y2K: Do we need more?


Anatomy of a Cache Predictor

[Diagram: Feedback - often imprecise]

Page 38: 2K papers on caches by Y2K: Do we need more?


Applying the Model

• Modified LRU policies for L2 caches
• Identify a cache parameter

– L2 cache miss rate

Page 39: 2K papers on caches by Y2K: Do we need more?


Applying the Model

• Modified LRU policies for L2 caches
• Identify a cache parameter
• Assess potential margin of improvement

– OPT vs. LRU
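The OPT-vs-LRU comparison can be run on a block-reference trace with two small simulators. A sketch for a fully associative cache, assuming `trace` is just a list of block identifiers; this is an illustration of the method, not the talk's actual experiment:

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Miss count for a fully associative cache under LRU replacement."""
    cache, misses = OrderedDict(), 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)        # mark as most recently used
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[block] = True
    return misses

def opt_misses(trace, capacity):
    """Miss count under Belady's OPT: evict the block reused furthest ahead."""
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            def next_use(b):
                # distance to the block's next reference (inf if never reused)
                for j in range(i + 1, len(trace)):
                    if trace[j] == b:
                        return j
                return float('inf')
            cache.discard(max(cache, key=next_use))
        cache.add(block)
    return misses
```

On the classic cyclic trace `[1, 2, 3, 4, 1, 2, 3, 4]` with capacity 3, LRU misses on every reference while OPT does not, which is the kind of margin a modified replacement policy tries to recover.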

Page 40: 2K papers on caches by Y2K: Do we need more?


Applying the Model

• Modified LRU policies for L2 caches
• Identify a cache parameter
• Assess potential margin of improvement
• Propose a design

– On-line detection of lines exhibiting temporal locality

Page 41: 2K papers on caches by Y2K: Do we need more?


Propose a Design

[Diagram: the predictor template instantiated - trigger: L1 cache miss; index: EA, PC; mechanism: metadata in L2, Locality Table; feedback: LRU stack + locality bit]

Page 42: 2K papers on caches by Y2K: Do we need more?


Applying the Model

• Modified LRU policies for L2 caches
• Identify a cache parameter
• Assess potential margin of improvement
• Propose a design
• Measure efficiency of the scheme

– How much of the margin of improvement was reduced (i.e., compare with OPT and LRU)

Page 43: 2K papers on caches by Y2K: Do we need more?


Conclusion

• Do we need more?
• “We need substantive research on the design of memory hierarchies that reduce or hide access latencies while they deliver the memory bandwidths required by current and future applications” - PITAC Report, Feb. 1999

Page 44: 2K papers on caches by Y2K: Do we need more?


Possible important areas of research

• L2-DRAM interface
– Prefetching
• Better cache utilization
– Data placement
• Caches for low-power design
• Caches for real-time systems

Page 45: 2K papers on caches by Y2K: Do we need more?


With many thanks to

• Jim Archibald

• Wen-Hann Wang

• Sang Lyul Min

• Rick Zucker

• Tien-Fu Chen

• Craig Anderson

• Xiaohan Qin

• Dennis Lee

• Peter Vanvleet

• Wayne Wong

• Patrick Crowley

Page 46: 2K papers on caches by Y2K: Do we need more?


– For the French version: who was King of France in the year 1000?

– Robert II the Pious, eldest son of Hugues Capet