
Page 1: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Achieving Non-Inclusive Cache Performancewith Inclusive Caches

Temporal Locality Aware (TLA) Cache Management Policies

Aamer Jaleel, Eric Borch, Malini Bhandaru,Simon Steely Jr., Joel Emer

In International Symposium on Microarchitecture (MICRO), December 2010

Presented by: Yingying Tian

Page 2: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


High Performing Cache Hierarchy in CMPs

• Cache hierarchy: multiple interacting caches on chip
• Tradeoff between cache latency and hit rate
• Chip multiprocessors (CMPs) further widen the gap between processor and memory speeds

Goal: an efficient and high-performing cache hierarchy

Page 3: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Key issue: Inclusion or Not?

*Some materials are taken from the original presentation slides.

Size of the cache hierarchy vs. simplicity of cache coherence

Page 4: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Inclusive Caches

• Simplify cache coherence
• Waste cache capacity (effective capacity = size of the LLC)
• The inclusion property invalidates blocks that still have high temporal locality in the core caches (the back-invalidation problem); each resulting miss pays a memory access penalty of hundreds of cycles

Page 5: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Back-Invalidate Problem

• Inclusion property: all higher-level (core) caches must be a subset of the last-level cache (LLC).
• Back-invalidation: when a block is evicted from the LLC, inclusion is enforced by invalidating that block from all caches in the hierarchy; the invalidated block is an inclusion victim.
• Because the small core caches filter temporal locality, an inclusion victim can still have high temporal locality in a core cache: a hot inclusion victim.

Page 6: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Back-Invalidate Problem (Cont.)

• Consider the following access pattern in a 2-level inclusive cache hierarchy (2-way L1, 4-way L2, both LRU, shown MRU to LRU): ... a, b, a, c, a, d, a, e, a, f ...
• Repeated hits keep 'a' at MRU in L1, but those hits never reach L2, so 'a' drifts toward LRU in L2. Just before 'e' is referenced, L1 = {a, d} and L2 = {d, c, b, a}.
• The reference to 'e' misses in both levels, evicts 'a' from L2, and back-invalidates 'a' from L1, leaving L1 = {e, d} and L2 = {e, d, c, b}.
• The next reference to 'a' misses, even though 'a' still has high temporal locality in L1. A small simulation of this behavior is sketched below.
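To make the animation concrete, here is a minimal Python sketch of a 2-level inclusive LRU hierarchy under stated assumptions (2-way L1, 4-way L2, write-backs and coherence ignored); it is illustrative, not the paper's CMP$im model. Replaying the pattern shows the final reference to 'a' missing once 'e' triggers the back-invalidation.

```python
# Minimal sketch (not the paper's CMP$im setup): a 2-level inclusive LRU
# hierarchy showing how back-invalidation evicts a block that is hot in L1.
from collections import OrderedDict

class LRUCache:
    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()            # most-recently-used entry is last

    def lookup(self, block):
        if block in self.lines:
            self.lines.move_to_end(block)     # promote to MRU
            return True
        return False

    def insert(self, block):
        """Insert block at MRU; return the evicted LRU victim, if any."""
        victim = None
        if len(self.lines) >= self.ways:
            victim, _ = self.lines.popitem(last=False)
        self.lines[block] = True
        return victim

    def invalidate(self, block):
        self.lines.pop(block, None)

def access(l1, l2, block):
    """Return 'L1 hit', 'L2 hit', or 'miss' for one reference."""
    if l1.lookup(block):
        return "L1 hit"                       # note: the hit does NOT update L2's LRU state
    if l2.lookup(block):
        l1.insert(block)
        return "L2 hit"
    victim = l2.insert(block)                 # miss: fill L2, evicting its LRU block
    if victim is not None:
        l1.invalidate(victim)                 # inclusion: back-invalidate the L2 victim from L1
    l1.insert(block)
    return "miss"

l1, l2 = LRUCache(2), LRUCache(4)
for blk in ["a", "b", "a", "c", "a", "d", "a", "e", "a"]:
    print(blk, access(l1, l2, blk))
# The reference to 'e' evicts 'a' (the LRU block in L2) and back-invalidates
# it from L1, so the final reference to 'a' misses despite being hot in L1.
```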

Page 7: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Back-Invalidate Problem (Cont.)

• Intel Core i7: 1:8 cache ratio, inclusive LLC
• AMD Phenom II: 1:4 cache ratio, non-inclusive LLC

Page 8: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Temporal Locality Aware (TLA) Cache Management Policies

• Goal: implement an efficient and high-performing cache hierarchy
• Approach: eliminate hot inclusion victims to improve inclusive cache performance

Page 9: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Outline

• Background and motivation
• Problem description
• Temporal Locality Aware (TLA) Cache Management Policy Suite
• Evaluation
• Conclusion

Page 10: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Three Temporal Locality Aware (TLA) Cache Management Policies:
• Temporal Locality Hints (TLH)
• Early Core Invalidation (ECI)
• Query Based Selection (QBS)

Page 11: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Temporal Locality Hints (TLH)

• Conveys the temporal locality of hot blocks in the core caches by sending a hint to the LLC on each core-cache hit, updating that block's replacement state in the LLC
• Significantly reduces the number of inclusion victims
• But the number of hint requests to the LLC is extremely large and does not scale well with increasing core counts (even with filter optimizations)
• Serves as a limit study (a sketch of the hint path follows below)
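A minimal sketch of the TLH hint path, assuming caches modeled as plain Python lists ordered MRU to LRU (the function and variable names are illustrative, not from the paper): on an L1 hit, the hint also promotes the block in the LLC so it stops aging toward eviction.

```python
# Minimal sketch of the TLH idea (assumed structures, not the paper's code):
# on every core-cache (L1) hit, a hint makes the LLC promote the block too.
# Caches are plain lists ordered MRU -> LRU.

def promote(stack, block):
    """Move block to the MRU position of an LRU stack."""
    stack.remove(block)
    stack.insert(0, block)

def l1_access_with_tlh(l1, llc, block):
    if block in l1:
        promote(l1, block)          # normal L1 hit handling
        if block in llc:
            promote(llc, block)     # TLH: the hint updates the LLC replacement state as well
        return True
    return False

l1  = ["a", "d"]                    # MRU -> LRU
llc = ["d", "c", "b", "a"]          # without the hint, 'a' would be the next LLC victim
l1_access_with_tlh(l1, llc, "a")
print(llc)                          # ['a', 'd', 'c', 'b'] -- 'a' is no longer the LRU victim
```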

Page 12: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Early Core Invalidation (ECI)

• Derives the temporal locality of a block before it becomes the LRU block in the LLC
• On each LLC miss, the LLC chooses the block at the LRU-1 position and invalidates it in the core caches while keeping it in the LLC
• By observing whether the core subsequently re-requests the block, the LLC derives its temporal locality

Page 13: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Early Core Invalidation (ECI) Cont.

• The early-invalidated block is called the ECI block
• If the ECI block is hot in some core cache, that core re-requests it: an L1 miss but an LLC hit, so the block moves back to MRU in the LLC and keeps its temporal locality
• If the ECI block is not hot (not re-requested, or re-requested only after a long time), it is evicted from the LLC on the next miss to that set
• Lower-traffic solution than TLH (the number of LLC misses is much smaller than the number of core-cache hits)
• But the prediction is low-accuracy: it assumes a hot ECI block is re-requested quickly. What if the ECI block is hot, but not that hot? A sketch of the mechanism follows below.
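A minimal sketch of the ECI mechanism, again using plain lists as LRU stacks (the state and names are illustrative assumptions, not the paper's implementation): an LLC miss evicts the LRU block and early-invalidates the next victim from the core caches; a later re-request hits in the LLC and restores that block to MRU.

```python
# Minimal sketch of the ECI mechanism (illustrative state and names, not the
# paper's implementation). Caches are plain lists ordered MRU -> LRU.

def promote(stack, block):
    """Move block to the MRU position of an LRU stack."""
    stack.remove(block)
    stack.insert(0, block)

def llc_miss_with_eci(llc, core_caches, new_block):
    victim = llc.pop()                      # evict the LRU block (back-invalidate as usual)
    for cache in core_caches:
        if victim in cache:
            cache.remove(victim)
    eci_block = llc[-1]                     # the next victim: early-invalidate it from the cores,
    for cache in core_caches:               # but keep it in the LLC
        if eci_block in cache:
            cache.remove(eci_block)
    llc.insert(0, new_block)                # fill the missing block at MRU
    return victim, eci_block

def core_access(l1, llc, block):
    if block in l1:
        promote(l1, block)
        return "L1 hit"
    if block in llc:                        # a hot ECI block reappears as an LLC hit...
        promote(llc, block)                 # ...so the LLC restores it to MRU
        l1.insert(0, block)
        return "LLC hit (ECI block promoted)"
    return "miss"

l1  = ["x", "y"]                            # MRU -> LRU
llc = ["w", "x", "y", "z"]
print(llc_miss_with_eci(llc, [l1], "v"))    # ('z', 'y'): 'z' evicted, 'y' early-invalidated from L1
print(core_access(l1, llc, "y"))            # 'LLC hit (ECI block promoted)': 'y' was hot after all
print(llc)                                  # ['y', 'v', 'w', 'x']
```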

Page 14: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Query Based Selection (QBS)

• Infers the temporal locality of a block in the LLC by querying the core caches on each LLC miss
• The LLC selects a replacement candidate and queries the core caches to check whether that block is present in any of them
• Only a block that is not present in any core cache is replaced
• If the candidate is present in some core cache, the LLC updates its replacement state to MRU, then selects and queries another candidate

Page 15: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Query Based Selection (QBS) Cont.

• The QBS victim-selection process is hidden by the memory latency of the miss
• The cache controller can limit the number of queries issued on an LLC miss; based on the experiments, sending 2 queries is sufficient to achieve the performance benefits
• Performs similarly to a non-inclusive cache hierarchy
• The on-chip communication overhead is extremely large [presenter's note: not mentioned in the paper]
• A minimal sketch of the selection loop follows below
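A minimal sketch of the QBS victim-selection loop (illustrative names and data structures; the query limit of 2 is the value the slides report as sufficient, while the fallback when the query budget is exhausted is an assumption, since the slides only say the number of queries can be limited).

```python
# Minimal sketch of QBS victim selection (illustrative, not the paper's
# implementation). Caches are plain lists ordered MRU -> LRU.

def promote(stack, block):
    """Move block to the MRU position of an LRU stack."""
    stack.remove(block)
    stack.insert(0, block)

def qbs_select_victim(llc, core_caches, max_queries=2):
    """Pick an LLC victim that no core cache currently holds."""
    for _ in range(max_queries):
        candidate = llc[-1]                            # current LRU candidate
        if any(candidate in cache for cache in core_caches):
            promote(llc, candidate)                    # still cached by a core: keep it, refresh to MRU
        else:
            return candidate                           # safe to evict without back-invalidation
    return llc[-1]   # assumption: fall back to the current LRU once the query budget is spent

l1_core0 = ["a", "d"]                                  # MRU -> LRU
l1_core1 = ["p", "q"]
llc      = ["d", "c", "b", "a"]                        # 'a' is the LRU candidate but is hot in core 0

victim = qbs_select_victim(llc, [l1_core0, l1_core1])
print(victim, llc)          # b ['a', 'd', 'c', 'b'] -- 'a' was promoted instead of evicted
llc.remove(victim)
llc.insert(0, "e")          # evict 'b' and fill the missing block 'e'
print(llc)                  # ['e', 'a', 'd', 'c']
```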

Page 16: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


An example ( ... a, b, a, c, a, d, a, e, a, f, a, ... )

Page 17: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Experimental Methodology

• CMP$im: x86 simulator
• Baseline: 2-core CMP, 3-level inclusive cache hierarchy
  • L1 I/D: 4-way, 32KB, 64B block size, 1-cycle access latency
  • L2 (MLC): 8-way, 256KB, 64B block size, 10-cycle access latency, non-inclusive
  • L3 (LLC): shared, 16-way, 2MB, 24-cycle access latency, inclusion enforced
  • Main memory: 150-cycle access latency
• Benchmarks: 15 benchmarks selected from the SPEC CPU2006 suite based on program behavior (core-cache fitting, LLC fitting, LLC thrashing; 5 of each)
• Workloads: all 105 2-core combinations (15 choose 2)

Page 18: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Performance

[Three plots: relative performance across the 2-core workloads for TLH-L1, ECI, and QBS, each compared against a non-inclusive hierarchy. Gains over the inclusive baseline: TLH-L1 5.2%, ECI 3.4%, QBS 6.6%; non-inclusive 6.1%.]

Page 19: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Performance (Cont.)

[Plot: relative performance of TLH-L1, QBS, non-inclusive, and exclusive hierarchies as the MLC:LLC size ratio varies from 1:2 to 1:16.]

QBS performs similarly to non-inclusive caches for all cache ratios.

Page 20: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Performance (Cont.)

[Plot: relative performance of QBS, non-inclusive, and exclusive hierarchies.]

Scalability of QBS in 2-core, 4-core, and 8-core CMPs (1:4 cache size ratio)

Page 21: Aamer Jaleel, Eric Borch, Malini Bhandaru, Simon Steely Jr., Joel Emer


Conclusion

• Temporal Locality Aware (TLA) cache management
• Retains the benefits of inclusion while minimizing the back-invalidation problem
• A TLA-managed inclusive cache achieves the performance of a non-inclusive cache

Thanks! Questions?