Optimizing Shared Caches in Chip Multiprocessors Samir Sapra Athula Balachandran Ravishankar Krishnaswamy


Page 1

Optimizing Shared Caches in Chip Multiprocessors

Samir Sapra, Athula Balachandran, Ravishankar Krishnaswamy

Page 2

Core 2 Duo die

“Just a few years ago, the idea of putting multiple processors on a chip was farfetched. Now it is accepted and commonplace, and virtually every new high performance processor is a chip multiprocessor of some sort…”

Center for Electronic System Design, Univ. of California Berkeley

Chip Multiprocessors??

“Mowry is working on the development of single-chip multiprocessors: one large chip capable of performing multiple operations at once, using similar techniques to maximize performance”

-- Technology Review, 1999

Sony's Playstation 3, 2006

Page 3

CMP Caches: Design Space

• Architecture
  – Placement of Cache/Processors
  – Interconnects/Routing

• Cache Organization & Management
  – Private/Shared/Hybrid
  – Fully Hardware/OS Interface

“L2 is the last line of defense before hitting the memory wall, and is the focus of our talk”

Page 4

Private L2 Cache

[Figure: each processor has its own L1 (I$ and D$) and a private L2 $; the L2 caches connect through an interconnect with a coherence protocol to off-chip memory]

+ Less interconnect traffic
+ Insulates L2 units
+ Lower hit latency

– Duplication
– Load imbalance
– Complexity of coherence
– Higher miss rate

Page 5

Shared-Interleaved L2 Cache

[Figure: processors with private L1 (I$ and D$) connect through an interconnect with a coherence protocol to a shared L2]

– Interconnect traffic
– Interference between cores
– Hit latency is higher

+ No duplication
+ Balance the load
+ Lower miss rate
+ Simplicity of coherence
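As a rough illustration of what "interleaved" means for a shared L2, the sketch below picks a home bank for an address by taking the block number modulo the number of banks. The block size, the bank count, and the name `home_bank` are assumptions made for this example; the slides do not give them.

```python
BLOCK_SIZE = 64   # bytes per cache block (assumed value)
NUM_BANKS = 4     # number of shared L2 banks (assumed value)


def home_bank(addr: int, block_size: int = BLOCK_SIZE,
              num_banks: int = NUM_BANKS) -> int:
    """Return the index of the L2 bank that holds the block containing addr."""
    block_number = addr // block_size
    # Consecutive blocks land in consecutive banks, which spreads the load.
    return block_number % num_banks


if __name__ == "__main__":
    for addr in (0x0000, 0x0040, 0x0080, 0x00C0, 0x0100):
        print(hex(addr), "-> bank", home_bank(addr))
```

Because the bank depends only on the address, any core may have to cross the interconnect to reach a block, which is one way to read the "hit latency is higher" point.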

Page 6

Take Home Message

• Leverage on-chip access time

Page 7

Take Home Messages

• Leverage on-chip access time
• Better sharing of cache resources
• Isolating performance of processors
• Place data on the chip close to where it is used
• Minimize inter-processor misses (in shared cache)
• Fairness towards processors

Page 8

On to some solutions…

Jichuan Chang and Gurindar S. Sohi. Cooperative Caching for Chip Multiprocessors. International Symposium on Computer Architecture, 2006.

Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. International Symposium on Computer Architecture, 2009.

Shekhar Srikantaiah, Mahmut Kandemir, and Mary Jane Irwin. Adaptive Set-Pinning: Managing Shared Caches in Chip Multiprocessors. Architectural Support for Programming Languages and Operating Systems, 2008.

each handles this problem in a different way

Page 9

Co-operative Caching (Chang & Sohi)

• Private L2 caches
• Attract data locally to reduce remote on-chip access. Lowers average on-chip misses.
• Co-operation among the private caches for efficient use of resources on the chip.
• Controlling the extent of co-operation to suit the dynamic workload behavior

Page 10

CC Techniques

• Cache to cache transfer of clean data
  – In case of a miss, transfer “clean” blocks from another L2 cache.
  – This is useful in the case of “read only” data (instructions).

• Replication aware data replacement
  – Singlet/Replicate.
  – Evict singlet only when no replicates exist.
  – Singlets can be “spilled” to other cache banks.

• Global replacement of inactive data
  – Global management needed for managing “spilling”.
  – N-Chance Forwarding.
  – Set recirculation count to N when spilled.
  – Decrease the count by 1 each time the block is spilled again; once it reaches 0, the block is no longer forwarded.
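The spilling rule above can be sketched in a few lines. This is a minimal model under stated assumptions, not the mechanism as implemented in the paper: each private L2 is a tiny fully associative LRU cache, the receiving peer is simply the next cache in index order (a deterministic choice made for this example), and a block displaced by an incoming spill is dropped rather than spilled in turn. The names `PrivateCache` and `spill` are hypothetical.

```python
from collections import OrderedDict


class PrivateCache:
    """A tiny fully associative LRU cache standing in for one private L2."""

    def __init__(self, capacity):
        self.capacity = capacity
        # tag -> recirculation count (None means the block was never spilled)
        self.blocks = OrderedDict()

    def insert(self, tag, count=None):
        """Insert a block; return the evicted (tag, count) pair, or None."""
        self.blocks[tag] = count
        self.blocks.move_to_end(tag)
        if len(self.blocks) > self.capacity:
            return self.blocks.popitem(last=False)  # evict the LRU entry
        return None


def spill(caches, src, tag, count, n):
    """Apply N-chance forwarding to a block just evicted from caches[src].

    Returns True if the block stays on chip, False if it is dropped.
    """
    # Only singlets (no other on-chip copy) are spilled.
    if any(tag in c.blocks for i, c in enumerate(caches) if i != src):
        return False
    # First spill: start at N. Later spills: use up one chance.
    count = n if count is None else count - 1
    if count <= 0:
        return False
    dst = (src + 1) % len(caches)    # assumed peer choice: next cache in order
    caches[dst].insert(tag, count)   # any block displaced here is dropped
    return True


if __name__ == "__main__":
    caches = [PrivateCache(2) for _ in range(3)]
    caches[0].insert("A")
    caches[0].insert("B")
    victim = caches[0].insert("C")            # evicts ("A", None)
    print(spill(caches, 0, *victim, n=1))     # A moves to cache 1
    print(dict(caches[1].blocks))
```

With N = 1, a block gets exactly one extra stay in a peer cache; the next time it is evicted its count reaches 0 and it leaves the chip.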

Page 11

Set “Pinning” -- Setup

[Figure: processors P1 to P4, each with an L1 cache, connect through an interconnect to a shared L2 cache with sets 0 to S-1, backed by main memory]

Page 12

Set “Pinning” -- Problem

[Figure: processors P1 to P4 and sets 0 to S-1 of the shared L2 cache, backed by main memory]

Page 13

Set “Pinning” -- Types of Cache Misses

• Compulsory (aka Cold)
• Capacity
• Conflict
• Coherence

versus

• Compulsory
• Inter-processor
• Intra-processor
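To make the second classification concrete, here is a small sketch that replays a reference trace against a single shared LRU set and labels each miss. It assumes the usual reading of the terms, which the slide itself does not spell out: a miss is intra-processor if the block was last evicted by a reference from the same processor, and inter-processor if a different processor's reference evicted it. The function name `classify_misses` and the trace format are hypothetical.

```python
from collections import Counter, OrderedDict


def classify_misses(trace, capacity):
    """Classify references to one shared LRU set.

    trace: iterable of (processor_id, block_tag) pairs.
    Returns a Counter with keys "hit", "compulsory",
    "intra-processor" and "inter-processor".
    """
    cache = OrderedDict()   # resident tags, least recently used first
    evicted_by = {}         # tag -> processor whose reference evicted it
    counts = Counter()
    for proc, tag in trace:
        if tag in cache:
            cache.move_to_end(tag)
            counts["hit"] += 1
            continue
        if tag not in evicted_by:
            counts["compulsory"] += 1        # first reference to this block
        elif evicted_by[tag] == proc:
            counts["intra-processor"] += 1   # this processor evicted it
        else:
            counts["inter-processor"] += 1   # another processor evicted it
        if len(cache) >= capacity:
            victim, _ = cache.popitem(last=False)
            evicted_by[victim] = proc
        cache[tag] = proc
    return counts


if __name__ == "__main__":
    trace = [(1, "A"), (2, "B"), (2, "C"), (1, "A"), (1, "B"), (1, "C")]
    print(classify_misses(trace, capacity=2))
```

In this trace, processor 2 pushes block A out before processor 1 reuses it, so that miss is charged as inter-processor; the later misses on B and C are caused by processor 1's own references.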

Page 14

[Figure: processors P1 to P4, each paired with a POP cache (POP 1 to POP 4), in front of the L2 sets and main memory; each cache entry holds Owner, Other bits, and Data fields]

Page 15

R-NUCA: Use Class-Based Strategies

Solve for the common case!

Most current (and future) programs have the following types of accesses:
1. Instruction Access – Shared, but Read-Only
2. Private Data Access – Read-Write, but not Shared
3. Shared Data Access – Read-Write (or) Read-Only, but Shared.

Page 16

R-NUCA: Can do this online!

• We have information from the OS and TLB
• For each memory block, classify it as
  – Instruction
  – Private Data
  – Shared Data
• Handle them differently
  – Replicate instructions
  – Keep private data locally
  – Keep shared data globally
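One way to picture the online classification is a small table keyed by page number. The sketch below is an assumption-laden illustration, not the paper's mechanism: it classifies at page granularity, treats any instruction fetch as marking the page "instruction", marks a data page private to the first core that touches it, and reclassifies it as shared once a different core accesses it. `PageClass`, `PageClassifier`, and `PLACEMENT` are hypothetical names.

```python
from enum import Enum


class PageClass(Enum):
    INSTRUCTION = "instruction"
    PRIVATE = "private data"
    SHARED = "shared data"


# Placement policy per class, taken from the bullet list above.
PLACEMENT = {
    PageClass.INSTRUCTION: "replicate",
    PageClass.PRIVATE: "keep locally",
    PageClass.SHARED: "keep globally",
}


class PageClassifier:
    """Tracks the class of each page, as an OS might do on TLB misses."""

    def __init__(self):
        self.table = {}   # page number -> (PageClass, owner core or None)

    def access(self, page, core, is_ifetch):
        """Record an access and return the page's current class."""
        if is_ifetch:
            self.table[page] = (PageClass.INSTRUCTION, None)
        elif page not in self.table:
            # First data access: assume the page is private to this core.
            self.table[page] = (PageClass.PRIVATE, core)
        else:
            cls, owner = self.table[page]
            if cls is PageClass.PRIVATE and owner != core:
                # A second core touched the page: it is shared from now on.
                self.table[page] = (PageClass.SHARED, None)
        return self.table[page][0]


if __name__ == "__main__":
    classifier = PageClassifier()
    for page, core, ifetch in [(10, 0, False), (10, 1, False), (20, 2, True)]:
        cls = classifier.access(page, core, ifetch)
        print(page, cls.value, "->", PLACEMENT[cls])
```

A real design also has to move or invalidate cached blocks when a page changes class; that step is left out here.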

Page 17

R-NUCA: Reactive Clustering

• Assign clusters based on level of sharing
  – Private Data given level-1 clusters (local cache)
  – Shared Data given level-16 clusters (16 neighboring machines), etc.
  – Clusters ≈ Overlapping Sets in Set-Associative Mapping
• Within a cluster, “Rotational Interleaving”
  – Load-Balancing to minimize contention on bus and controller

Page 18

Future Directions

Area has been closed.

Page 19

Just Kidding…

• Optimize for Power Consumption
• Assess trade-offs between more caches and more cores
• Minimize usage of OS, but still retain flexibility
• Application adaptation to allocated cache quotas
• Adding hardware directed thread level speculation

Page 20

Questions?

THANK YOU!

Page 21

Backup

• Commercial and research prototypes
  – Sun MAJC
  – Piranha
  – IBM Power 4/5
  – Stanford Hydra
