ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors
Mohammad Hammoud, Sangyeun Cho, and Rami Melhem
Presenter: Socrates Demetriades
Dept. of Computer Science, University of Pittsburgh

Page 1: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Mohammad Hammoud, Sangyeun Cho, and Rami Melhem

Presenter: Socrates Demetriades

Dept. of Computer Science, University of Pittsburgh

Page 2: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Tiled CMP Architectures

• Tiled CMP Architectures have recently been advocated as a scalable design.

• They replicate identical building blocks (tiles) connected over a switched network on-chip (NoC).

• A tile typically incorporates a private L1 cache and an L2 cache bank.

• A traditional practice for CMP caches is one that logically shares the physically distributed L2 banks: the shared scheme.

Page 3: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Shared Scheme

1. The home tile of a cache block B is designated by the HS bits of B’s physical address.
2. Tile T1 requests B and misses in the L2 cache.
3. B is fetched from main memory and mapped at its home tile (together with its directory info).
4. Pros:
• High capacity utilization.
• Simple coherence enforcement (needed only for the L1 caches).
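To make the mapping concrete, here is a minimal sketch of how the HS bits might select a home tile. The slides only say the home tile comes from HS bits of B’s physical address; the specific bit positions (directly above the 64-byte line offset) are an assumption for illustration.

```python
# Hypothetical HS-bit decoding for a 16-tile CMP with 64-byte lines.
# The bit positions are an assumption; the slides only state that the
# home tile is designated by HS bits of the physical address.

LINE_OFFSET_BITS = 6            # log2(64-byte cache line)
HS_BITS = 4                     # log2(16 tiles)

def home_tile(phys_addr: int) -> int:
    """Return the home tile ID encoded by the HS bits of the address."""
    return (phys_addr >> LINE_OFFSET_BITS) & ((1 << HS_BITS) - 1)

# Example: an address whose HS bits are 0b1111 maps to tile T15,
# matching the "HS of B = 1111 (T15)" annotation on a later slide.
assert home_tile(0b1111 << LINE_OFFSET_BITS) == 15
```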

Page 4: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Shared Scheme: Latency Problem (Cons)

• Access latencies to L2 banks differ depending on the distances between requester cores and target banks.
• This design is referred to as a Non-Uniform Cache Architecture (NUCA).

Page 5: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

NUCA Solution: Block Migration

• Idea: move accessed blocks closer to the requesting cores (block migration).
• HS of B = 1111 (T15).
• T0 requests block B: total hops = 14.
• B is migrated from T15 to T0.
• T0 requests B again: local hit, total hops = 0.

Page 6: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

NUCA Solution: Block Migration

• HS of B = 0110 (T6).
• Before migration: T3 requests B (hops = 6); T0 requests B (hops = 8); T8 requests B (hops = 8). Total hops = 22.
• Assume B is migrated to T3.
• After migration: T3 requests B (hops = 0); T0 requests B (hops = 11); T8 requests B (hops = 13). Total hops = 24.
• Though T3 saved 6 hops, in total there is a loss of 2 hops.
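The slide’s totals can be reproduced with a simple cost model. A plausible reading of the numbers (an inference, not stated on the slides) is that each network traversal costs the Manhattan distance plus one, i.e., the routers on a minimal path, doubled for a round trip, and that the post-migration accesses by T0 and T8 are 3-way transfers through the home tile T6:

```python
# Reproduces this slide's hop totals on a 4x4 mesh. The cost model
# (routers on a minimal path = Manhattan distance + 1 per traversal)
# is inferred from the slide's numbers, not stated explicitly.

def coord(tile):
    return divmod(tile, 4)          # row-major 4x4 mesh

def hops(src, dst):
    (r1, c1), (r2, c2) = coord(src), coord(dst)
    return abs(r1 - r2) + abs(c1 - c2) + 1

# Before migration: each requester makes a round trip to home tile T6.
before = sum(2 * hops(t, 6) for t in (3, 0, 8))     # 6 + 8 + 8 = 22

# After migrating B to T3: T3 hits locally, while T0 and T8 reach the
# host T3 through the home tile T6 (a 3-way transfer).
after = 0                                           # T3: local hit
after += hops(0, 6) + hops(6, 3) + hops(3, 0)       # T0: 11 hops
after += hops(8, 6) + hops(6, 3) + hops(3, 8)       # T8: 13 hops

print(before, after)                                # 22 24
```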

Page 7: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Our work

• Collect information about the tiles (sharers) that have accessed a block B.

• Depend on the past to predict the future: a core that accessed a block in the past is likely to access it again in the future.

• Migrate B to a tile (the host) that minimizes the overall number of NoC hops needed.

Page 8: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Talk roadmap

• Predicting the optimal host location.
• Locating migratory blocks: the cache-the-cache-tag policy.
• Replacement policy upon migration: the swap-with-the-lru policy.
• Quantitative evaluation.
• Conclusion and future work.

Page 9: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Predicting Optimal Host Location

• Keeping a cache block B at its home tile might not be optimal.
• The best host location for B is not known until runtime.
• Adaptive Controlled Migration (ACM):
• Keep a pattern recording the accessibility of B (which tiles have accessed it).
• At runtime, after a specific migration frequency level is reached for B, compute the best host for B by finding the tile that minimizes the total latency cost among B’s sharers.

Page 10: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

ACM: A Working Example

Tiles 0 and 6 are the sharers of B:
• Case 1: Tile 3 is the host. Total latency cost = 14.
• Case 2: Tile 15 is the host. Total latency cost = 22.
• Case 3: Tile 2 is the host. Total latency cost = 10.
• Case 4: Tile 0 is the host. Total latency cost = 8.
ACM selects T0.
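A minimal sketch of the host-selection step, assuming the same round-trip cost model inferred earlier (2 × (Manhattan distance + 1), and 0 for a local access); under that assumption it reproduces all four per-case costs above and selects T0:

```python
# Sketch of ACM's host selection on a 4x4 mesh. The round-trip cost
# model is an inference that reproduces this slide's per-case totals.

def coord(tile):
    return divmod(tile, 4)          # row-major 4x4 mesh

def round_trip(sharer, host):
    if sharer == host:
        return 0                    # local access costs nothing
    (r1, c1), (r2, c2) = coord(sharer), coord(host)
    return 2 * (abs(r1 - r2) + abs(c1 - c2) + 1)

def total_cost(sharers, host):
    return sum(round_trip(s, host) for s in sharers)

def best_host(sharers, tiles=range(16)):
    """Pick the tile minimizing the total latency cost over all sharers."""
    return min(tiles, key=lambda h: total_cost(sharers, h))

sharers = [0, 6]
print([total_cost(sharers, h) for h in (3, 15, 2, 0)])  # [14, 22, 10, 8]
print(best_host(sharers))                               # 0 -> select T0
```

In the full scheme this computation would run only once B’s migration frequency level is reached (set to 10 in the evaluation later in the talk).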

Page 11: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Locating Migratory Blocks

• After a cache block B is migrated, the HS bits of B’s physical address can no longer be used to locate B on a subsequent access.
• Assume B (HS of B = 0100, home tile T4) has been migrated to a new host tile T7.
• T3 requests B: L2 miss at T4, a false L2 miss, since B actually resides at T7.
• A tag can be kept at T4 to point to T7.
• Resulting scenario: a 3-way cache-to-cache transfer (T3, T4, and T7).
• Deficiencies:
• The migration becomes useless.
• It fails to exploit distance locality.

Page 12: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Locating Migratory Blocks: cache-the-cache-tag Policy

• Idea: cache the tag of block B at the requester’s tile (within a data structure referred to as the MT table).
• HS of B = 0100 (T4).
• First access: T3 requests B and looks up its MT table before reaching B’s home tile. MT miss: 3-way communication. T3 then caches B’s tag in its MT table.
• Second and later accesses: T3 requests B and looks up its MT table. MT hit: direct fetch from the host tile.
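A minimal sketch of this lookup flow, modeling each tile’s MT table as a dictionary from block address to current host; the names and structures are illustrative, not the paper’s hardware organization:

```python
# Illustrative cache-the-cache-tag lookup flow. mt_tables maps each
# tile to its MT table (block address -> current host tile).

def locate_block(requester, block, mt_tables, home_of, host_of):
    """Return the tile that services the request for `block`."""
    mt = mt_tables[requester]
    if block in mt:                 # MT hit: fetch directly from the host
        return mt[block]
    home = home_of(block)           # MT miss: go to B's home tile first
    host = host_of(block)           # the home tile forwards to the host
    if host != home:                # (3-way communication on first access)
        mt[block] = host            # cache B's tag for later accesses
    return host

# Usage, mirroring the slide: B's home is T4, B currently lives at T7.
mt_tables = {3: {}}
print(locate_block(3, 0xB0, mt_tables, lambda b: 4, lambda b: 7))  # 7, 3-way
print(locate_block(3, 0xB0, mt_tables, lambda b: 4, lambda b: 7))  # 7, MT hit
```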

Page 13: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Locating Migratory Blocks: cache-the-cache-tag Policy

The MT table of a tile T can now hold two types of tags:
• A tag for each block B whose home tile is T but which has been migrated to another tile (local entry).
• Tags that track the locations of migratory blocks recently accessed by T whose home tile is not T (remote entries).

The MT table replacement policy victimizes, in order:
• An invalid tag.
• The LRU remote entry.

The remote and local tags of B are kept consistent by extending the local entry of B at B’s home tile with a bit mask that indicates which tiles have cached corresponding remote entries.
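The two entry types might look as follows; the field names are assumptions for exposition, with the sharer bit mask attached to the local entry as the slide describes:

```python
# Illustrative MT-table entry shapes. Field names are assumptions.
from dataclasses import dataclass

@dataclass
class LocalEntry:
    """Kept at B's home tile T: B has been migrated elsewhere."""
    tag: int            # B's address tag
    host: int           # tile B has been migrated to
    remote_mask: int    # one bit per tile holding a remote entry for B

@dataclass
class RemoteEntry:
    """Kept at a requester: a recently accessed migratory block."""
    tag: int
    host: int
```

Presumably the home tile walks remote_mask to update or invalidate stale remote entries when B moves again, which is how the two kinds of tags stay consistent.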

Page 14: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Replacement Policy Upon Migration: swap-with-the-lru Policy

• After the ACM algorithm predicts the optimal host H for a block B, a decision must be made regarding which block to replace at H upon migrating B.
• Idea: swap B with the LRU block at H (the swap-with-the-lru policy).
• The LRU block at H could be:
• A migratory block.
• A non-migratory block.
• The swap-with-the-lru policy is very effective, especially for workloads whose working sets are large relative to the L2 banks (it bears similarity to victim replication but is more robust).
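A minimal sketch of the swap itself, modeling each cache set as a list ordered from MRU (front) to LRU (back); where the swapped-out victim lands in B’s old set is an assumption:

```python
# Illustrative swap-with-the-lru. Sets are lists ordered MRU -> LRU.

def migrate_with_swap(block, src_set, dst_set):
    """Move `block` to the host's set; send the host's LRU block back."""
    victim = dst_set.pop()          # LRU block at host H; it may itself
                                    # be migratory or non-migratory
    src_set.remove(block)
    dst_set.insert(0, block)        # B becomes MRU at the host
    src_set.insert(0, victim)       # assumption: the victim becomes MRU
                                    # in B's old set (it just moved in)

src_set, dst_set = ["B", "x", "y"], ["p", "q", "lru"]
migrate_with_swap("B", src_set, dst_set)
print(src_set, dst_set)             # ['lru', 'x', 'y'] ['B', 'p', 'q']
```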

Page 15: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Quantitative Evaluation: Methodology and Benchmarks

• We simulate a 16-way tiled CMP.
• Simulator: Simics 3.0.29 (Solaris OS).
• Cache line size: 64 bytes.
• L1 I/D size/ways/latency: 16KB / 2 ways / 1 cycle.
• L2 size/ways/latency: 512KB per bank / 16 ways / 6 cycles.
• Latency per hop: 5 cycles.
• Memory latency: 300 cycles.
• Migration frequency level: 10.

Benchmarks (Name: Input):
• SPECjbb: Java HotSpot™ server VM 1.5, 4 warehouses
• Lu: 1024×1024 (16 threads)
• Ocean: 514×514 (16 threads)
• Radix: 2M integers (16 threads)
• Barnes: 16K particles (16 threads)
• Parser, Art, Equake, Mcf, Ammp, Vortex: reference inputs
• MIX1: Vortex, Ammp, Mcf, and Equake
• MIX2: Art, Equake, Parser, and Mcf

Page 16: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Quantitative Evaluation: Single-threaded and Multiprogramming Results

• VR (victim replication) successfully offsets its L2 miss-rate increase with fast replica hits for all the single-threaded benchmarks, but fails to offset the L2 miss increase of MIX1 and MIX2 (poor capacity utilization).
• For single-threaded workloads: ACM generates on average 20.5% and 3.7% better AAL (average L2 access latency) than S (the shared scheme) and VR, respectively.
• For multiprogramming workloads: ACM generates on average 2.8% and 31.3% better AAL than S and VR, respectively, while maintaining efficient capacity utilization.

Page 17: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Quantitative Evaluation: Multithreaded Results

• An increase in the degree of sharing suggests that the capacity occupied by replicas could increase significantly, leading to a decrease in the effective L2 cache size.
• ACM exhibits AALs that are on average 27% and 37.1% better than S and VR, respectively.

Page 18: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Quantitative Evaluation: Avg. Memory Access Cycles Per 1K Instr.

• ACM performs on average 18.6% and 2.6% better than S for the single-threaded and multiprogramming workloads, respectively.
• ACM performs on average 20.7% better than S for the multithreaded workloads.
• VR performs on average 15.1% better than S for the single-threaded workloads but 38.4% worse than S for the multiprogramming workloads.
• VR performs on average 19.6% worse than S for the multithreaded workloads.

Page 19: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Quantitative Evaluation: ACM Scalability


• As the number of tiles on a CMP platform increases, the NUCA problem is exacerbated.
• ACM is independent of the underlying platform and always selects hosts that minimize AAL.
• More exposure to the NUCA problem translates effectively into a larger benefit from ACM.
• For the simulated benchmarks: with a 16-way CMP, ACM improves AAL by 11.6% over S; with a 32-way CMP, ACM improves AAL by 56.6% on average over S.

Page 20: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Quantitative Evaluation: Sensitivity to MT table Sizes.

With MT tables half (50%) and one quarter (25%) the size of a regular L2 cache bank, ACM’s AAL increases by 5.9% and 11.3%, respectively, over the base configuration (100%, i.e., identical to the L2 cache bank size).

Page 21: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Quantitative Evaluation: Sensitivity to L2 Cache Sizes.

ACM maintains an AAL improvement of 39.7% over S across the examined L2 cache sizes. VR fails to demonstrate similar stability.

Page 22: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Conclusion

• This work proposes ACM, a strategy to manage CMP NUCA caches.
• ACM offers better average L2 access latency than traditional NUCA (20.4% on average) while maintaining NUCA’s L2 miss rate.
• ACM proposes a robust location strategy (cache-the-cache-tag) that can work for any NUCA migration scheme.
• ACM reveals the usefulness of the migration technique in the CMP context.

Page 23: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

Future Work

• Improve the ACM prediction mechanism (see the sketch after this list).
• Currently: cores are treated equally (we consider only the case with 0-1 weights, assigning 1 to a core that accessed block B and 0 to one that did not).
• Improvement: reflect the non-uniformity in cores’ access weights (a trade-off between access weights and storage overhead).
• Propose an adaptive mechanism for selecting migration frequency levels.
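A sketch of that weighted variant, assuming per-core access counts are kept as weights in place of the current 0-1 bits, with the same inferred round-trip cost model as in the earlier host-selection sketch:

```python
# Hypothetical weighted host selection on a 4x4 mesh. Keeping full
# access counts (rather than 0-1 bits) is the improvement suggested
# above; the cost model is the same inferred one used earlier.

def round_trip(sharer, host, width=4):
    if sharer == host:
        return 0
    (r1, c1), (r2, c2) = divmod(sharer, width), divmod(host, width)
    return 2 * (abs(r1 - r2) + abs(c1 - c2) + 1)

def best_host_weighted(access_counts, tiles=range(16)):
    """Minimize the access-count-weighted total round-trip cost."""
    return min(tiles, key=lambda h: sum(w * round_trip(s, h)
                                        for s, w in access_counts.items()))

# With equal weights, sharers {T0, T6} tie and T0 is picked (slide 10);
# if T6 dominates the accesses, the chosen host shifts to T6.
print(best_host_weighted({0: 1, 6: 1}))   # 0
print(best_host_weighted({0: 1, 6: 9}))   # 6
```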

Page 24: ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors

ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors
M. Hammoud, S. Cho, and R. Melhem

Special thanks to Socrates Demetriades
Dept. of Computer Science, University of Pittsburgh

Thank you!