ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors
Mohammad Hammoud, Sangyeun Cho, and Rami Melhem
Presenter: Socrates Demetriades
Dept. of Computer Science
University of Pittsburgh
Tiled CMP Architectures
• Tiled CMP Architectures have recently been advocated as a scalable design.
• They replicate identical building blocks (tiles) connected over a switched network on-chip (NoC).
• A tile typically incorporates a private L1 cache and an L2 cache bank.
• A traditional practice for CMP caches is to logically share the physically distributed L2 banks.
Shared Scheme
L2 miss
1. The home tile of a cache block B is designated by the HS bits of B’s physical address.
2. Tile T1 requests B.
3. B is fetched from main memory and mapped at its home tile (together with its dir info).
Pros:
• High capacity utilization.
• Simple coherence enforcement (only for L1).
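The home-tile lookup in step 1 can be sketched as below. The slides do not say which address bits form the HS bits; this sketch assumes they are the 4 bits directly above a 6-bit block offset (64-byte lines, 16 tiles). The name `home_tile` is hypothetical.

```python
BLOCK_OFFSET_BITS = 6  # 64-byte cache lines, as in the evaluation setup
HS_BITS = 4            # 16 tiles need 4 HS bits


def home_tile(phys_addr: int) -> int:
    """Return the tile index selected by the HS bits of a physical address.

    Assumption: the HS bits sit directly above the block offset.
    """
    return (phys_addr >> BLOCK_OFFSET_BITS) & ((1 << HS_BITS) - 1)


# An address whose HS bits are 1111 maps to tile 15.
addr = (0b1111 << BLOCK_OFFSET_BITS) | 0x2A
print(home_tile(addr))
```

Because the mapping depends only on address bits, every requester computes the same home tile without consulting any table.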
Shared Scheme: Latency Problem (Cons)
• Access latencies to L2 banks differ depending on the distances between requester cores and target banks.
• This design is referred to as a Non-Uniform Cache Architecture (NUCA).
NUCA Solution: Block Migration
• Move accessed blocks closer to the requesting cores (block migration).
• T0 requests block B. HS of B = 1111 (T15). Total hops = 14.
• B is migrated from T15 to T0.
• T0 requests B: local hit. Total hops = 0.
NUCA Solution: Block Migration
HS of B = 0110 (T6)
Without migration:
• T3 requests B (hops = 6).
• T0 requests B (hops = 8).
• T8 requests B (hops = 8).
Total hops = 22
Assume B is migrated to T3:
• T3 requests B (hops = 0).
• T0 requests B (hops = 11).
• T8 requests B (hops = 13).
Total hops = 24
• Though T3 saved 6 hops, in total there is a loss of 2 hops.
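The hop totals in this example can be checked with a short script. The slides do not state how hops are counted, so the model below is an assumption inferred from the numbers: tiles are numbered row-major on a 4x4 mesh, a local hit costs 0 hops, a fetch from the home tile costs the round-trip Manhattan distance plus 2, and a 3-way fetch through the home tile costs the requester-home-host-requester path plus 3.

```python
MESH_WIDTH = 4  # assumed 4x4 mesh, tiles numbered row-major


def dist(a: int, b: int) -> int:
    """Manhattan distance between two tiles on the mesh."""
    ar, ac = divmod(a, MESH_WIDTH)
    br, bc = divmod(b, MESH_WIDTH)
    return abs(ar - br) + abs(ac - bc)


def hops(requester: int, home: int, host: int) -> int:
    """Hop count under the inferred (not source-stated) model."""
    if requester == host:
        return 0                                  # local hit
    if host == home:
        return 2 * dist(requester, home) + 2      # direct fetch from home
    # 3-way transfer: requester -> home -> host -> requester
    return dist(requester, home) + dist(home, host) + dist(host, requester) + 3


requesters = [3, 0, 8]
before = [hops(r, home=6, host=6) for r in requesters]
after = [hops(r, home=6, host=3) for r in requesters]
print(before, sum(before))
print(after, sum(after))
```

Under these assumptions the script reproduces the per-request hop counts and both totals listed above.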
Our work
Collect information about tiles (sharers) that have accessed a block B.
Depend on the past to predict the future: a core that accessed a block in the past is likely to access it again in the future.
Migrate B to a tile (host) that minimizes the overall number of NoC hops needed.
Talk roadmap
Predicting optimal host location
Locating migratory blocks
• Cache-the-cache-tag policy.
Replacement policy upon migration
• Swap-with-the-lru policy.
Quantitative evaluation
Conclusion and future work
Predicting Optimal Host Location
Keeping a cache block B at its home tile might not be optimal.
The best host location of B is not known until runtime.
Adaptive Controlled Migration (ACM):
• Keep a record of the access pattern of B (which tiles have accessed it).
• At runtime (after a specific migration frequency level is reached for B), compute the best host for B by finding the tile that minimizes the total latency cost across the sharers of B.
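A minimal sketch of the bookkeeping these two bullets imply. The slides do not specify the hardware structure; this sketch assumes a per-block sharer bit vector plus an access counter, and interprets the migration frequency level as an access-count threshold (10 in the evaluation setup). `BlockPattern` and its methods are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

MIGRATION_FREQUENCY_LEVEL = 10  # value used in the evaluation


@dataclass
class BlockPattern:
    sharers: int = 0   # bit i set means tile i has accessed the block
    accesses: int = 0  # accesses since the last host computation

    def record(self, tile: int) -> bool:
        """Record an access; return True when the host should be recomputed."""
        self.sharers |= 1 << tile
        self.accesses += 1
        if self.accesses >= MIGRATION_FREQUENCY_LEVEL:
            self.accesses = 0
            return True
        return False

    def sharer_list(self) -> List[int]:
        """Tiles whose bit is set in the sharer vector."""
        return [t for t in range(self.sharers.bit_length())
                if (self.sharers >> t) & 1]


pattern = BlockPattern()
triggers = [pattern.record(t) for t in [0, 6] * 5]  # ten accesses in total
print(triggers[-1], pattern.sharer_list())
```

Only the tenth access triggers a host computation here; the sharer list it produces is the input to the host-selection step.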
ACM: A Working Example
Tiles 0 and 6 are sharers:
• Case 1: Tile 3 is the host. Total latency cost = 14.
• Case 2: Tile 15 is the host. Total latency cost = 22.
• Case 3: Tile 2 is the host. Total latency cost = 10.
• Case 4: Tile 0 is the host. Total latency cost = 8.
Select T0.
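The four cases can be reproduced with a small host-selection sketch. The cost model is an assumption inferred from the totals, not stated on the slides: tiles are numbered row-major on a 4x4 mesh, a sharer located at the host pays 0, and any other sharer pays twice its Manhattan distance to the host plus 2. Ties are broken by lowest tile index, which is also an assumption; under this model T6 ties with T0 at a cost of 8.

```python
MESH_WIDTH = 4  # assumed 4x4 mesh, tiles numbered row-major


def dist(a: int, b: int) -> int:
    """Manhattan distance between two tiles on the mesh."""
    ar, ac = divmod(a, MESH_WIDTH)
    br, bc = divmod(b, MESH_WIDTH)
    return abs(ar - br) + abs(ac - bc)


def access_cost(sharer: int, host: int) -> int:
    """Inferred cost of one access: 0 if local, else round trip plus 2."""
    return 0 if sharer == host else 2 * dist(sharer, host) + 2


def total_cost(sharers, host: int) -> int:
    return sum(access_cost(s, host) for s in sharers)


def best_host(sharers, num_tiles: int = 16) -> int:
    """Tile with the minimum total cost; min() keeps the lowest index on ties."""
    return min(range(num_tiles), key=lambda h: total_cost(sharers, h))


sharers = [0, 6]
for host in (3, 15, 2, 0):
    print(host, total_cost(sharers, host))
print("selected:", best_host(sharers))
```

An exhaustive scan over 16 tiles is cheap; whether the actual design scans all tiles or only a subset is not stated on the slides.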
Locating Migratory Blocks
After a cache block B is migrated, the HS bits of B’s physical address can no longer be used to locate B at a subsequent access.
Assume B has been migrated from its home tile T4 (HS of B = 0100) to a new host tile T7.
T3 requests B: false L2 miss at T4, because B is now at T7.
A tag can be kept at T4 to point to T7.
Scenario: 3-way cache-to-cache transfer (T3, T4, and T7).
Deficiencies:
• Useless migration.
• Fails to exploit distance locality.
Locating Migratory Blocks: cache-the-cache-tag Policy
Idea: cache the tag of block B at the requester’s tile (within a data structure referred to as the MT table).
T3 requests B. It looks up its MT table before reaching B’s home tile.
• MT miss: 3-way communication (first access).
T3 caches B’s tag in its MT table.
T3 requests B. It looks up its MT table before reaching B’s home tile.
• MT hit: direct fetch (second and subsequent accesses).
HS of B = 0100 (T4)
Locating Migratory Blocks: cache-the-cache-tag Policy
The MT table of a tile T can now hold 2 types of tags:
• A tag for each block B whose home tile is T and that has been migrated to another tile (local entry).
• Tags to keep track of the locations of the migratory blocks that have been recently accessed by T but whose home tile is not T (remote entry).
The MT table replacement policy selects a victim in this order:
• An invalid tag.
• The LRU remote entry.
The MT remote and local tags of B are kept consistent by extending the local entry of B at B’s home tile with a bit mask that indicates which tiles have cached corresponding remote entries.
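The MT table behaviour described above can be sketched in software as follows. This is a toy model under stated assumptions, not the hardware design: `MTTable` and its method names are hypothetical, an "invalid tag" is modelled as a free slot, and clearing a bit in the home tile's mask when a remote entry is evicted is not modelled.

```python
from collections import OrderedDict


class MTTable:
    """Toy MT table for one tile: block tag -> current host tile."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.local = {}              # tag -> [host, sharer bit mask]
        self.remote = OrderedDict()  # tag -> host, least recently used first

    def lookup(self, tag):
        """Return the host tile for tag, or None on an MT miss."""
        if tag in self.local:
            return self.local[tag][0]
        if tag in self.remote:
            self.remote.move_to_end(tag)  # mark as most recently used
            return self.remote[tag]
        return None

    def add_local(self, tag, host):
        """Home tile records that its block now lives at host."""
        self.local[tag] = [host, 0]

    def note_remote_copy(self, tag, tile):
        """Home tile sets the mask bit for a tile caching a remote entry."""
        self.local[tag][1] |= 1 << tile

    def cache_remote(self, tag, host):
        """Cache a remote entry; return the evicted tag, if any."""
        evicted = None
        full = len(self.local) + len(self.remote) >= self.capacity
        if tag not in self.remote and full:
            if not self.remote:
                return None  # only local entries left: do not cache
            evicted, _ = self.remote.popitem(last=False)  # LRU remote entry
        self.remote[tag] = host
        self.remote.move_to_end(tag)
        return evicted


# Scenario from the slides: B's home is T4, B has migrated to T7, T3 requests B.
t4, t3 = MTTable(capacity=4), MTTable(capacity=2)
t4.add_local("B", 7)
first = t3.lookup("B")            # MT miss: go through home tile T4
host = t4.lookup("B")
t3.cache_remote("B", host)
t4.note_remote_copy("B", 3)
second = t3.lookup("B")           # MT hit: fetch directly from T7
print(first, second)
```

Local entries are never chosen as victims in this sketch, matching the replacement order listed above, since losing one would make the migrated block unlocatable from its home tile.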
Replacement Policy Upon Migration: swap-with-the-lru Policy
After the ACM algorithm predicts the optimal host, H, for a block B, a decision must be made regarding which block to replace at H upon migrating B.
Idea: Swap B with the LRU block at H (swap-with-the-lru policy).
The LRU block at H could be:
• A migratory one.
• A non-migratory one.
The swap-with-the-lru policy is very effective, especially for workloads whose working sets are large relative to the L2 banks (it bears similarity to victim replication but is more robust).
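A minimal sketch of the swap, assuming each bank is modelled as a single LRU-ordered set and that the displaced block takes B's old slot at the LRU position of the source bank. The slides specify neither the set organisation nor where the displaced block sits in the LRU order, and `swap_with_lru` is a hypothetical name.

```python
from collections import OrderedDict


def swap_with_lru(banks, block, src, host):
    """Move block from bank src to bank host, swapping it with host's LRU block.

    Each bank is an OrderedDict ordered from least to most recently used.
    Returns the tag of the displaced block, or None if host was empty.
    """
    data = banks[src].pop(block)
    victim = None
    if banks[host]:
        victim, victim_data = banks[host].popitem(last=False)  # LRU at host
        banks[src][victim] = victim_data                       # takes B's slot
        banks[src].move_to_end(victim, last=False)             # assumed LRU position
    banks[host][block] = data                                  # B arrives as MRU
    return victim


banks = {
    0: OrderedDict([("X", "data-x"), ("Y", "data-y")]),
    15: OrderedDict([("B", "data-b"), ("Z", "data-z")]),
}
victim = swap_with_lru(banks, "B", src=15, host=0)
print(victim, list(banks[0]), list(banks[15]))
```

Because the two blocks trade places, no block leaves the L2, which is why the total number of cached blocks is unchanged by a migration in this model.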
Quantitative Evaluation: Methodology and Benchmarks
We simulate a 16-way tiled CMP.
• Simulator: Simics 3.0.29 (Solaris OS)
• Cache line size: 64 bytes
• L1 I-D size/ways/latency: 16KB/2 ways/1 cycle
• L2 size/ways/latency: 512KB per bank/16 ways/6 cycles
• Latency per hop: 5 cycles
• Memory latency: 300 cycles
• Migration frequency level: 10
Benchmarks (name: input):
• SPECjbb: Java HotSpot™ server VM 1.5, 4 warehouses
• Lu: 1024*1024 (16 threads)
• Ocean: 514*514 (16 threads)
• Radix: 2 M integers (16 threads)
• Barnes: 16K particles (16 threads)
• Parser, Art, Equake, Mcf, Ammp, Vortex: Reference
• MIX1: Vortex, Ammp, Mcf, and Equake
• MIX2: Art, Equake, Parser, and Mcf
Quantitative Evaluation: Single-threaded and Multiprogramming Results
(AAL: average L2 access latency; S: shared scheme; VR: victim replication.)
• VR successfully offsets its L2 miss rate increase with fast replica hits for all the single-threaded benchmarks.
• VR fails to offset the L2 miss increase of MIX1 and MIX2.
• For single-threaded workloads: ACM generates on average 20.5% and 3.7% better AAL than S and VR, respectively.
• For multiprogramming workloads: ACM generates on average 2.8% and 31.3% better AAL than S and VR, respectively.
Figure annotations: Poor Capacity Utilization; Maintains Efficient Capacity Utilization.
Quantitative Evaluation: Multithreaded Results
• An increase in the degree of sharing suggests that the capacity occupied by replicas could increase significantly, leading to a decrease in the effective L2 cache size.
• ACM exhibits AALs that are on average 27% and 37.1% better than S and VR, respectively.
Quantitative Evaluation: Avg. Memory Access Cycles Per 1K Instr.
• ACM performs on average 18.6% and 2.6% better than S for the single-threaded and multiprogramming workloads, respectively.
• ACM performs on average 20.7% better than S for multithreaded workloads.
• VR performs on average 15.1% better than S and 38.4% worse than S for the single-threaded and multiprogramming workloads, respectively.
• VR performs on average 19.6% worse than S for multithreaded workloads.
Quantitative Evaluation: ACM Scalability
• As the number of tiles on a CMP platform increases, the NUCA problem worsens.
• ACM is independent of the underlying platform and always selects hosts that minimize AAL.
• More exposure to the NUCA problem translates effectively to a larger benefit from ACM.
• For the simulated benchmarks: with a 16-way CMP, ACM improves AAL by 11.6% on average over S. With a 32-way CMP, ACM improves AAL by 56.6% on average over S.
Quantitative Evaluation: Sensitivity to MT table Sizes.
With MT table sizes of half (50%) and a quarter (25%) of the regular L2 cache bank size, ACM's AAL increases by 5.9% and 11.3%, respectively, over the base configuration (100%, i.e. identical to the L2 cache bank size).
Quantitative Evaluation: Sensitivity to L2 Cache Sizes.
ACM maintains an AAL improvement of 39.7% over S. VR fails to demonstrate stability.
Conclusion
This work proposes ACM, a strategy to manage CMP NUCA caches.
ACM offers:
• Better average L2 access latency than traditional NUCA (20.4% on average).
• The same L2 miss rate as NUCA.
ACM proposes a robust location strategy (cache-the-cache-tag) that can work for any NUCA migration scheme.
ACM reveals the usefulness of the migration technique in the CMP context.
Future work
Improve the ACM prediction mechanism.
• Currently: cores are treated equally (we consider only the case with 0-1 weights, assigning 1 to a core that accessed block B and 0 to one that didn’t).
• Improvement: reflect the non-uniformity in cores’ access weights (trade-off between access weights and storage overhead).
Propose an adaptive mechanism for selecting migration frequency levels.
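The difference between 0-1 weights and non-uniform access weights can be illustrated with a small sketch. Everything here is an assumption for illustration: the 4x4 row-major mesh, the per-access cost (0 if local, otherwise twice the Manhattan distance plus 2), the use of access counts as weights, and the name `best_host`. The slides propose the idea only and give no mechanism.

```python
MESH_WIDTH = 4  # assumed 4x4 mesh, tiles numbered row-major


def dist(a: int, b: int) -> int:
    """Manhattan distance between two tiles on the mesh."""
    ar, ac = divmod(a, MESH_WIDTH)
    br, bc = divmod(b, MESH_WIDTH)
    return abs(ar - br) + abs(ac - bc)


def access_cost(sharer: int, host: int) -> int:
    """Assumed cost of one access: 0 if local, else round trip plus 2."""
    return 0 if sharer == host else 2 * dist(sharer, host) + 2


def best_host(weights: dict, num_tiles: int = 16) -> int:
    """Tile minimizing the weighted total cost; lowest index wins ties."""
    def cost(host: int) -> int:
        return sum(w * access_cost(s, host) for s, w in weights.items())
    return min(range(num_tiles), key=cost)


# 0-1 weights: T0 and T6 both count once, regardless of how often they access B.
uniform = best_host({0: 1, 6: 1})
# Access-count weights (hypothetical counts): T6 accesses B nine times, T0 once.
weighted = best_host({0: 1, 6: 9})
print(uniform, weighted)
```

With 0-1 weights the two sharers are indistinguishable, while access counts pull the block toward the heavier user at the price of storing a counter per sharer, which is the storage trade-off the bullet mentions.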
ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors
M. Hammoud, S. Cho, and R. Melhem
Special thanks to Socrates Demetriades
Dept. of Computer Science
University of Pittsburgh
Thank you!