SURVEY ON CACHE REPLICATION MECHANISMS
BY
LAKSHMI YASASWI KAMIREDDY
(651771619)
CONTENTS
Abstract
1. Introduction
2. Background
3. Schemes
3.1. Victim Replication
3.2. Adaptive Selective Replication
3.3. Adaptive Probability Replication
3.4. Dynamic Reusability based Replication
3.5. Locality Aware Data Replication
4. Results
5. Conclusions
6. References
Abstract
Present-day systems demand multicore processors on chip. As the number of cores on a Chip Multi-Processor
(CMP) increases, so does the need for effective management of the on-chip cache. Cache management
plays an important role in improving performance, which is achieved by reducing the number of misses and the miss
latency. These two factors, the number of misses and the miss latency, cannot be reduced at the same time. Some CMPs use
a shared L2 cache to maximize the on-chip cache capacity and minimize off-chip misses, while others use private L2
caches, replicating data to limit the delay due to global wires and minimize cache access time. Recent hybrid proposals
use selective replication to strike a balance between miss latency and on-chip capacity. There are two kinds of
replication: static and dynamic. This paper focuses on the existing dynamic replication
schemes and gives an analysis of each scheme on several benchmarks.
1. Introduction
Upcoming generations of multicore processors and applications will operate on massive data. A major challenge for near-future
multicore processors is the data movement incurred by conventional cache hierarchies, which has a very high
impact on off-chip bandwidth, on-chip memory access latency, and energy consumption. A large on-chip cache is
possible but is not a scalable solution: it is limited to a small number of cores, and hence the only practical option is to
physically distribute memory in pieces so that every core is near some portion of the cache. Such a solution might provide
a large amount of aggregate cache capacity and fast private memory for each core, but at the same time it is difficult to
manage the distributed cache and network resources efficiently, as they require architectural support for cache coherence
and consistency under the ubiquitous shared-memory model. Most directory-based protocols enable fast local caching to
exploit data locality, but even they have scalability issues. Some of the most recent proposals have addressed the issue of
directory scalability in single-chip multicores using sharer compression techniques or limited directories. But fast
private caches still suffer from two major problems: (1) due to capacity constraints, they cannot hold the working set of
applications that operate on massive data, and (2) due to frequent communication between cores, data is often displaced
from them [1]. This leads to increased network traffic and a higher request rate to the last-level cache. On-chip wires do not
scale at the same pace as transistors, so data movement not only impacts memory access latency but
also consumes more power due to the energy consumption of network and cache resources [2]. Though private LLC
organizations (e.g., [3]) have low hit latencies, their off-chip miss rates are high in applications that have uneven
distributions of working sets or exhibit high degrees of sharing (due to cache line replication). Shared LLC organizations
(e.g., [4]), on the other hand, lead to non-uniform cache access (NUCA) [5] that hurts on-chip locality, but their off-chip
miss rates are low since cache lines are not replicated. Several proposals have explored the idea of a hybrid LLC.
Replication mechanisms have been proposed to balance access latency against cache capacity in hybrid L2 cache
designs [6][7]. Two types of replication approaches have been proposed: static [8, 9] and dynamic [10, 11, 12, 13,
14]. In static replication, a data block is placed through predefined address interleaving; therefore, the LLC bank that
may contain a given data block is fixed. The data placement of instruction pages in R-NUCA [8] and in S-NUCA [9] is
static. In dynamic replication, a data block can be placed in any LLC bank. Victim Replication [10], Adaptive Selective
Replication [11], Adaptive Probability Replication [12], Dynamic Reusability-based Replication [13], and Locality-Aware
Data Replication at the Last-Level Cache [14] fall into this category. These replication mechanisms have their own advantages
and disadvantages. This paper analyzes these dynamic replication schemes.
2. Background
In chronological order, the first of the above dynamic replication mechanisms is Victim Replication
(VR) [10], which is based on shared caches but tries to capture evictions from the local primary cache in the
local L2 slice to reduce subsequent access latency to the same cache block. Victim replicas and global L2 cache blocks
share L2 slice capacity. In VR, all primary cache misses must first check the local L2 tags in case there is a valid local
replica. On a replica miss, the request is forwarded to the home tile. On a replica hit, the replica is invalidated in the local
L2 slice and moved into the primary cache [10]. The next technique introduced is Adaptive Selective Replication
(ASR) [11], which adopts a replication mechanism similar to VR's but focuses on the capacity contention between replicas
and global L2 cache blocks. ASR dynamically estimates the cost (extra misses) and benefit (lower hit latency) of
replication and adjusts the number of receivable victims to avoid hurting L2 cache performance [11]. Another
replication scheme, the Adaptive Probability Replication (APR) [12] mechanism, counts each
block's accesses in L2 cache slices and monitors the number of evicted blocks with different numbers of accesses to
estimate the re-reference probability of blocks in their lifetime at runtime. Using the predicted re-reference probability, APR
adopts a probability replication policy and a probability insertion policy to replicate blocks at corresponding probabilities and
insert them at appropriate positions according to their re-reference probability [12]. In the same conference, another
mechanism named Dynamic Reusability-based Replication (DRR) [13] was introduced. DRR is a hybrid cache
architecture that dynamically monitors the reuse pattern of cache blocks and replicates blocks with high reusability to
appropriate L2 cache slices [13]. Replicas are shared by nearby cores through a fast lookup mechanism, Network Address
Mapping, which records the location of the nearest replica in network interfaces and forwards subsequent L1 miss
requests to the replica immediately. This improves the performance of shared caches by exploiting reusability-based
replication, a fast lookup mechanism, and replica sharing. The most recent technique introduced is the locality-aware selective
data replication protocol for the last-level cache (LLC) [14]. This method lowers memory access latency and energy
by replicating only high locality cache lines in the LLC slice of the requesting core, and simultaneously keeps the off-chip
miss rate low. This approach relies on low overhead yet highly accurate in-hardware runtime classification of data locality
at the cache-line granularity, and only allows replication for cache lines with high reuse [14]. A classifier
captures the LLC pressure at existing replica locations and adapts the replication decision accordingly. The
locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional
coherence protocols. The following sections discuss the schemes in detail.
3. Schemes
3.1. Victim Replication (VR)
Victim replication (VR) is a hybrid scheme that combines the large capacity of a shared L2 cache with the low hit
latency of a private L2 cache. VR is primarily based on a shared L2 cache but additionally tries to capture evictions from the
local primary cache in the local L2 slice. Each retained victim is a local L2 replica of a line that already exists in the
L2 of the remote home tile. When a miss occurs at the shared L2 cache, a line is brought in from memory and placed in
the on chip L2 at a home tile determined by a subset of the physical address bits, as in shared L2 cache. The requested line
is directly forwarded to the primary cache of the requesting processor. If the line’s residency in the primary cache is
terminated because of an incoming invalidation or write back request, the usual shared L2 cache protocol is followed. If a
primary cache line is evicted because of a conflict or capacity miss, a copy of the victim line is kept in the local slice
to reduce subsequent access latency to the same line. A global line with remote sharers is never evicted in favor of a local
replica, as an actively cached global line is likely to be in use. The VR replication policy will replace the following classes
of cache lines in the target set in descending priority order: (1) An invalid line; (2) A global line with no sharers; (3) An
existing replica. If there are no lines belonging to these three categories, no replica is made and the victim is evicted from
the tile as in shared L2 cache [10]. If there is more than one line in the selected category, VR picks at random. All primary
cache misses first check the local L2 tags in case there’s a valid local replica. On a replica miss, the request is forwarded
to the home tile. On a replica hit, the replica is invalidated in the local L2 slice and moved into the primary cache. When a
downgrade or invalidation request is received from the home tile, the L2 tags will also be checked in addition to the
primary cache tags [10].
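The replacement-priority search above can be sketched as follows. This is a minimal illustration; the `Line` class and its field names are hypothetical stand-ins, not the actual hardware structures of [10].

```python
import random

INVALID, GLOBAL, REPLICA = "invalid", "global", "replica"

class Line:
    """A cache line in the target set (illustrative representation)."""
    def __init__(self, state, sharers=()):
        self.state = state
        self.sharers = set(sharers)

def pick_replica_victim(target_set):
    """Pick the line to evict for an incoming replica, or None.

    VR priority order: (1) an invalid line, (2) a global line with
    no sharers, (3) an existing replica. Ties within a category are
    broken at random. If no line qualifies, no replica is made.
    """
    for qualifies in (
        lambda l: l.state == INVALID,
        lambda l: l.state == GLOBAL and not l.sharers,
        lambda l: l.state == REPLICA,
    ):
        candidates = [l for l in target_set if qualifies(l)]
        if candidates:
            return random.choice(candidates)
    return None
```

Note how a global line with remote sharers never qualifies, matching the policy's refusal to evict actively shared data in favor of a local replica.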
3.2. Adaptive Selective Replication (ASR)
Adaptive Selective Replication (ASR) obtains the optimum replication level by balancing the benefits of replication against
the costs. L2 cache block replication improves memory system performance when the average L1 miss latency is reduced.
The following equation describes the average cycles for L1 cache misses normalized by instructions executed:
L1 miss cycles / instruction =
      (P_localL2 * L_localL2) / (instructions / L1 misses)
    + (P_remoteL2 * L_remoteL2) / (instructions / L1 misses)
    + (P_miss * L_miss) / (instructions / L1 misses)
P_x is the probability of a memory request being satisfied by entity x, where x is the local L2 cache, a remote L2 cache, or
main memory, and L_x is the latency of that entity [11]. The combination of the localL2 and remoteL2 terms represents
the memory cycles spent on L2 cache hits, and the third term the memory cycles spent on L2 cache misses.
Replication increases the probability that L1 misses hit in the local L2 cache, thus the PlocalL2 term increases and the
PremoteL2 term decreases. Because the latency of a local L2 cache hit is tens of cycles faster than a remote L2 cache hit,
the net effect of increasing replication is a reduction in cycles spent on L2 cache hits. However, more replication devotes
more capacity to replica blocks, thus fewer unique blocks exist on-chip, increasing the probability of L2 cache misses,
Pmiss. If the probability of a miss increases significantly due to replication, the miss term will dominate, as the latency of
memory is hundreds of cycles greater than the L2 hit latencies. Therefore, balancing these three terms is necessary to
improve memory system performance.
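A small numeric illustration of this balance follows; the latencies, probabilities, and miss interval are made-up values chosen for the example, not measurements from [11].

```python
def l1_miss_cycles_per_instr(p_local, l_local, p_remote, l_remote,
                             p_miss, l_miss, instrs_per_l1_miss):
    """Evaluate the equation above: each outcome's probability times its
    latency, normalized by the number of instructions per L1 miss."""
    return (p_local * l_local + p_remote * l_remote + p_miss * l_miss) \
        / instrs_per_l1_miss

# Illustrative points on the curve: no, moderate, and full replication.
# Local latency 15 cycles, remote 40, memory 400; 50 instrs per L1 miss.
no_rep   = l1_miss_cycles_per_instr(0.30, 15, 0.60, 40, 0.10, 400, 50)
mod_rep  = l1_miss_cycles_per_instr(0.50, 15, 0.39, 40, 0.11, 400, 50)
full_rep = l1_miss_cycles_per_instr(0.65, 15, 0.20, 40, 0.15, 400, 50)
# mod_rep is the smallest of the three: the minimum lies at an
# intermediate replication level, as the curves in Figure 1 suggest.
```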
Optimal performance often arises from an intermediate replication level. Figure 1 graphically depicts this tradeoff. The
Replication Benefit curve, Figure 1(a), illustrates the trend that increasing replication reduces L2 cache hit cycles. Due to
the strong locality of shared read-only requests, a small degree of L2 replication can significantly reduce L2 hit cycles by
moving many previous remote L2 hits into the local cache. In contrast, increased replication gradually reduces L2 hit
cycles because fewer unique blocks on-chip lead to fewer total L2 hits. The Replication Cost curve, Figure 1(b), illustrates
that increasing L2 replication increases the memory cycles spent on off-chip misses. The Replication Effectiveness curve,
Figure 1(c), combines the benefit and cost curves and plots the total memory cycles. Because the benefit and cost curves
are generally convex and have opposite slopes, the minimum of the Replication Effectiveness curve often lies between
allowing all replications and no replications. ASR estimates the slopes of the benefit and cost curves to approximate the
optimal replication level.
Figure 1 [11]: (a) Replication Benefit, (b) Replication Cost, and (c) Replication Effectiveness curves.
By dynamically monitoring the benefit and cost of replication, ASR attempts to achieve the optimal level of replication.
ASR identifies discrete replication levels and makes a piecewise approximation of the memory cycle slope [11]. Thus
ASR simplifies the analysis to a local decision of whether the amount of replication should be increased, decreased, or
remain the same. Figure 1 illustrates the case where the current replication level, labeled C, results in HC hit cycles-per-
instruction and MC Miss cycles-per-instruction. ASR considers three alternatives: (i) increasing replication to the next
higher level, labeled H, (ii) decreasing replication to the next lower level, labeled L, or (iii) leaving the replication
unchanged [11]. To make this decision, ASR not only needs HC and MC, but also four additional hit and miss cycles-per-
instruction values: HH and MH for the next higher level and HL and ML for the next lower level. To simplify the
collection process, ASR estimates only the four differences between the hit and miss cycles-per-instruction: (1) the benefit
of increasing replication (decrease in L2 hit cycles, HC - HH); (2) the cost of increasing replication (increase in L2 miss
cycles, MH - MC); (3) the benefit of decreasing replication, (decrease in L2 miss cycles, MC - ML); and (4) the cost of
decreasing replication (increase in L2 hit cycles, HL - HC). By comparing these cost and benefit counters, ASR will
increase, decrease, or leave unchanged the replication level.
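The decision rule reduces to comparing those four differences. The sketch below is an illustrative simplification; [11] realizes these comparisons with hardware cost/benefit counters rather than explicit cycle values.

```python
def asr_decide(hc, hh, hl, mc, mh, ml):
    """Choose the next replication level from per-instruction cycle
    estimates: h* are L2 hit cycles and m* are miss cycles at the
    current (c), next higher (h), and next lower (l) levels."""
    benefit_up = hc - hh    # hit cycles saved by more replication
    cost_up = mh - mc       # miss cycles added by more replication
    benefit_down = mc - ml  # miss cycles saved by less replication
    cost_down = hl - hc     # hit cycles added by less replication
    if benefit_up > cost_up:
        return "increase"
    if benefit_down > cost_down:
        return "decrease"
    return "unchanged"
```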
3.3. Adaptive Probability Replication (APR)
This design is based on a distributed shared L2 cache design. To predict re-reference probability, APR adds a counter for
each cache block to record and transfer the number of accesses. In APR, each tile stores re-reference probability of blocks
from other remote L2 cache slices in its network interface component using a simple lookup table called Re-Reference
Probability Buffer (RRPB) [12]. The RRPB keeps re-reference probability entries for all other L2 slices. Each entry
holds replication thresholds for different numbers of accesses; the thresholds indicate the re-reference probability of
blocks with that number of accesses. In the local L2 slice, if there is an invalid block or the victim is not a shared
global block, the replica is filled into the L2 cache slice; otherwise, the replication is abandoned. The insert position
of a replica is determined by its corresponding re-reference probability. When a replica is accessed again, it
is deleted from the local L2 cache slice and moved to the local L1 cache.
APR counts every access to L2 cache blocks and records the number of evicted blocks with different numbers of
accesses to estimate re-reference probability at runtime. For example, the re-reference probability of a block with N
accesses is the proportion of evicted blocks with more than N accesses among evicted blocks with at least
N accesses. The estimation is performed only when a global block replacement occurs (not a replica
replacement). The re-reference probabilities are propagated to all other tiles at a certain interval (such as 10000 cycles) by
attaching them to any response message. Because blocks from remote L2 slices may be accessed in the local L2 cache
slice due to replication, each replica access also increments the corresponding counter associated with the block. When a
replica is accessed, the associated counter is moved to the L1 cache block along with it. The counter values of blocks in L1
caches are sent back to the home L2 slice when the blocks are evicted, so the home slice can accumulate the access counts.
Like ASR, a linear feedback shift register generates a pseudo-random number, which is compared to the corresponding
replication threshold. When an evicted L1 block passes through the network interface, APR captures this message and
looks up the corresponding RRPB entry according to its address. If the replication threshold corresponding to the block's
number of accesses is less than the generated random number, the block is sent to the local L2 slice; otherwise, it is
evicted to the home L2 slice. In the local L2 cache slice, if there is an invalid block or the victim is not a shared global
block, the block is inserted; otherwise, it is sent to the home L2 slice. Blocks with more accesses have higher re-reference
probability. Probability insertion is implemented in APR according to the number of accesses of the replicated block,
which indicates the insert position. If the block's number of accesses exceeds the associativity, the block is
inserted at the MRU position. The aim of probability insertion is to make blocks with lower re-reference probability
survive for a shorter time.
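The estimation and the probabilistic replication decision can be sketched as follows. The histogram representation and the uniform random draw are illustrative assumptions; the hardware of [12] uses an LFSR and the RRPB threshold entries.

```python
import random

def rereference_probability(evicted_hist, n):
    """Estimate P(re-reference) for a block with n accesses: the share
    of evicted blocks with more than n accesses among those evicted
    with at least n accesses. `evicted_hist` maps an access count to
    the number of blocks evicted with that count."""
    at_least = sum(c for acc, c in evicted_hist.items() if acc >= n)
    more = sum(c for acc, c in evicted_hist.items() if acc > n)
    return more / at_least if at_least else 0.0

def should_replicate(prob, draw=random.random):
    # In hardware an LFSR produces the pseudo-random number that is
    # compared against the replication threshold for this probability.
    return draw() < prob
```

For instance, if 50 blocks were evicted after 1 access, 30 after 2, and 20 after 3, a block that has been accessed once has a 50% estimated chance of being re-referenced before eviction.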
3.4. Dynamic Reusability based Replication (DRR)
DRR dynamically replicates blocks with high reusability to other appropriate L2 cache slices and allows the replicas to be
shared by nearby cores via a fast lookup mechanism [13]. A set-associative Core Access Counter Buffer (CACB) is used
to determine which blocks should be replicated and the corresponding destination of replication. For recently accessed blocks,
the CACB records access counts for cores more than a certain number of hops (for example, 2 hops, which is also the smallest
distance between the home slice and replica slices) away from the home slice. Thus, only 10 counters are needed per
CACB entry for a 16-core CMP. When a block receives a Read request from one such core, the corresponding counter is
incremented. To maintain coherence, when the block receives a Write request, all of its counters are reset to
zero. In the CACB, a larger counter means higher reusability. When the maximum counter of a block reaches a certain
threshold (for example, 5), the block is replicated to the slice corresponding to that counter.
After the block to replicate and the destination have been determined, the home L2 cache slice sends a replication request to the
destination. When the destination receives the replication request, it allocates cache space to hold the replica. If the
destination has space available for the replica, it responds with an acknowledgement to the home L2 cache slice; otherwise,
it responds with a fail message. Once the replication operation is completed, the replication destination is stored in
a set-associative Replication Directory Buffer (RDB) in the home L2 slice. If the replication fails, the destination is not stored
in the RDB. When a Read request reaches the home L2 cache slice, if the distance between the requesting core and the
nearest replica is less than the given replica distance (for example, 3 hops in a 16-core CMP), the request is forwarded to the
nearest replica; otherwise, the request is satisfied at the home L2 cache slice.
When the replica location receives the forwarded request, it responds with data to the requesting core. When the data response
message passes through the network interface of the requesting core, the replica's location is stored in a set-associative Network
Address Mapping Buffer (NAMB). The NAMB is embedded in the network interface and records the locations of replicas
that have serviced the core. When an L1 cache Read miss request passes through the network interface, it first searches the
NAMB. On a NAMB hit, the request is forwarded immediately to the recorded replica location; otherwise, the request
continues on to the home L2 cache slice. For coherence maintenance, an L1 cache Write miss request passing
through the network interface goes directly to the home L2 cache slice and does not search the NAMB. This ensures that write
operations are serialized at the unique home L2 cache slice. Because the NAMB is embedded in the network interface, its
access latency can be hidden behind other network interface operations.
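The CACB bookkeeping that triggers replication can be sketched as follows. The entry shape and method names are illustrative; the threshold value of 5 is the example given in the text.

```python
REPLICATION_THRESHOLD = 5  # example threshold from the text

class CACBEntry:
    """Per-block access counters for cores at least 2 hops from the
    home slice (10 such cores in a 16-core CMP)."""
    def __init__(self, far_cores):
        self.counters = {core: 0 for core in far_cores}

    def on_read(self, core):
        """Count a Read; return the replication destination once a
        counter reaches the threshold, else None."""
        if core not in self.counters:
            return None  # a nearby core: no replication needed
        self.counters[core] += 1
        if self.counters[core] >= REPLICATION_THRESHOLD:
            return core
        return None

    def on_write(self):
        # A Write resets every counter so stale replicas are not made.
        for core in self.counters:
            self.counters[core] = 0
```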
3.5. Locality Aware Data Replication at Last Level Cache
Run-length is defined as the number of accesses to a cache line (at the LLC) from a particular core before a conflicting
access by another core or before the line is evicted. The greater the number of accesses with high run-length, the greater the
benefit of replicating the cache line in the requester's LLC slice. Instructions and shared data (both read-only and read-write)
can be replicated if they demonstrate good reuse. It is also important to adapt the replication decision at runtime in case the
reuse of data changes during an application's execution.
On an L1 cache read miss, the core first looks up its local LLC slice for a replica. If a replica is found, the cache line is
inserted at the private L1 cache. A Replica Reuse counter at the LLC directory entry is incremented. The replica reuse
counter is a saturating counter used to capture reuse information. It is initialized to ‘1’ on replica creation and incremented
on every replica hit. On the other hand, if a replica is not found, the request is forwarded to the LLC home location. If the
cache line is not found there, it is either brought in from the off-chip memory or the underlying coherence protocol takes
the necessary actions to obtain the most recent copy of the cache line. A replication mode bit is used to identify whether a
replica is allowed to be created for the particular core and a home reuse counter is used to track the number of times the
cache line is accessed at the home location by the particular core. This counter is initialized to ‘0’ and incremented on
every hit at the LLC home location. If the replication mode bit is set to true, the cache line is inserted in the requester's
LLC slice and private L1 cache. Otherwise, the home reuse counter is incremented; if this counter has reached the
Replication Threshold (RT), the requesting core is "promoted" and the cache line is inserted in its LLC slice and private
L1 cache. If the home reuse counter is still less than RT, a replica is not created and the cache line is inserted only in the
requester's private L1 cache [14].
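The per-core classification on a read miss at the home location can be sketched as follows. Structure and method names are illustrative; the control flow follows the description of [14].

```python
RT = 3  # Replication Threshold; [14] finds RT = 3 a good trade-off

class HomeDirectoryEntry:
    """Per-cache-line state at the LLC home location (illustrative)."""
    def __init__(self):
        self.replication_mode = {}  # core -> bool (replica allowed?)
        self.home_reuse = {}        # core -> reuse counter, starts at 0

    def on_read_at_home(self, core):
        """Return True if a replica is created in the core's LLC slice."""
        if self.replication_mode.get(core, False):
            return True  # classified as high-locality: replicate
        # Track reuse at the home location; promote once RT is reached.
        self.home_reuse[core] = self.home_reuse.get(core, 0) + 1
        if self.home_reuse[core] >= RT:
            self.replication_mode[core] = True  # "promote" the core
            return True
        return False  # line goes only to the requester's private L1
```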
On an L1 cache write miss for an exclusive copy of a cache line, the protocol checks the local LLC slice for a replica. If a
replica exists in the Modified (M) or Exclusive (E) state, the cache line is inserted at the private L1 cache. In addition, the
Replica Reuse counter is incremented. If a replica is not found or exists in the Shared (S) state, the request is forwarded to
the LLC home location. The directory invalidates all the LLC replicas and L1 cache copies of the cache line, thereby
maintaining the single-writer multiple-reader invariant. On an invalidation request, both the LLC slice and L1 cache on a
core are probed and invalidated. If a valid cache line is found in either cache, an acknowledgement is sent to the LLC
home location. In addition, if a valid LLC replica exists, the replica reuse counter is communicated back with the
acknowledgement. The locality classifier uses this information along with the home reuse counter to determine whether
the core stays as a replica sharer. If the (replica + home) reuse is greater than or equal to the RT, the core maintains replica
status, else it is demoted to non-replica status. When an L1 cache line is evicted, the LLC replica location is probed for the
same address. If a replica is found, the dirty data in the L1 cache line is merged with it, else an acknowledgement is sent
to the LLC home location. However, when an LLC replica is evicted, the L1 cache is probed for the same address and
invalidated. An acknowledgement message containing the replica reuse counter is sent back to the LLC home location. If
the replica reuse is greater than or equal to RT, the core maintains replica status; else it is demoted to non-replica status.
After all acknowledgements are processed, the Home Reuse counters of all non-replica sharers other than the writer are
reset to ‘0’. This has to be done since these sharers have not shown enough reuse to be “promoted”. If the writer is a non-
replica sharer, its home reuse counter is modified as follows. If the writer is the only sharer (replica or non-replica), its
home reuse counter is incremented, else it is reset to ‘1’. This enables the replication of migratory shared data at the
writer, while avoiding it if the replica is likely to be downgraded due to conflicting requests by other cores.
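The replica-status check applied on invalidation or replica eviction reduces to a simple threshold test. This is a sketch; the counter names follow the text of [14].

```python
RT = 3  # Replication Threshold

def keeps_replica_status(replica_reuse, home_reuse, rt=RT):
    """A core keeps replica status iff its combined replica and home
    reuse meets the threshold; otherwise it is demoted to
    non-replica status."""
    return (replica_reuse + home_reuse) >= rt
```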
4. Results
APR improves performance by 12% on average over the baseline (shared cache design) for the SPLASH-2 benchmarks and by
24% for the PARSEC benchmarks. VR shows similar performance on SPLASH-2 and PARSEC: 5% over the baseline for
SPLASH-2 and 4% for PARSEC. ASR performs similarly to VR on SPLASH-2 but better than VR on PARSEC (15% over the
baseline). R-NUCA obtains 2% and 8% performance gains for SPLASH-2 and PARSEC respectively; this is because
instructions have strong locality and occupy little capacity in these benchmarks. APR demonstrates stable performance
improvement on both suites. Overall, APR improves performance by 21% on average over the baseline, by 17% over VR, by
10% over ASR, and by 15% over R-NUCA. Replication schemes increase the L2 cache miss rate. Figures 4 and 5 show the
normalized L2 cache miss ratio of the evaluated replication schemes for the SPLASH-2 and PARSEC benchmarks
respectively. APR improves the L2 miss ratio by as much as 49% for SPLASH-2 and by 38% for PARSEC; compared to VR
and ASR, APR shows a lower L2 miss ratio. This comes from its replication filtering policy and replica insertion policy: the
probability replication filtering policy reduces contention for L2 cache capacity, and the probability insertion policy
reduces the residency time of replicas. Both policies tend to reduce the impact of extra replicas on the limited L2 capacity.
Figure 2: Normalized Execution time for Splash-2 benchmarks
Figure 3: Normalized Execution time for Parsec Benchmark
Figure 4: Normalized Miss ratio for Splash-2 benchmarks
Figure 5: Normalized Miss ratio for Parsec benchmarks
DRR achieves lower read latency than the other techniques. Figure 6 shows the normalized L2 cache average read latency. VR,
ASR, and R-NUCA do not reduce read latency relative to the baseline, while DRR reduces read latency by 12%. These results
show that DRR takes full advantage of replicas through its network address mapping mechanism; unnecessary extra
search latency offsets the benefit of replicas in VR and ASR, and R-NUCA's instruction replication has limited benefit for the
SPLASH-2 and PARSEC benchmarks. Figure 7 shows the normalized execution time. As can be seen, DRR improves the
total execution time of almost all benchmarks compared to the baseline system, VR, ASR, and R-NUCA. The maximum
performance gain occurs on the dedup benchmark, with about 69% improvement. The average performance
improvement is about 30% over the baseline system, about 16% over VR, about 8% over ASR, and about 25% over
R-NUCA. While the improvements vary across benchmarks, DRR shows better performance in
almost all cases, indicating the good adaptivity of the reusability-based replication scheme. The authors also measured the L2
cache miss rate, shown in Figure 8. Compared to the baseline system, VR increases the L2 cache miss rate by about 162%,
ASR by about 91%, R-NUCA by about 67%, and DRR by about 48%.
Figure 6: Normalized average read latency
Figure 7: Normalized Execution time
Figure 8: Normalized L2 miss ratio
The locality-aware protocol provides better energy consumption and performance than the other LLC data management
schemes. It is important to balance on-chip data locality and off-chip miss rate, and overall an RT of 3 achieves the
best trade-off. It is also important to replicate all types of data; the selective replication of only certain types by R-NUCA
(instructions) and ASR (instructions and shared read-only data) leads to sub-optimal energy and performance. Overall,
the locality-aware protocol has 16%, 14%, 13%, and 21% lower energy and 4%, 9%, 6%, and 13% lower completion
time compared to VR, ASR, R-NUCA, and S-NUCA respectively.
Figure 9: Normalized Energy
Figure 10: Normalized Completion Time
Figure 11: Normalized L1 Cache Miss
5. Conclusions
For applications with working sets that fit within the LLC, even if replication is done on every L1 cache miss, the locality-
aware scheme performs well in both energy and performance. VR is good for applications with frequent accesses to shared
read-write data but has higher L2 cache energy than the other schemes. Applications with frequent accesses to instructions
and shared read-only data benefit from ASR; the locality-aware protocol, APR, and DRR perform almost the same in such
cases. Replication of migratory shared data requires the creation of a replica in an Exclusive coherence state; the locality-
aware protocol makes LLC replicas for such data when sufficient reuse is detected and hence performs well, and APR and
DRR also perform comparatively well for this kind of data. Applying the probability replication and probability insertion
policies to VR and ASR has been shown to perform better than the individual schemes. From the above analysis it can be
understood that the best choice depends on the kind of data the application uses most, and hence the replication policy has
to be chosen based on the kind of data. However, of the techniques discussed, the newly proposed APR, DRR, and
locality-aware replication schemes perform better in most cases than the existing ASR and VR. Only dynamic replication
schemes are discussed above; static replication schemes have their own benefits for certain types of applications, as can be
seen from the improved performance of R-NUCA in certain cases. A replication scheme should therefore be selected so
that it satisfies most of the needs of the CMP, i.e. as many applications as possible.
6. References
[1] G. Kurian, O. Khan, and S. Devadas. The locality-aware adaptive cache coherence protocol. In Proceedings of the
40th Annual International Symposium on Computer Architecture, ISCA ’13, pages 523–534, New York, NY, USA, 2013.
ACM.
[2] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar. Near-threshold voltage (NTV) design
opportunities and challenges. In Design Automation Conference, 2012.
[3] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes. Cache Hierarchy and Memory Subsystem of
the AMD Opteron Processor. IEEE Micro, 30(2), Mar. 2010.
[4] First the tick, now the tock: Next generation Intel microarchitecture (Nehalem). White Paper, 2008.
[5] C. Kim, D. Burger, and S. W. Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-
Chip Caches. In International Conference on Architectural Support for Programming Languages and Operating Systems,
2002.
[6] Zhang, M. and K. Asanovic, Victim replication: Maximizing capacity while hiding wire delay in tiled chip
multiprocessors. Proceedings -International Symposium on Computer Architecture, 2005: p. 336-345.
[7] Beckmann, B.M., M.R. Marty, and D.A. Wood, ASR: Adaptive selective replication for CMP caches. Proceedings of
the Annual International Symposium on Microarchitecture, MICRO, 2006: p. 443-454.
[8] Hardavellas, N., M. Ferdman, B. Falsafi and A. Ailamaki, Reactive NUCA: Near-Optimal Block Placement and
Replication in Distributed Caches. the 36th Annual International Symposium on Computer Architecture, 2009: p. 184-
195.
[9] Chang, J.C. and G.S. Sohi, Cooperative caching for chip multiprocessors. The 33rd International Symposium on
Computer Architecture, Proceedings, 2006: p. 264-275.
[10] Beckmann, B.M. and D.A. Wood, Managing wire delay in large chip-multiprocessor caches. Micro-37 2004: 37th
Annual International Symposium on Microarchitecture, Proceedings, 2004: p. 319-330.
[11] Kandemir, M., F. Li, M. J. Irwin, and S. W. Son, A novel migration-based NUCA design for Chip Multiprocessors.
in High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for. 2008.
[12] Jinglei Wang, Dongsheng Wang, Haixia Wang, and Yibo Xue, High Performance Cache Block Replication Using
Re-Reference Probability in CMPs. High Performance Computing (HiPC), 2011, 18th International Conference.
[13] Jinglei Wang, Dongsheng Wang, Haixia Wang, and Yibo Xue, Dynamic Reusability-based Replication with Network
Address Mapping in CMPs. High Performance Computing (HiPC), 2011, 18th International Conference.
[14] Kurian, G., Devadas, S., and Khan, O., Locality-Aware Data Replication in the Last-Level Cache. 20th International
Symposium on High Performance Computer Architecture, pp. 1-12, IEEE Press, New York, 2014.