
SURVEY ON CACHE REPLICATION MECHANISMS

BY

LAKSHMI YASASWI KAMIREDDY

(651771619)


CONTENTS

Abstract

1. Introduction

2. Background

3. Schemes

3.1. Victim Replication

3.2. Adaptive Selective Replication

3.3. Adaptive Probability Replication

3.4. Dynamic Reusability based Replication

3.5. Locality Aware Data Replication

4. Results

5. Conclusions

6. References


Abstract

Present-day systems create a high demand for multicore processors. As the number of cores on a Chip Multi-Processor (CMP) increases, so does the need for effective management of the cache. Cache management plays an important role in improving performance, which is achieved by reducing the number of misses and the miss latency. These two factors, the number of misses and the miss latency, cannot be reduced at the same time. Some CMPs use a shared L2 cache to maximize the on-chip cache capacity and minimize off-chip misses, while others use private L2 caches, replicating data to limit the delay due to global wires and minimize cache access time. Recent hybrid proposals use selective replication to strike a balance between miss latency and on-chip capacity. There are two kinds of replication: static replication and dynamic replication. This paper focuses on the existing dynamic replication schemes and gives an analysis of each scheme on several benchmarks.

1. Introduction

Upcoming generations of multicore processors and applications will operate on massive data. A major challenge for near-future multicore processors is the data movement incurred by conventional cache hierarchies, which has a very high impact on off-chip bandwidth, on-chip memory access latency, and energy consumption. A large on-chip cache is possible, but it is not a scalable solution; it is limited to a small number of cores, and hence the only practical option is to physically distribute memory in pieces so that every core is near some portion of the cache. Such a solution might provide a large amount of aggregate cache capacity and fast private memory for each core, but at the same time it is difficult to manage the distributed cache and network resources efficiently, as they require architectural support for cache coherence and consistency under the ubiquitous shared memory model. Most directory-based protocols enable fast local caching to exploit data locality, but even they have scalability issues. Some of the most recent proposals have addressed the issue of directory scalability in single-chip multicores using sharer compression techniques or limited directories. But the fast private caches still suffer from two major problems: (1) due to capacity constraints, they cannot hold the working set of applications that operate on massive data, and (2) due to frequent communication between cores, data is often displaced from them [1]. This leads to increased network traffic and request rate to the last-level cache. On-chip wires do not scale at the same pace as transistors, because of which data movement not only impacts memory access latency, but also consumes more power due to the energy consumption of network and cache resources [2]. Though private LLC organizations (e.g., [3]) have low hit latencies, their off-chip miss rates are high in applications that have uneven distributions of working sets or exhibit high degrees of sharing (due to cache line replication). Shared LLC organizations (e.g., [4]), on the other hand, lead to non-uniform cache access (NUCA) [5] that hurts on-chip locality, but their off-chip miss rates are low since cache lines are not replicated. Several proposals have explored the idea of a hybrid LLC. Replication mechanisms have been proposed to balance access latency against cache capacity in hybrid L2 cache designs [6] [7]. Two types of replication approaches have been proposed: static [8, 9] and dynamic [10, 11, 12, 13, 14]. In static replication, a data block is placed through predefined address interleaving; therefore, the LLC banks that may contain that data block are fixed. The data placement of instruction pages in R-NUCA [8] and in S-NUCA [9] is static. In dynamic replication, a data block can be placed in any LLC bank. Victim Replication [10], Adaptive Selective Replication [11], Adaptive Probability Replication [12], Dynamic Reusability based Replication [13], and Locality Aware Data Replication at the Last Level Cache [14] fall into this category. These replication mechanisms have their own advantages and disadvantages. This paper presents an analysis of these dynamic replication schemes.

2. Background

Chronologically, the first dynamic replication mechanism among those mentioned above is Victim Replication (VR) [10], which is based on shared caches but tries to capture evictions from the local primary cache in the local L2 slice to reduce subsequent access latency to the same cache block. Victim replicas and global L2 cache blocks share L2 slice capacity. In VR, all primary cache misses must first check the local L2 tags in case there is a valid local replica. On a replica miss, the request is forwarded to the home tile. On a replica hit, the replica is invalidated in the local L2 slice and moved into the primary cache [10]. The next technique introduced is Adaptive Selective Replication (ASR) [11], which adopts a replication mechanism similar to VR but focuses on the capacity contention between replicas and global L2 cache blocks. ASR dynamically estimates the cost (extra misses) and benefit (lower hit latency) of replication and adjusts the number of receivable victims to avoid hurting L2 cache performance [11]. Another replication scheme, the Adaptive Probability Replication (APR) [12] mechanism, counts each block's accesses in the L2 cache slices and monitors the number of evicted blocks with different numbers of accesses, to estimate the re-reference probability of blocks in their lifetime at runtime. Using the predicted re-reference probability, APR adopts a probability replication policy and a probability insertion policy to replicate blocks at corresponding probabilities and insert them at an appropriate position, according to their re-reference probability [12]. In the same conference another mechanism named Dynamic Reusability-based Replication (DRR) [13] was introduced. DRR is a hybrid cache architecture that dynamically monitors the reuse pattern of cache blocks and replicates blocks with high reusability to appropriate L2 cache slices [13]. Replicas are shared by nearby cores through a fast lookup mechanism, Network Address Mapping, which records the location of the nearest replica in the network interfaces and forwards subsequent L1 miss requests to the replica immediately. This improves the performance of shared caches by exploiting reusability-based replication, the fast lookup mechanism, and replica sharing. The most recent technique introduced is the locality-aware selective data replication protocol for the last-level cache (LLC) [14]. This method gives lower memory access latency and energy by replicating only high-locality cache lines in the LLC slice of the requesting core, while simultaneously keeping the off-chip miss rate low. The approach relies on low-overhead yet highly accurate in-hardware runtime classification of data locality at the cache line granularity, and only allows replication for cache lines with high reuse [14]. A classifier captures the LLC pressure at the existing replica locations and the replication decision is adapted accordingly. The locality tracking mechanism is decoupled from the sharer tracking structures that cause scalability concerns in traditional coherence protocols. The following sections discuss the schemes in detail.

3. Schemes

3.1. Victim Replication (VR)

Victim Replication (VR) is a hybrid scheme that combines the large capacity of a shared L2 cache with the low hit latency of a private L2 cache. VR is primarily based on a shared L2 cache, but in addition tries to capture evictions from the local primary cache in the local L2 slice. Each retained victim is a local L2 replica of a line that already exists in the L2 of the remote home tile. When a miss occurs at the shared L2 cache, a line is brought in from memory and placed in the on-chip L2 at a home tile determined by a subset of the physical address bits, as in a shared L2 cache. The requested line is directly forwarded to the primary cache of the requesting processor. If the line's residency in the primary cache is terminated because of an incoming invalidation or writeback request, the usual shared L2 cache protocol is followed. If a primary cache line is evicted because of a conflict or capacity miss, a copy of the victim line is kept in the local slice to reduce subsequent access latency to the same line. A global line with remote sharers is never evicted in favor of a local replica, as an actively cached global line is likely to be in use. The VR replication policy will replace the following classes of cache lines in the target set, in descending priority order: (1) an invalid line; (2) a global line with no sharers; (3) an existing replica. If there are no lines belonging to these three categories, no replica is made and the victim is evicted from the tile as in a shared L2 cache [10]. If there is more than one line in the selected category, VR picks at random. All primary cache misses first check the local L2 tags in case there is a valid local replica. On a replica miss, the request is forwarded to the home tile. On a replica hit, the replica is invalidated in the local L2 slice and moved into the primary cache. When a downgrade or invalidation request is received from the home tile, the L2 tags are also checked in addition to the primary cache tags [10].
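
To make the replacement priority concrete, the following minimal Python sketch shows how a VR-style L2 slice could pick the line to overwrite with an incoming victim replica. The L2Line fields (valid, is_replica, sharers) are illustrative stand-ins for tag-array state, not the structures used in [10].

    import random
    from dataclasses import dataclass, field

    @dataclass
    class L2Line:
        valid: bool = False       # line holds data
        is_replica: bool = False  # local replica of a remote home line
        sharers: set = field(default_factory=set)  # remote sharers of a global line

    def choose_replica_victim(target_set):
        """Return the line in the target L2 set to overwrite with a victim replica,
        or None if no replica should be made (the victim is evicted as in shared L2)."""
        invalid    = [l for l in target_set if not l.valid]
        no_sharers = [l for l in target_set if l.valid and not l.is_replica and not l.sharers]
        replicas   = [l for l in target_set if l.valid and l.is_replica]
        # Descending priority: (1) invalid line, (2) global line with no sharers,
        # (3) existing replica; ties within a class are broken at random.
        for candidates in (invalid, no_sharers, replicas):
            if candidates:
                return random.choice(candidates)
        return None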

3.2. Adaptive Selective Replication (ASR)

Adaptive Selective Replication (ASR) obtains the optimum replication level by balancing the benefits of replication against its costs. L2 cache block replication improves memory system performance when the average L1 miss latency is reduced.

The following equation describes the average cycles for L1 cache misses, normalized by instructions executed:

\[
\frac{\text{L1 miss cycles}}{\text{Instruction}}
= \frac{P_{localL2}\, L_{localL2}}{\text{Instructions}/\text{L1 misses}}
+ \frac{P_{remoteL2}\, L_{remoteL2}}{\text{Instructions}/\text{L1 misses}}
+ \frac{P_{miss}\, L_{miss}}{\text{Instructions}/\text{L1 misses}}
\]

Px is the probability of a memory request being satisfied by the entity x, where x is the local L2 cache, a remote L2 cache, or main memory, and Lx is the latency of each entity [11]. The combination of the localL2 and remoteL2 terms represents the memory cycles spent on L2 cache hits, and the third term represents the memory cycles spent on L2 cache misses. Replication increases the probability that L1 misses hit in the local L2 cache; thus the PlocalL2 term increases and the PremoteL2 term decreases. Because a local L2 cache hit is tens of cycles faster than a remote L2 cache hit, the net effect of increasing replication is a reduction in cycles spent on L2 cache hits. However, more replication devotes more capacity to replica blocks, so fewer unique blocks exist on-chip, increasing the probability of L2 cache misses, Pmiss. If the probability of a miss increases significantly due to replication, the miss term will dominate, as the latency of memory is hundreds of cycles greater than the L2 hit latencies. Therefore, balancing these three terms is necessary to improve memory system performance.
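
As a quick illustration of how these three terms trade off, the snippet below evaluates the model for made-up probabilities and latencies; all numbers (15, 40, and 400 cycles, 50 instructions per L1 miss) are assumed example values, not figures from [11].

    def l1_miss_cycles_per_instr(p_local, p_remote, p_miss,
                                 l_local=15, l_remote=40, l_miss=400,
                                 instr_per_l1_miss=50):
        """Average L1-miss cycles per instruction from the three-term model above."""
        return (p_local * l_local + p_remote * l_remote + p_miss * l_miss) / instr_per_l1_miss

    # Replication shifts hits from remote to local slices but also raises P_miss a little.
    no_replication   = l1_miss_cycles_per_instr(p_local=0.20, p_remote=0.70, p_miss=0.10)
    some_replication = l1_miss_cycles_per_instr(p_local=0.45, p_remote=0.44, p_miss=0.11)
    print(no_replication, some_replication)  # whichever effect dominates decides the winner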

Optimal performance often arises from an intermediate replication level. Figure 1 graphically depicts this tradeoff. The Replication Benefit curve, Figure 1(a), illustrates the trend that increasing replication reduces L2 cache hit cycles. Due to the strong locality of shared read-only requests, a small degree of L2 replication can significantly reduce L2 hit cycles by moving many previously remote L2 hits into the local cache. In contrast, further increases in replication reduce L2 hit cycles only gradually, because fewer unique blocks on-chip lead to fewer total L2 hits. The Replication Cost curve, Figure 1(b), illustrates that increasing L2 replication increases the memory cycles spent on off-chip misses. The Replication Effectiveness curve, Figure 1(c), combines the benefit and cost curves and plots the total memory cycles. Because the benefit and cost curves are generally convex and have opposite slopes, the minimum of the Replication Effectiveness curve often lies between allowing all replications and no replications. ASR estimates the slopes of the benefit and cost curves to approximate the optimal replication level.

Figure 1: (a) Replication Benefit, (b) Replication Cost, and (c) Replication Effectiveness curves [11]

By dynamically monitoring the benefit and cost of replication, ASR attempts to achieve the optimal level of replication. ASR identifies discrete replication levels and makes a piecewise approximation of the memory cycle slope [11]. Thus ASR simplifies the analysis to a local decision of whether the amount of replication should be increased, decreased, or remain the same. Figure 1 illustrates the case where the current replication level, labeled C, results in HC hit cycles-per-instruction and MC miss cycles-per-instruction. ASR considers three alternatives: (i) increasing replication to the next higher level, labeled H; (ii) decreasing replication to the next lower level, labeled L; or (iii) leaving the replication unchanged [11]. To make this decision, ASR needs not only HC and MC, but also four additional hit and miss cycles-per-instruction values: HH and MH for the next higher level and HL and ML for the next lower level. To simplify the collection process, ASR estimates only the four differences between the hit and miss cycles-per-instruction: (1) the benefit of increasing replication (decrease in L2 hit cycles, HC - HH); (2) the cost of increasing replication (increase in L2 miss cycles, MH - MC); (3) the benefit of decreasing replication (decrease in L2 miss cycles, MC - ML); and (4) the cost of decreasing replication (increase in L2 hit cycles, HL - HC). By comparing these cost and benefit counters, ASR will increase, decrease, or leave unchanged the replication level.
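
The decision logic reduces to comparing these four deltas, as the following sketch shows; the argument names mirror the differences above, while epoch handling, counter saturation, and the actual replication levels of [11] are abstracted away.

    def asr_adjust(benefit_increase,  # HC - HH: hit cycles saved by replicating more
                   cost_increase,     # MH - MC: miss cycles added by replicating more
                   benefit_decrease,  # MC - ML: miss cycles saved by replicating less
                   cost_decrease,     # HL - HC: hit cycles added by replicating less
                   current_level, max_level):
        """Return the next replication level: one step up, one step down, or unchanged."""
        if benefit_increase > cost_increase and current_level < max_level:
            return current_level + 1   # replicating more is a net win
        if benefit_decrease > cost_decrease and current_level > 0:
            return current_level - 1   # replicating less is a net win
        return current_level           # stay at the current level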


3.3. Adaptive Probability Replication (APR)

This design is based on a distributed shared L2 cache design. To predict re-reference probability, APR adds a counter to each cache block to record and transfer the number of accesses. In APR, each tile stores the re-reference probability of blocks from other remote L2 cache slices in its network interface component using a simple lookup table called the Re-Reference Probability Buffer (RRPB) [12]. The RRPB keeps re-reference probability entries for all other L2 slices. A re-reference probability entry holds replication thresholds for different numbers of accesses; the replication thresholds indicate the re-reference probability of blocks with different numbers of accesses. In the local L2 slice, if there is an invalid block or the victim is not a shared global block, the replica is filled into the L2 cache slice; otherwise, the replication is abandoned. The insert position of the replica is determined by its corresponding re-reference probability. When a replica is accessed again, it is deleted from the local L2 cache slice and moved to the local L1 cache.

APR counts every access to L2 cache blocks and records the number of evicted blocks with different numbers of accesses, to estimate re-reference probability at runtime. For example, the re-reference probability of a block with N accesses is the ratio of the number of evicted blocks with more than N accesses to the number of evicted blocks with no fewer than N accesses. The estimation is only performed when a global block replacement occurs (not a replica replacement). The re-reference probabilities are updated to all other tiles at a certain interval (such as 10000 cycles) by attaching them to any response message. Because blocks from other remote L2 slices may be accessed in local L2 cache slices due to replication, each replica access also increments the corresponding counter associated with the block. When a replica is accessed, the associated counter is also moved to the L1 cache block. The counter values of blocks in L1 caches are sent back to the home L2 slice when the blocks are evicted, to accumulate the number of accesses.
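
A small sketch of this estimation is shown below; evicted_access_counts is an assumed histogram of access counts observed at global-block replacement, standing in for the hardware counters described in [12].

    from collections import Counter

    def rereference_probability(evicted_access_counts: Counter, n: int) -> float:
        """P(re-reference | n accesses) ~ (# evicted blocks with > n accesses) /
        (# evicted blocks with >= n accesses), per the estimation described above."""
        at_least_n  = sum(c for accesses, c in evicted_access_counts.items() if accesses >= n)
        more_than_n = sum(c for accesses, c in evicted_access_counts.items() if accesses > n)
        return more_than_n / at_least_n if at_least_n else 0.0

    # Example: histogram of access counts seen at global-block replacement (assumed data).
    hist = Counter({1: 500, 2: 300, 3: 150, 4: 50})
    print(rereference_probability(hist, 2))  # fraction of >=2-access blocks seen again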

Like ASR, a linear feedback shift register generates a pseudo-random number that is compared to the corresponding replication threshold. When an evicted L1 block passes through the network interface, APR captures this message and looks up the corresponding RRPB entry according to its address. If the replication threshold corresponding to the block's number of accesses is less than the generated random number, the block is sent to the local L2 slice; otherwise, it is evicted to the home L2 slice. In the local L2 cache slice, if there is an invalid block or the victim is not a shared global block, the block is inserted; otherwise, it is sent to the home L2 slice. Blocks with more accesses have higher re-reference probability. Probability insertion is implemented in APR according to the number of accesses of the replicated block, in which the number of accesses indicates the insert position. If the number of accesses of a block exceeds the way size, the block is inserted at the MRU position. The aim of probability insertion is to make blocks with lower re-reference probability survive for a shorter time.
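
The sketch below combines the probability replication and probability insertion decisions, assuming the replication threshold directly encodes the estimated re-reference probability and that a software random number stands in for the LFSR; the 8-way associativity is likewise an assumption.

    import random

    WAYS = 8  # assumed L2 associativity for this sketch

    def apr_handle_l1_victim(accesses: int, threshold: float):
        """Decide whether an evicted L1 block is replicated into the local L2 slice,
        and at which LRU-stack position it is inserted (0 = LRU end, WAYS-1 = MRU).
        Returns None when the block is sent to its home L2 slice instead."""
        # Probability replication: replicate with probability given by the threshold
        # (the hardware compares an LFSR value against the RRPB threshold).
        if random.random() >= threshold:
            return None
        # Probability insertion: the number of prior accesses picks the insert position;
        # blocks with at least WAYS accesses go straight to the MRU position.
        return min(accesses, WAYS - 1)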

3.4. Dynamic Reusability based Replication (DRR)

DRR dynamically replicates blocks with high reusability to other appropriate L2 cache slices and allows the replicas to be shared by nearby cores via a fast lookup mechanism [13]. A set-associative Core Access Counter Buffer (CACB) is used to determine which block should be replicated and the corresponding replication destination. For recently accessed blocks, the CACB records access counts for cores that are more than a certain number of hops (for example, 2 hops, which is also the smallest distance between the home slice and replica slices) away from the home slice. Thus, only 10 counters are needed in one CACB entry for a 16-core CMP. When the block receives a Read request from one core, the corresponding counter is incremented. For coherence reasons, when the block receives a Write request, all the counters of the block are reset to zero. In the CACB, a larger counter means higher reusability. When the maximum counter of a block reaches a certain threshold (for example, 5), the block is replicated to the slice corresponding to the maximum counter.
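
The counting and threshold logic of a CACB entry can be sketched as follows; the set-associative organization, entry allocation, and mesh-distance computation of [13] are abstracted away, and the threshold and hop values are the examples quoted above.

    from collections import defaultdict

    REPLICATION_THRESHOLD = 5  # example threshold from the text
    MIN_REPLICA_HOPS = 2       # only count cores at least this far from the home slice

    class CACBEntry:
        """Per-block read counters for distant cores, as kept in a CACB entry."""
        def __init__(self):
            self.counters = defaultdict(int)  # core id -> read count

        def on_read(self, core: int, hops_from_home: int):
            """Count a read; return the core whose slice should receive a replica, or None."""
            if hops_from_home < MIN_REPLICA_HOPS:
                return None
            self.counters[core] += 1
            best_core, best_count = max(self.counters.items(), key=lambda kv: kv[1])
            return best_core if best_count >= REPLICATION_THRESHOLD else None

        def on_write(self):
            """Writes reset all counters so stale replicas are not created (coherence)."""
            self.counters.clear()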

After the block to replicate and the destination have been determined, the home L2 cache slice sends a replication request to the destination. When the destination receives the replication request, it allocates cache space to hold the replica. If the destination has space available for the replica, it responds with an acknowledgement to the home L2 cache slice; otherwise, it responds with a fail message. Once the replication operation is completed, the replication destination is stored in a set-associative Replication Directory Buffer (RDB) in the home L2 slice. If the replication fails, the destination is not stored in the RDB. When a Read request reaches the home L2 cache slice, if the distance between the requesting core and the nearest replica is less than the given replica distance (for example, 3 hops in a 16-core CMP), the request is forwarded to the nearest replica; otherwise, the request is satisfied at the home L2 cache slice.

When the replica receives the forwarded request, it responds with data to the requesting core. When the data response message passes through the network interface of the requesting core, the replica's location is stored in a set-associative Network Address Mapping Buffer (NAMB). The NAMB is embedded in the network interface and records the locations of replicas that have serviced the core. When an L1 cache Read miss request passes through the network interface, it first searches the NAMB. If the NAMB hits, the request is forwarded immediately to the recorded replica location; otherwise, the request continues to the home L2 cache slice. For coherence maintenance, when an L1 cache Write miss request passes through the network interface, it is sent to the home L2 cache slice and does not search the NAMB. This ensures that write operations can be serialized at the unique home L2 cache slice. Because the NAMB is embedded in the network interface, its access latency can be hidden by other network interface operations.
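
A minimal sketch of the NAMB routing decision on an outgoing L1 miss is shown below; namb is modeled as a simple address-to-slice mapping, which glosses over the buffer's set-associative organization and replacement.

    def route_l1_miss(address: int, is_write: bool, namb: dict, home_slice_of):
        """Return the tile an outgoing L1 miss should be sent to, per the NAMB lookup."""
        if not is_write and address in namb:
            return namb[address]  # read miss hits the NAMB: go to the recorded replica
        # Writes (and NAMB misses) go to the home slice, keeping writes serialized there.
        return home_slice_of(address)

    # Example use with a trivial address-interleaved home function (assumed mapping).
    namb = {0x80: 3}                       # line 0x80 has a known replica at tile 3
    print(route_l1_miss(0x80, False, namb, lambda addr: addr % 16))  # -> 3
    print(route_l1_miss(0x80, True,  namb, lambda addr: addr % 16))  # -> home tile 0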

3.5. Locality Aware Data Replication at Last Level Cache

Run-length is defined as the number of accesses to a cache line (at the LLC) from a particular core before a conflicting access by another core or before it is evicted. The greater the number of accesses with high run-length, the greater the benefit of replicating the cache line in the requester's LLC slice. Instructions and shared data (both read-only and read-write) can be replicated if they demonstrate good reuse. It is also important to adapt the replication decision at runtime in case the reuse of data changes during an application's execution.

On an L1 cache read miss, the core first looks up its local LLC slice for a replica. If a replica is found, the cache line is inserted in the private L1 cache and a Replica Reuse counter at the LLC directory entry is incremented. The replica reuse counter is a saturating counter used to capture reuse information; it is initialized to '1' on replica creation and incremented on every replica hit. If a replica is not found, the request is forwarded to the LLC home location. If the cache line is not found there either, it is brought in from off-chip memory or the underlying coherence protocol takes the necessary actions to obtain the most recent copy of the cache line. A replication mode bit is used to identify whether a replica is allowed to be created for the particular core, and a home reuse counter tracks the number of times the cache line is accessed at the home location by that core. This counter is initialized to '0' and incremented on every hit at the LLC home location. If the replication mode bit is set to true, the cache line is inserted in the requester's LLC slice and the private L1 cache. Otherwise, the home reuse counter is incremented; if it has reached the Replication Threshold (RT), the requesting core is "promoted" and the cache line is inserted in its LLC slice and private L1 cache. If the home reuse counter is still less than RT, a replica is not created and the cache line is only inserted in the requester's private L1 cache [14].
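
The per-core classification at the home LLC slice can be sketched as follows; CoreState is a hypothetical stand-in for the replication mode bit and home reuse counter kept in the directory entry, and RT = 3 follows the value the results section identifies as the best trade-off.

    RT = 3  # Replication Threshold; the results section finds RT = 3 the best trade-off

    class CoreState:
        """Hypothetical per-core classifier state kept at the LLC home directory entry."""
        def __init__(self):
            self.replication_mode = False  # True once the core is a replica sharer
            self.home_reuse = 0            # hits served at the home location for this core

    def on_home_read(state: CoreState) -> bool:
        """Handle a read that reached the LLC home; return True if a replica should be
        created in the requester's LLC slice, False if only the L1 gets the line."""
        if state.replication_mode:
            return True                    # already classified as high locality
        state.home_reuse += 1
        if state.home_reuse >= RT:
            state.replication_mode = True  # promote: this and future requests get replicas
            return True
        return False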

On an L1 cache write miss for an exclusive copy of a cache line, the protocol checks the local LLC slice for a replica. If a replica exists in the Modified (M) or Exclusive (E) state, the cache line is inserted in the private L1 cache and the Replica Reuse counter is incremented. If a replica is not found or exists in the Shared (S) state, the request is forwarded to the LLC home location. The directory invalidates all the LLC replicas and L1 cache copies of the cache line, thereby maintaining the single-writer multiple-reader invariant. On an invalidation request, both the LLC slice and the L1 cache on a core are probed and invalidated. If a valid cache line is found in either cache, an acknowledgement is sent to the LLC home location. In addition, if a valid LLC replica exists, the replica reuse counter is communicated back with the acknowledgement. The locality classifier uses this information along with the home reuse counter to determine whether the core stays a replica sharer: if the (replica + home) reuse is greater than or equal to RT, the core maintains replica status, else it is demoted to non-replica status. When an L1 cache line is evicted, the LLC replica location is probed for the same address. If a replica is found, the dirty data in the L1 cache line is merged with it, else an acknowledgement is sent to the LLC home location. However, when an LLC replica is evicted, the L1 cache is probed for the same address and invalidated. An acknowledgement message containing the replica reuse counter is sent back to the LLC home location. If the replica reuse is greater than or equal to RT, the core maintains replica status, else it is demoted to non-replica status. After all acknowledgements are processed, the Home Reuse counters of all non-replica sharers other than the writer are reset to '0'. This has to be done since these sharers have not shown enough reuse to be "promoted". If the writer is a non-replica sharer, its home reuse counter is modified as follows: if the writer is the only sharer (replica or non-replica), its home reuse counter is incremented, else it is reset to '1'. This enables the replication of migratory shared data at the writer, while avoiding it if the replica is likely to be downgraded due to conflicting requests by other cores.
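
The reclassification performed when these acknowledgements arrive boils down to one comparison, sketched below; counter resets and the migratory-data special case for the writer are handled separately, as described above.

    RT = 3  # Replication Threshold, as above

    def keeps_replica_status(replica_reuse: int, home_reuse: int) -> bool:
        """On an invalidation or LLC-replica eviction, decide whether the core remains
        a replica sharer (True) or is demoted to a non-replica sharer (False)."""
        return (replica_reuse + home_reuse) >= RT

    # e.g. a replica hit twice plus one home hit keeps its status at RT = 3:
    assert keeps_replica_status(replica_reuse=2, home_reuse=1)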

4. Results

APR improves performance by 12% on average for the splash-2 benchmarks over the Baseline (shared cache design) and by 24% for the parsec benchmarks over the Baseline. VR displays similar performance in the splash-2 and parsec benchmarks: 5% over Baseline for splash-2 and 4% over Baseline for parsec. ASR shows performance similar to VR for splash-2, but is better than VR for parsec (15% over Baseline). R-NUCA obtains 2% and 8% performance gains for the splash-2 and parsec benchmarks respectively. This is because instructions exhibit strong locality and occupy little capacity in the splash-2 and parsec benchmarks. APR demonstrates stable performance improvement in the splash-2 and parsec benchmarks. In total, APR improves performance by 21% on average over the baseline, by 17% over VR, by 10% over ASR, and by 15% over R-NUCA. Replication schemes increase the miss rate of the L2 cache. Figures 4 and 5 show the normalized L2 cache miss ratio of the evaluated replication schemes for the splash-2 and parsec benchmarks respectively. APR increases the L2 miss ratio by as much as 49% for the splash-2 benchmarks and by 38% for the parsec benchmarks. Compared to VR and ASR, APR shows a lower L2 miss ratio. This comes from its replication filtering policy and replica insertion policy: the probability replication filtering policy reduces the contention for L2 cache capacity, and the probability insertion policy reduces the residency time of replicas. Both policies tend to reduce the impact on the limited L2 capacity caused by extra replicas.

Figure 2: Normalized Execution time for Splash-2 benchmarks

Figure 3: Normalized Execution time for Parsec Benchmark


Figure 4: Normalized Miss ratio for Splash-2 benchmarks

Figure 5: Normalized Miss ratio for Parsec benchmarks

DRR achieves lower read latency than the other techniques. Figure 6 shows the normalized L2 cache average read latency. VR, ASR, and R-NUCA do not reduce read latency relative to the Baseline, while DRR reduces read latency by 12%. Such results show that DRR takes full advantage of the benefits of replicas through the network address mapping mechanism; unnecessary extra search latency offsets the benefits of replicas in VR and ASR, and R-NUCA's instruction replication has limited benefit for the splash-2 and parsec benchmarks. Figure 7 shows the normalized execution time. As can be seen, DRR improves the total execution time of almost all benchmarks compared to the baseline system, VR, ASR, and R-NUCA. The maximum performance gain occurs on the dedup benchmark, with about a 69% performance improvement. The average performance improvement is about 30% over the baseline system, about 16% over VR, about 8% over ASR, and about 25% over R-NUCA. While the performance improvements vary across benchmarks, DRR shows better performance in almost all cases, indicating the good adaptivity of the reusability-based replication scheme. The L2 cache miss rate is shown in Figure 8: compared to the baseline system, VR increases the L2 cache miss rate by about 162%, ASR by about 91%, R-NUCA by about 67%, and DRR by about 48%.


Figure 6: Normalized average read latency

Figure 7: Normalized Execution time

Figure 8: Normalized L2 miss ratio


The locality-aware protocol provides better energy consumption and performance than the other LLC data management schemes. It is important to balance the on-chip data locality and the off-chip miss rate; overall, an RT of 3 achieves the best trade-off. It is also important to replicate all types of data: selective replication of only certain types of data, as done by R-NUCA (instructions) and ASR (instructions, shared read-only data), leads to sub-optimal energy and performance. Overall, the locality-aware protocol has 16%, 14%, 13%, and 21% lower energy and 4%, 9%, 6%, and 13% lower completion time compared to VR, ASR, R-NUCA, and S-NUCA respectively.

Figure 9: Normalized Energy

Figure 10: Normalized Completion Time


Figure 11: Normalized L1 Cache Miss

5. Conclusions

For applications with working sets that fit within the LLC even if replication is done on every L1 cache miss, the locality-aware scheme performs well in both energy and performance. VR is good for applications with a high proportion of accesses to shared read-write data, but has higher L2 cache energy than the other schemes. Applications with more accesses to instructions and shared read-only data benefit from ASR; the locality-aware protocol, APR, and DRR also perform almost the same in such cases. Replication of migratory shared data requires creating a replica in an Exclusive coherence state; the locality-aware protocol makes LLC replicas for such data when sufficient reuse is detected and hence performs well, and similarly APR and DRR also perform comparatively well for this kind of data. Applying the probability replication and probability insertion policies to VR and ASR has been shown to give better performance than the individual schemes. From the above analysis it can be understood that the best choice depends on the kind of data the application uses the most, and hence the replication policy has to be chosen based on that kind of data. However, of the techniques discussed, the newly proposed APR, DRR, and Locality Aware Replication schemes perform better in most cases than the existing ASR and VR. Only dynamic replication schemes are discussed above; static replication schemes have their own benefits in certain types of applications, as can be seen from the improved performance of R-NUCA in certain cases. A replication scheme should therefore be selected in such a way that it satisfies most of the needs of the CMP, i.e., as many applications as possible.

6. References

[1] G. Kurian, O. Khan, and S. Devadas. The locality-aware adaptive cache coherence protocol. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13), pages 523-534, New York, NY, USA, 2013. ACM.

[2] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar. Near-threshold voltage (NTV) design: opportunities and challenges. In Design Automation Conference, 2012.

[3] P. Conway, N. Kalyanasundharam, G. Donley, K. Lepak, and B. Hughes. Cache hierarchy and memory subsystem of the AMD Opteron processor. IEEE Micro, 30(2), Mar. 2010.

[4] First the tick, now the tock: Next generation Intel microarchitecture (Nehalem). White Paper, 2008.

[5] C. Kim, D. Burger, and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2002.

[6] M. Zhang and K. Asanovic. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. In Proceedings of the International Symposium on Computer Architecture, pages 336-345, 2005.

[7] B. M. Beckmann, M. R. Marty, and D. A. Wood. ASR: Adaptive selective replication for CMP caches. In Proceedings of the Annual International Symposium on Microarchitecture (MICRO), pages 443-454, 2006.

[8] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In Proceedings of the 36th Annual International Symposium on Computer Architecture, pages 184-195, 2009.

[9] J. Chang and G. S. Sohi. Cooperative caching for chip multiprocessors. In Proceedings of the 33rd International Symposium on Computer Architecture, pages 264-275, 2006.

[10] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. In Proceedings of the 37th Annual International Symposium on Microarchitecture (MICRO-37), pages 319-330, 2004.

[11] M. Kandemir, F. Li, M. J. Irwin, and S. W. Son. A novel migration-based NUCA design for chip multiprocessors. In High Performance Computing, Networking, Storage and Analysis (SC 2008), 2008.

[12] Jinglei Wang, Dongsheng Wang, Haixia Wang, and Yibo Xue. High performance cache block replication using re-reference probability in CMPs. In High Performance Computing (HiPC), 18th International Conference, 2011.

[13] Jinglei Wang, Dongsheng Wang, Haixia Wang, and Yibo Xue. Dynamic reusability-based replication with network address mapping in CMPs. In High Performance Computing (HiPC), 18th International Conference, 2011.

[14] G. Kurian, S. Devadas, and O. Khan. Locality-aware data replication in the last-level cache. In 20th International Symposium on High Performance Computer Architecture, pages 1-12. IEEE Press, New York, 2014.