
A Hybrid NoC Design for Cache Coherence Optimization for Chip Multiprocessors∗

Hui Zhao, Ohyoung Jang, Wei Ding
Department of Computer Science and Engineering, The Pennsylvania State University
{hzz105, oyj5007, wzd109}@cse.psu.edu

Yuanrui Zhang
Intel Inc.
yuanrui.zhang@intel.com

Mahmut Kandemir, Mary Jane Irwin
Department of Computer Science and Engineering, The Pennsylvania State University
{kandemir, mji}@cse.psu.edu

ABSTRACT
On-chip many-core systems, evolving from prior multi-processor systems, are considered a promising solution to the performance scalability and power consumption problems. The long communication distances in traditional multi-processor systems make directory-based cache coherence protocols better solutions than bus-based snooping protocols, even with the overheads of indirection. However, the much smaller distances between CMP cores enhance the reachability of buses, revitalizing the applicability of snooping protocols for cache-to-cache transfers. In this work, we propose a hybrid NoC design that provides optimized support for cache coherence. In our design, on-chip links can be dynamically configured either as point-to-point links between NoC nodes or as short buses that facilitate localized snooping. By taking advantage of the best of both worlds, bus-based snooping coherence and NoC-based directory coherence, our approach brings both power and performance benefits.

Categories and Subject Descriptors
B.4.3 [Interconnections (Subsystems)]: Topology; C.1.2 [Multiple Data Stream Architectures (Multiprocessors)]: Interconnection architectures

General Terms
Design, Management, Performance

Keywords
Multi-core, NoC, Cache Coherence, Bus

1. INTRODUCTION AND MOTIVATION
As modern fabrication technologies advance into the deep sub-micron era, chip multiprocessors (CMPs) are moving from multi-core to many-core architectures in order to fully take advantage of the increasing number of transistors available [1, 2].

∗This work is supported in part by NSF grants 1147388, 1152479, 1017882, 0963839, 0811687, and a grant from Microsoft.


Although many-core CMP systems exhibit similarities in many aspects to their predecessors, multi-processor systems, there are two major differences between these two types of systems. First, many-core systems are more constrained by the limited on-chip memories. Second, inter-core communication latencies are greatly reduced because of the short distances between on-chip cores. The first difference brings new challenges in designing an efficient memory system with limited capacity. The second difference opens up opportunities for reducing memory access latencies. A very important component of memory system design is the cache coherence protocol. Cache coherence protocols should not only ensure data access consistency, but should also have low performance and energy overheads. In this direction, several techniques have been proposed, such as tree-based coherence and token coherence [3, 4, 5].

In this paper, we investigate new schemes to optimize cache coherence by taking advantage of the short communication distances in many-core CMPs. Specifically, we propose a novel network-on-chip (NoC) architecture that can configure on-chip links into high-speed snooping buses in order to support cache-to-cache data transfer. Conventional multi-processor systems employ directory-based cache coherence protocols because bus-based snooping coherence does not scale to high core counts. Bus-based snooping is not considered a good option for many-core CMPs due to similar scalability concerns. However, when an L1 access misses under directory-based coherence, the directory has to be accessed to obtain the sharer information, resulting in extra delays and power overheads. We observe that, when multiple cores are running threads of the same application, there exist opportunities for a core to find data sharers in its neighborhood (nearby cores). To exploit such opportunities, we propose a scheme that connects several point-to-point links of the NoC together to form short-ranged snooping buses. When a core does not find the requested data in its L1 cache, it first snoops opportunistically for a copy of the data in the L1 caches of nearby cores. Indirections to the directory are avoided if such snoops result in hits. Consequently, the directory is accessed only when snooping cannot find a copy of the data. Our proposed snooping scheme has the advantage of short latencies and low power overhead because snooping messages are transferred on buses instead of through NoC routers.

Our other important observation is that the effectiveness of our local snooping scheme is closely related to the application mapping.

Figure 1: Snoop coherence in case of an L1 miss. (a) snoop hit; (b) snoop miss.

Fixed buses cannot exploit communication locality if the data sharers are not mapped to the cores connected by a bus. In order to decrease the dependence of snooping effectiveness on the application mapping, we propose novel schemes to dynamically build snoop buses, in an on-demand fashion, based on the locality of data sharers. We make the following contributions in this paper:

∙ We propose a hybrid approach to cache coherence that employs reconfigurable snooping buses to reduce the effect of application mappings on snooping effectiveness.

∙ We propose dynamically constructing snooping buses by connecting on-chip point-to-point links together. This technique can reduce the hardware overhead without compromising on-chip bandwidth. To the best of our knowledge, we are the first to propose such a technique.

∙ We provide a detailed design of the snooping bus, including the bus arbiter and bus switch interfaces.

∙ We design clustering and bus-building algorithms to group data-sharing cores into local groups and build short buses to facilitate the broadcasting of snooping messages (presented in the supplemental section).

2. DESIGN OF CACHE COHERENCE
Our cache coherence proposal is built upon the principle that, before sending requests to the directory, a core running a parallel program first snoops cores in its vicinity for shared data. If a snooped core can provide the requested data, which we call a snoop hit, a cache-to-cache transfer is performed between the data requester and the provider. Such cache-to-cache data transfers are made possible by connecting the point-to-point links of the NoC to form snooping buses. Our design involves two mechanisms: a global directory-based protocol and a local snooping protocol. We optimistically group data-sharing cores that are located close to one another into a localized snooping cluster. Within such a cluster, snooping coherence is employed to facilitate cache-to-cache data transfer without involving the global (chip-wide) directory. Globally, among different clusters, cache coherence is maintained through structures similar to traditional directories.

2.1 A Walkthrough Example
Figure 1 shows the detailed behavior of our coherence protocol in the case of an L1 miss. Node 2 issues a load to access a cache line with tag 0xabc, but the access results in a miss. Instead of sending a request to the directory node, it first sends a snooping query on the locally connected snoop bus. In this case, three other cores are connected by the snooping bus in the same local cluster as core 2: cores 0, 1 and 3. All other cores search their L1 caches for the tag of the requested cache line. If another core in this cluster has a valid copy of the cache line, for example core 0, then core 0 puts the data on the bus. After a few cycles, core 2 can grab the data from the bus and save it to its own L1 cache, concluding a cache-to-cache data transfer. If, on the other hand, none of the snooped cores has a valid copy of the sought cache line, then core 2 needs to send a request to the directory (node 20), as in directory-based protocols. Since our coherence protocol involves both snooping and directory-based cache coherence, we need to pay extra attention to coordinating state changes both locally and globally. We create a new state, shared-exclusive, to identify this locally shared but globally exclusive status. If a core shares globally exclusive data locally, both this core and its sharers need to change to the shared-exclusive state; otherwise, its state remains exclusive. In a similar manner, modified states need to distinguish whether or not the only copy in the directory's view has actually been shared inside the snooping cluster. So, we also add a new state, called shared-modified, to handle this situation.
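To make the state handling concrete, the following is a minimal Python sketch of the load-miss path described above. The helper names (cluster.peers(), l1.lookup(), directory.request(), and so on) are our own illustration, not the authors' implementation; the state transitions follow the walkthrough.

```python
from enum import Enum, auto

class State(Enum):
    M = auto()   # Modified
    E = auto()   # Exclusive
    S = auto()   # Shared
    I = auto()   # Invalid
    SE = auto()  # shared-exclusive: shared in the cluster, exclusive globally
    SM = auto()  # shared-modified: shared in the cluster, modified globally

def handle_load_miss(requester, tag, cluster, directory):
    """L1 load miss: snoop the local cluster first, then fall back to the directory."""
    for core in cluster.peers(requester):
        line = core.l1.lookup(tag)
        if line is not None and line.state != State.I:
            # Snoop hit: cache-to-cache transfer on the bus, no directory indirection.
            requester.l1.fill(tag, line.data)
            if line.state in (State.E, State.SE):
                # Globally exclusive data is now also shared locally.
                line.state = State.SE
                requester.l1.set_state(tag, State.SE)
            elif line.state in (State.M, State.SM):
                line.state = State.SM
                requester.l1.set_state(tag, State.SM)
            else:  # a plain Shared line stays Shared
                requester.l1.set_state(tag, State.S)
            return
    # Snoop miss: fall back to the conventional directory protocol.
    directory.request(requester, tag)
```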

2.2 Writes and Invalidations
If an L1 cache gets a write hit, the new state is decided based on the cache line's current state. For the exclusive or modified state, the operations needed are the same as in a conventional directory-based protocol. If the state is shared (meaning there may be shared data copies outside the snooping cluster), an invalidation message has to be sent to the directory to nullify all other possible sharers. If the cache line state is shared-exclusive or shared-modified, there are several copies of the cache line in the same snooping cluster, but, globally, there is no copy of the data in any core outside this core's snooping cluster. As a result, the core only needs to broadcast locally to invalidate the other sharers in the same cluster before modifying the data. In this case, no involvement of the global directory is needed and an indirection to the directory is avoided.

In the only remaining case, where an L1 write returns a miss, our coherence protocol also tries to avoid indirections to the directory. Before going to the directory to fetch the data, the requesting core first broadcasts a message to the local cores in the same snooping cluster. If any core has the data in the shared state, we invalidate the local sharers and access the directory to invalidate the global sharers. However, if another core holds the data in the modified or exclusive state, indicating that the only copy in the whole system is in the same cluster, then there is no need to go to the directory to get this information: the core can invalidate the other copy and change its own state to modified. If some cores in the same snooping cluster hold the line in the shared-exclusive or shared-modified state, all copies of the data line are in this cluster. Therefore, the requester can invalidate all other local sharers and change its own state to modified, and no message needs to be sent to the directory.
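Continuing the sketch above, the write paths can be expressed as follows; again, the helpers (directory.invalidate_sharers(), cluster.broadcast_invalidate(), ...) are hypothetical stand-ins for the message exchanges the text describes.

```python
def handle_write_hit(core, tag, cluster, directory):
    state = core.l1.state(tag)
    if state in (State.M, State.E):
        # Same operations as a conventional directory-based protocol.
        core.l1.set_state(tag, State.M)
    elif state == State.S:
        # Copies may exist outside the cluster: indirect through the directory.
        directory.invalidate_sharers(core, tag)
        core.l1.set_state(tag, State.M)
    elif state in (State.SE, State.SM):
        # All copies are inside this cluster: a local broadcast suffices,
        # avoiding the indirection to the directory.
        cluster.broadcast_invalidate(core, tag)
        core.l1.set_state(tag, State.M)

def handle_write_miss(core, tag, cluster, directory):
    for peer in cluster.peers(core):
        line = peer.l1.lookup(tag)
        if line is None or line.state == State.I:
            continue
        core.l1.fill(tag, line.data)              # cache-to-cache transfer
        cluster.broadcast_invalidate(core, tag)   # nullify in-cluster copies
        if line.state == State.S:
            # Sharers may also exist outside the cluster.
            directory.invalidate_sharers(core, tag)
        # For M, E, SE, SM the only copies system-wide are in this cluster,
        # so no directory message is needed.
        core.l1.set_state(tag, State.M)
        return
    # No local copy at all: conventional directory request for ownership.
    directory.request_for_ownership(core, tag)
```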


Figure 2: Proposed hardware design. (a) Link sharing between bus and NoC. (b) Bus switch/router design.

2.3 Organizing Local Clusters in the Directory
In conventional MESI protocols, silent evictions of exclusive and shared lines are performed in order to avoid the bandwidth overhead of notifying the directory. In our proposed protocol, when replacing an L1 line, silent evictions are still valid for the shared, exclusive and shared-exclusive states. However, for the shared-modified state, we need to ensure that modified data are written to the next-level cache properly. Our policy for replacing cache lines in the shared-modified state is as follows: the cache first snoops for sharers in the local cluster; if a sharer is found, the line can be evicted silently. If no other core in the cluster holds the line, i.e., this is the only core holding the modified data in the cluster, a writeback needs to be performed, just as when replacing a cache line in the modified state.

In our design, we assume a cluster is a group of cores that are running threads belonging to the same application. When a new application's threads are mapped to CMP cores, we create local clusters of cores running that application's threads. Such cluster structures exist until the application finishes its execution. In traditional directory-based cache protocols, there is a bit map for each cache line associated with one directory: if a core's L1 cache has valid data for the cache line, the corresponding bit is set to 1. In our protocol, on the other hand, since we group cores that possibly share data into clusters, each cluster needs only one representative in the directory. From the directory's point of view, the cores inside a cluster behave like just one core, because all copies of the data are consistent inside a cluster. That is, either only one core in the cluster holds the data, or all copies are exactly the same.
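The replacement policy for the new states can be summarized with a small sketch (same hypothetical helpers as before; cluster.has_other_sharer() stands in for the local snoop the text describes):

```python
def evict_l1_line(core, tag, cluster, l2):
    state = core.l1.state(tag)
    if state in (State.S, State.E, State.SE):
        core.l1.drop(tag)                         # silent eviction, as in MESI
    elif state == State.SM:
        if cluster.has_other_sharer(core, tag):   # snoop the local cluster first
            core.l1.drop(tag)                     # another in-cluster copy survives
        else:
            l2.writeback(tag, core.l1.data(tag))  # last holder: write back, like M
            core.l1.drop(tag)
    elif state == State.M:
        l2.writeback(tag, core.l1.data(tag))
        core.l1.drop(tag)
```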

3. CONSTRUCTING SNOOPING BUSES FROM NOC LINKS

3.1 Reusing NoC links to build snooping buses
Buses are widely used in off-chip interconnects. However, with the growing discrepancy between wire delay and gate delay, buses do not scale well enough to be employed in many-core systems. Consequently, for on-chip communication, packet-switched networks-on-chip are considered a better solution. However, the power consumption of such NoC networks can be very high because the routers are power hungry, due to internal components such as buffers and crossbars. It has been observed that the on-chip routers can consume up to 40% of the network power [10].

Our belief is that, even if buses are not suitable as global connections, there are still advantages to employing them as local connections. When applied to a subarea of the NoC, buses have the advantages of shorter delays and lower energy consumption compared to router-based NoCs. In fact, several previous works have reconsidered the application of buses to connect on-chip cores [8, 10, 11].

In this work, we propose to use dynamically configured buses to support localized cache snooping. The advantage of doing so is threefold: first, buses have an inherent ordering property which can support cache access coherency; second, buses provide simultaneous broadcasting without accessing the cores one after another in a sequential manner; and third, buses can reduce transaction delays and consume less power compared to routers. Our baseline network is a router-based NoC connected by point-to-point links. We expand the functionality of the routers by adding bus switches to them, so that several links can be connected together to form a short-ranged snooping bus. At any given time, an on-chip link can be used either to transfer data packets between routers or to form part of a short snooping bus. As shown in Figure 2(a), cores 12, 13, 9, 10 and 6 are performing a snooping bus transaction on the dotted lines. At the same time, all other links in the NoC are available to transmit packets between the routers. In a conventional NoC, the links connecting routers are unidirectional; as a result, one router can both send and receive packets in one direction at the same time. In our proposed design, when links are connected to build snooping buses, all the wires are used as bi-directional links. Consequently, the snooping buses have double the bandwidth of the point-to-point links, which can offset the delays incurred by the increased length of the buses. By time-division multiplexing the on-chip links between buses and NoC links, our design reduces hardware overhead without compromising NoC bandwidth.

3.2 Bus Switch Design
In order to connect NoC links to build snooping buses, we need to extend the routers to provide bus switching functionality. Figure 2(b) shows our proposed link interconnection interface design. There are two major components: a packet routing component (R) and a bus switching component (S). The routing component is similar to a conventional NoC router, consisting of buffers, routing logic and a crossbar. The bus switching component is a programmable switch that can connect link segments (NoC links) to form a bus. Links from each direction connected to our routing/bus switching unit are controlled by programmable switches. As shown in Figure 2(b), links in the north and west directions are programmed to be used by a bus, whereas links in the south and east directions serve as NoC links. When links are used as part of a bus, packets need to stay inside the router until the bus transaction is over. This is similar to the scenario in conventional NoC routers where multiple packets compete for the same output link: packets that fail to be granted the output link stay in the buffer until the next chance. Our scheme uses the same mechanism to hold data packets while their output links are used by the bus.
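The behavior of the combined routing/bus-switching unit can be sketched as a simple functional model. This is our own illustration under assumed names (LinkMode, BusSwitch), not RTL from the paper:

```python
from enum import Enum

class LinkMode(Enum):
    NOC = "point-to-point routing"
    BUS = "segment of a snooping bus"

class BusSwitch:
    """Functional model of the routing/bus-switching unit in Figure 2(b).
    Each direction's link can be programmed independently."""
    def __init__(self):
        self.mode = {d: LinkMode.NOC for d in ("N", "S", "E", "W")}

    def program(self, direction, mode):
        self.mode[direction] = mode

    def try_send(self, packet, direction, bus_busy):
        # A packet waits in the router buffer while its output link is lent
        # to an active bus transaction -- the same mechanism a normal NoC
        # router uses when packets compete for one output link.
        if self.mode[direction] is LinkMode.BUS and bus_busy:
            return False   # stay buffered, retry next cycle
        return True
```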

3.3 Bus Arbiter Design
Another essential component of a bus design is the arbiter. The reason buses can work as a mechanism to provide transaction ordering is that users must take turns to control the bus. Such arbitration of bus control is implemented through bus arbiters.

Figure 3: Traditional centralized bus arbiter design and our proposed decentralized bus arbiter design. (a) centralized arbitration; (b) decentralized arbitration.

A typical arbiter design is shown in Figure 3(a). Each core connected to the bus first sends a request to the arbiter. The arbiter then makes a decision, based on some priority policy, to grant bus control to only one requester. It is easy to build such an arbiter if the bus configuration is fixed. However, this type of centralized arbiter does not fit our reconfigurable environment, since the bus topology changes from time to time based on the applications' mappings.

We propose to employ a decentralized bus arbitration mechanism [13]. As shown in Figure 3(b), there is no centralized arbiter. Instead, the arbitration logic is distributed across all bus switches and is also reconfigurable. The bus grant input indicates that the bus can be granted to a requestor. Bus busy indicates whether the bus is being used by some device, and the bus request line shows whether another device has made a request. Each device needs to make a request first. If bus busy is negated, the device negates its own bus grant out signal and waits to see if its grant in signal is asserted. If so, the device can grab the bus and assert the bus busy signal. The bus grant out signal also needs to be asserted afterwards, so that other devices can have their grant in set in order to compete in the next bus arbitration. A more detailed design description can be found in [13].
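A cycle-level sketch of this decentralized, daisy-chain style arbitration is shown below. It is a simplified model of the scheme in [13], with the signal handshake reduced to a single grant propagation per arbitration round; the names are ours.

```python
class ArbiterCell:
    """One distributed arbitration cell in the daisy chain of Figure 3(b)."""
    def __init__(self):
        self.requesting = False

    def evaluate(self, grant_in, bus_busy):
        """Returns (grant_out, wins_bus)."""
        if self.requesting and not bus_busy and grant_in:
            # Block the grant from propagating further and take the bus.
            return False, True
        # Pass the grant down the chain to lower-priority cells.
        return grant_in, False

def arbitrate(cells, bus_busy):
    """Propagate the grant through the chain; at most one cell wins."""
    grant = True            # grant is asserted at the head of the chain
    winner = None
    for i, cell in enumerate(cells):
        grant, wins = cell.evaluate(grant, bus_busy)
        if wins:
            winner = i
            bus_busy = True  # the winner asserts the busy line
    return winner

# Example: only switch 2 requests, so it wins the free bus.
cells = [ArbiterCell() for _ in range(4)]
cells[2].requesting = True
assert arbitrate(cells, bus_busy=False) == 2
```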

Similar to the data lines of the snooping buses, the bus arbitration control lines are segmented and can be programmed through the bus switches. The major arbitration lines are the request, busy and grant lines, as illustrated in Figure 3. Each data link is associated with a set of such segmented control lines. When the data links are configured to build connected buses, the corresponding control lines are connected in the same way to build a separate arbitration network. Then, each core can send its requests across the arbitration logic in a decentralized manner. All the arbitration logic on a snooping bus works together to decide the winner of bus control. Some components in our proposed scheme incur extra area overhead, such as the bus switches and the bus arbitration logic. However, this overhead comes from simple logic and wires, which is small compared to the rather complicated router structures (buffers and crossbars).

4. EXPERIMENTAL EVALUATION

4.1 Experimental Setup
We evaluate our proposed techniques using a trace-driven, cycle-accurate CMP simulator that has a built-in NoC network. We use GEMS [14] to generate traces from SPLASH [15] and SPEC OMP [16] benchmarks and feed the traces into our CMP simulator. Application threads are randomly mapped to CMP cores. Our baseline architecture has 64 cores organized as a 2D 8 by 8 mesh. Each NoC node consists of a core, a private write-back L1 cache and a tiled L2 bank. The default memory hierarchy uses a two-level directory-based MESI cache coherence protocol. Each router has two pipeline stages with an input buffer depth of four. Our NoC employs wormhole switching and virtual-channel flow control, and uses the deterministic X-Y routing algorithm to route packets. Table 1 provides the main parameters of our simulation platform.

We evaluate two types of snooping bus configurations. The first is called the fixed bus configuration, where the cores connected by a bus are fixed, no matter how the applications are mapped to the NoC nodes. In the second, called the dynamic bus configuration, we use our clustering and bus-building algorithms (described in detail in the supplemental section) to dynamically configure the buses that connect cores. For each bus configuration, we experiment with bus lengths of 4 links and 8 links. In the following discussion, we refer to these bus configurations as Fix-4, Fix-8, Dynamic-4 and Dynamic-8, respectively. Figure 4(a) shows two fixed bus configurations with lengths of 4 and 8, respectively. Figure 4(b) illustrates two dynamically constructed buses proposed by our scheme: the first bus connects 5 cores running threads of application 0, and the second bus connects 3 cores running threads of application 2.

Figure 4: Experimented bus configurations. (a) fixed bus configurations (lengths 4 and 8); (b) dynamic bus configurations (lengths 4 and 7).

We use CACTI 6.5 [17] to estimate the delay and power values for the links. The additional loading due to multiple senders and receivers is taken into account when obtaining these parameters. We use Orion [19] to obtain the router power. Both our bus structures and the NoC run at a frequency of 3 GHz. Table 2 gives our bus-related parameters.

Table 1: Baseline CMP configuration.
Processors: SPARC 3 GHz processors, two-way out-of-order, 64-entry instruction window
L1 cache:   64 KB private cache, 4-way set associative, 128B block size, 2-cycle latency, split I/D caches
L2 cache:   shared L2 cache with 1MB banks, 16-way set associative, 128B block size, 6-cycle latency, 32 MSHRs
Memory:     4 GB, 260-cycle off-chip access latency
NoC:        2-stage pipeline, 128-bit flits, 4 flits per packet, X-Y routing

Table 2: Energy and delay of buses and links.
Parameter            Link   Bus of 4 links   Bus of 8 links
Length (mm)          3.1    9.6              22.4
Delay (ns)           0.13   0.41             0.93
Dynamic energy (pJ)  0.93   2.88             6.51
Leakage power (mW)   0.03   0.09             0.21

4.2 Results

Figure 5: Normalized L1 cache miss latency compared to the baseline MESI protocol.
Figure 6: Snoop hit rates with different bus configurations (%).
Figure 7: IPC compared to a system using the MESI protocol.
Figure 8: Normalized network traffic compared to the baseline MESI protocol.
Figure 9: Normalized network energy consumption compared to a system using the MESI protocol.

Impact on Memory Latency. Figure 5 plots the impact of the localized snooping on L1 load miss latencies. We observe that, on average, the Dynamic-4 configuration reduces the load miss latency by about 10%. Dynamic-8 provides further improvements, lowering the load miss latencies by 20% on average, with a maximum reduction of 30% in the case of wupwise. These improvements are achieved by avoiding unnecessary indirections to the directory, as discussed in Section 2. In most cases, Dynamic-8 incurs the lower load miss latency, because a larger number of cores are snooped in this configuration.

Compared to the dynamically configured buses, the fixed buses experience larger miss latencies in most cases. In several cases, the snooping results for Fix-8 are even worse than for Dynamic-4. This shows that our dynamic buses perform better than fixed buses when the application mapping is not optimal. In such situations, a longer snooping distance does not necessarily guarantee more snoop hits, but instead increases the load latency because more cycles are spent on snooping transactions.

Snoop Hit Rates. Figure 6 shows the measured snoop hit rates of the experimented snoop bus configurations. Our dynamically configured buses improve the snoop hit rate compared to the fixed buses by up to 50%. There are some interesting cases where our Dynamic-4 configuration has a lower hit rate than the Fix-4 configuration, e.g., in apsi, barnes and lu. This is because our clustering and bus-building algorithms use local information to generate the groups of sharers to be snooped together. Also, since links are not reused between clusters, some nodes cannot get connected by the snooping buses in our dynamic scheme, even if this is possible under a fixed bus scheme. This effect is more pronounced when the bus length is short, as in Dynamic-4. However, in most cases, Dynamic-4 performs better than Fix-4 in terms of snoop hit rates. Note that a higher snoop hit rate does not necessarily indicate improved performance, because longer buses also incur more delay cycles.

Performance Improvements. Figure 7 shows the overall improvement in execution time of our proposed scheme compared to the baseline MESI directory protocol. Our dynamic bus configurations deliver performance improvements of up to 12%, with only barnes and apsi suffering a slight slowdown when the bus length is short. The reason is that short dynamic bus constructions sometimes generate isolated nodes, as explained in the snoop hit rate analysis. On average, the dynamic bus configuration improves performance by about 8%. This is significant considering the high L1 cache hit rates, which limit the impact of our optimizations on execution time.

When we compare the fixed and dynamic bus configurations, we can easily notice the advantages of the dynamic configurations. In the cases of art, ocean, cholesky and wupwise, both fixed bus configurations result in lower IPC than the baseline directory protocol. On average, the Fix-4 configuration performs worse than the baseline directory. This demonstrates that it is difficult to design effective snooping schemes without considering the effect of application mappings. These results also underline the importance of our dynamic bus structure, which works even when the application mapping is not ideal.

Impact on Network Traffic. As shown in Figure 8, except for cholesky and ocean, our proposed scheme reduces the network traffic by about 20%. This explains why performance still improves even though the snooping buses preempt NoC links, increasing the latency required for a data packet to reach its destination. Since snooping hits decrease the number of directory visits, the network traffic is also reduced. Even though some packets have to be held in buffers when their output links conflict with snooping buses, the distributed nature of our snooping buses means that the unoccupied links in the NoC can still transmit packets in other areas of the network at the same time. In addition, since an NoC already provides high bandwidth by packetizing data and routing it along different routes, losing some links for a few cycles does not affect network performance in most cases.

Energy Savings. Besides the performance improvements, our proposed snooping bus scheme can significantly reduce the network energy consumption. Figure 9 plots the normalized network energy consumption under the different schemes. For example, in barnes, the energy savings can be as much as 37% for the dynamic bus configurations. In most of our benchmarks, the proposed scheme saves between 15% and 20% of the energy. The only exceptions are cholesky and ocean. As analyzed in the performance discussion, these two benchmarks have comparatively low snoop hit rates; therefore, their overall energy consumption is even higher than under schemes without localized snooping. For the other benchmarks, even though bus transactions consume more power, our approach still benefits by avoiding the directory indirections. Even though longer buses consume more power for snooping transactions, the difference in energy consumption is rather small. This is because, compared to the energy-hungry routers, longer buses are more efficient as far as energy consumption is concerned.

5. RELATED WORK AND CONCLUSION
Cache coherence designs that exploit the proximity of data sharers have been proposed in [6, 7]. Barrow-Williams et al. [7] propose to add direct links in four directions of the NoC routers to snoop sharers among direct neighbors. However, their scheme depends on a specific application mapping to work and has higher hardware overhead. There have been several prior efforts on utilizing buses to optimize network-on-chip designs [8, 9, 10]. The difference between their work and ours is that we propose to use buses to optimize cache coherence protocols. Reconfigurable NoC designs have been proposed in [11, 12]: [11] designs reconfigurable bus-based networks based on inter-core data sharing, and Kim et al. [12] propose to reconfigure the network to suit specific application characteristics.

In this paper, we proposed a novel hybrid NoC architecture that takes advantage of both snooping and directory-based cache coherence protocols. We first investigated how application mappings can affect the performance of proximity snooping schemes. We then explained the design of a localized bus-based snooping cache coherence protocol. We also presented the design details of our configurable snooping buses built from the on-chip links. In order to reduce the dependency of local snooping on application mappings, we further designed two algorithms to dynamically group sharing cores into local clusters and to build buses connecting those cores. Our experimental results showed that the proposed techniques not only increase system performance but also reduce energy consumption.

6. REFERENCES
[1] Intel. From a few cores to many: A tera-scale computing research overview. http://download.intel.com/research/platform/terascale/terascale_overview_paper.pdf.
[2] W. J. Dally and B. Towles. Route Packets, Not Wires: On-Chip Interconnection Networks. DAC, 2001.
[3] N. Jerger, et al. Virtual Tree Coherence: Leveraging Regions and In-network Multicast Trees for Scalable Cache Coherence. MICRO, 2008.
[4] M. R. Marty and M. D. Hill. Coherence Ordering for Ring-based Chip Multiprocessors. MICRO, 2006.
[5] K. Strauss, et al. Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors. MICRO, 2007.
[6] J. A. Brown, et al. Proximity-Aware Directory-based Coherence for Multi-core Processor Architectures. SPAA, 2007.
[7] N. Barrow-Williams, et al. Proximity Coherence for Chip Multiprocessors. PACT, 2010.
[8] R. Das, et al. Design and Evaluation of Hierarchical On-Chip Network Topologies for Next Generation CMPs. HPCA, 2009.
[9] L. Cheng, et al. Interconnect-Aware Coherence Protocols for Chip Multiprocessors. ISCA, 2006.
[10] A. N. Udipi, et al. Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks. HPCA, 2010.
[11] S. Akram, et al. A Workload-Adaptive and Reconfigurable Bus Architecture for Multicore Processors. International Journal of Reconfigurable Computing, 2010.
[12] M. Kim, et al. Polymorphic On-Chip Networks. ISCA, 2008.
[13] A. S. Tanenbaum. Computer Networks. Prentice Hall, 1999.
[14] M. M. K. Martin, et al. Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset. SIGARCH Computer Architecture News, Nov. 2005.
[15] S. C. Woo, et al. The SPLASH-2 Programs: Characterization and Methodological Considerations. ISCA, 1995.
[16] V. Aslot, et al. SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance. Lecture Notes in Computer Science (WOMPAT 2001), 2001.
[17] N. Muralimanohar, et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0. MICRO, 2007.
[18] R. Mukherjee, et al. Thermal Sensor Allocation and Placement for Reconfigurable Systems. ICCAD, 2006.
[19] H. Wang, et al. Orion: A Power-Performance Simulator for Interconnection Networks. MICRO, 2006.


Figure 10: Stages (a)-(e) of our bisection clustering algorithm to build localized clusters of cores on chip. There are 16 cores running one application's threads on an 8 by 8 CMP. After 4 iterations of bisections, the 16 cores are grouped into 6 clusters.

APPENDIX

A. ALGORITHMS TO BUILD SNOOPING BUSES
In this section, we present our algorithms that group data-sharing cores into clusters and then build snooping buses to connect the cores inside a cluster. We assume that the information about which cores on the CMP are running threads of a certain parallel program is exposed to our approach. This information can either be retrieved at thread mapping time, before application execution, or be inferred from data sharing patterns collected by filters at run time. We first employ a recursive bisection algorithm similar to [18] to build clusters of sharers from the on-chip cores. We define the terms used by our algorithms as follows (a small code sketch of these metrics follows the list):

Cluster: A cluster, denoted by C(pid), is a rectangular region owned by the process pid. A cluster must contain at least one owner core.

Density of a cluster: The density of a cluster C(pid) is the ratio of the number of owner cores to the number of all cores in the cluster.

Path: A path, denoted by P_k(pid, cluster), is the k-th path, where pid is the process id of the owner and cluster is owned by the same process pid. A path forms a snooping bus that connects owner cores.

Length of a path: The length of a path is the maximum Manhattan distance between two cores on the path.

Weight of a path: The weight of a path is the number of owner cores on the path.
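These definitions translate directly into code. Here is a minimal sketch, with cores represented as (row, col) tuples and owner sets as Python sets (our own encoding, used again in the sketches below):

```python
def density(cluster_cores, owner_cores):
    """Density of a cluster: ratio of owner cores to all cores in it."""
    return len(owner_cores & cluster_cores) / len(cluster_cores)

def path_length(path):
    """Length of a path: maximum Manhattan distance between two of its cores."""
    return max(abs(a[0] - b[0]) + abs(a[1] - b[1]) for a in path for b in path)

def path_weight(path, owner_cores):
    """Weight of a path: number of owner cores on the path."""
    return sum(1 for core in path if core in owner_cores)
```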

A.1 Clustering Algorithm
Our clustering algorithm recursively forms rectangles that contain cores running threads of the same application, which we call the owner of the cluster. Clusters owned by one application do not overlap with each other, but clusters owned by different applications may overlap. During the clustering procedure, we identify a core by its location on the chip as N_{r,c}, where r and c are, respectively, the row and column indices of the core on the chip. The range of a cluster is represented by {row_b, col_b, row_e, col_e}, where the subscripts b and e denote begin and end, respectively. Initially, a single cluster encloses all cores on the chip. Given the maximum edge length of a cluster as input, the clusters are recursively tightened and each cluster is divided into two. This procedure continues until no cluster has an edge longer than the maximum cluster size D.

Figure 10 illustrates a clustering example for one application with 16 threads running on an 8 by 8 NoC. The maximum edge length D is three. The locations of the application's threads are represented by black squares. A solid rectangle encloses a cluster, and a dashed line represents a bisection edge; a cluster region with a dashed line is going to be divided into two clusters in the next iteration. At the initialization stage, there is only one cluster, enclosing all cores on the chip. The cluster remains the same after tightening, since each of its four edges is touched by at least one owner core. During the first iteration, six possible bisection points are explored, three for the vertical edges and three for the horizontal edges. The bisection clustering algorithm then selects one bisection point to divide the initial cluster into two clusters, as shown by the dashed line. At the second iteration, two bisection points are found inside both clusters, since both have edges longer than D. Both clusters are further divided into two clusters each. This procedure continues until no cluster has edges longer than D. Algorithm 1 describes our clustering algorithm in detail.

Algorithm 1 Bisection Clustering
INPUT: A 2D array of process-core mapping M and its region R
INPUT: Process id pid and maximum diameter D
1:  Create an empty queue R_i
2:  Create an empty list R_o
3:  Enqueue Tighten(R) into R_i
4:  while R_i is not empty do
5:    rec <- dequeue(R_i)
6:    Set Len_h to the length of the horizontal edge of rec
7:    Set Len_v to the length of the vertical edge of rec
8:    if Len_h <= D and Len_v <= D then
9:      Append rec to R_o
10:     Continue
11:   end if
12:   if Points(rec, M) = 0 then
13:     Continue
14:   end if
15:   rec1, rec2 <- BisectClustering(rec, M)
16:   Enqueue Tighten(rec1) into R_i
17:   Enqueue Tighten(rec2) into R_i
18: end while
OUTPUT: A set of clusters (R_o)
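As a concrete reference, here is a compact Python rendering of Algorithm 1. It is a sketch under simplifying assumptions: clusters are (row_b, col_b, row_e, col_e) tuples, edge lengths are counted in cores, and BisectClustering is reduced to a midpoint cut across the longer edge rather than the paper's search over candidate bisection points.

```python
from collections import deque

def tighten(rect, owners):
    """Shrink rect = (rb, cb, re, ce) until every edge touches an owner core."""
    rb, cb, re, ce = rect
    pts = [(r, c) for (r, c) in owners if rb <= r <= re and cb <= c <= ce]
    if not pts:
        return rect
    rows = [r for r, _ in pts]
    cols = [c for _, c in pts]
    return (min(rows), min(cols), max(rows), max(cols))

def bisect_cluster(rect):
    """Midpoint stand-in for the paper's BisectClustering step."""
    rb, cb, re, ce = rect
    if re - rb >= ce - cb:              # taller than wide: cut horizontally
        mid = (rb + re) // 2
        return (rb, cb, mid, ce), (mid + 1, cb, re, ce)
    mid = (cb + ce) // 2                # wider than tall: cut vertically
    return (rb, cb, re, mid), (rb, mid + 1, re, ce)

def bisection_clustering(owners, chip_rect, D):
    """Recursively bisect and tighten until no cluster edge exceeds D."""
    work = deque([tighten(chip_rect, owners)])
    clusters = []
    while work:
        rect = work.popleft()
        rb, cb, re, ce = rect
        if not any(rb <= r <= re and cb <= c <= ce for (r, c) in owners):
            continue                    # Points(rec, M) = 0: drop the rectangle
        if (re - rb + 1) <= D and (ce - cb + 1) <= D:
            clusters.append(rect)       # small enough: emit as a cluster
            continue
        half1, half2 = bisect_cluster(rect)
        work.append(tighten(half1, owners))
        work.append(tighten(half2, owners))
    return clusters

# Example: owner cores of one application on an 8x8 mesh, max edge D = 3.
owners = {(0, 1), (1, 2), (1, 6), (3, 3), (4, 0), (6, 5), (7, 7)}
print(bisection_clustering(owners, (0, 0, 7, 7), D=3))
```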

A.2 Bus Building Algorithm
After grouping data-sharing cores into clusters, we use our bus building algorithm (Algorithm 2) to connect related cores within a cluster into snooping buses. The inputs of this algorithm are the clusters found by the bisection clustering algorithm and the maximum desired length of a snooping bus. The outputs are the buses to be used for cache snooping.

Algorithm 2 Bus Building Algorithm
INPUT: A set of clusters CS
INPUT: Maximum length of a bus L
1:  Compute the density of all clusters: D(c) = (# of cores of the corresponding pid) / (# of all cores in the cluster)
2:  Build a graph G, where a vertex is a core and an edge is an unoccupied link between two adjacent cores
3:  while CS is not empty do
4:    Pick the densest cluster c in CS
5:    Set pid to the process id of c
6:    Build a list N of cores that run process pid
7:    Build an empty list buses
8:    for each core in N do
9:      BuildBus(core, G, L)
10:   end for
11:   Select the bus P_opt with the maximum number of cores running process pid
12:   Append P_opt to buses
13:   Remove the edges that belong to P_opt from G
14:   if not all cores in N belong to P_opt then
15:     Build a core list N_r containing the cores in N that do not belong to P_opt
16:     Create a new minimum-size cluster c_r enclosing all cores in N_r
17:     Compute the density of c_r and insert it into CS at the appropriate position
18:   end if
19: end while
OUTPUT: All buses built

Algorithm 2 describes our bus building strategy in detail. First, we compute the densities of the clusters found by the bisection clustering algorithm. Next, a graph G is built whose vertices are the cores on the chip and whose edges are the links connecting two adjacent cores; edges are removed from G once they are assigned to a specific bus. The algorithm always starts with the cluster c of maximum density. The BuildBus function then builds candidate buses for the owner cores in cluster c, and the path P_opt with the largest weight is selected. The links belonging to P_opt are then removed from G to ensure that the links of two different buses do not overlap. Owner cores on the path P_opt are marked as non-owners of cluster c before the cluster is tightened. If the region of cluster c is fully connected by buses, it is removed from the cluster set CS. The procedure continues until the cluster set CS becomes empty.
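The following Python sketch mirrors the overall loop of Algorithm 2. It is illustrative only: BuildBus itself is not fully specified in the paper, so build_bus() below substitutes a simple greedy walk toward the nearest unreached owner core, and the re-clustering of leftover cores (lines 14-17 of Algorithm 2) is omitted.

```python
def neighbors(core, rows=8, cols=8):
    r, c = core
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= r + dr < rows and 0 <= c + dc < cols:
            yield (r + dr, c + dc)

def build_bus(start, owners, free_links, max_len):
    """Greedy stand-in for BuildBus: grow a path over unoccupied links,
    stepping toward the nearest owner core not yet on the path."""
    path, cur = [start], start
    while len(path) - 1 < max_len:
        todo = [o for o in owners if o not in path]
        if not todo:
            break
        goal = min(todo, key=lambda o: abs(o[0] - cur[0]) + abs(o[1] - cur[1]))
        steps = [n for n in neighbors(cur)
                 if n not in path and frozenset((cur, n)) in free_links]
        if not steps:
            break
        cur = min(steps, key=lambda n: abs(n[0] - goal[0]) + abs(n[1] - goal[1]))
        path.append(cur)
    return path

def build_all_buses(clusters, max_len):
    """clusters: list of dicts {"pid", "owners": set, "cores": set},
    served densest-first; used links are retired so buses never overlap."""
    free_links = {frozenset((n, m)) for c in clusters for n in c["cores"]
                  for m in neighbors(n) if m in c["cores"]}
    buses = []
    for c in sorted(clusters, reverse=True,
                    key=lambda c: len(c["owners"]) / len(c["cores"])):
        candidates = [build_bus(o, c["owners"], free_links, max_len)
                      for o in c["owners"]]
        best = max(candidates, key=lambda p: sum(o in c["owners"] for o in p))
        buses.append(best)
        for a, b in zip(best, best[1:]):
            free_links.discard(frozenset((a, b)))   # retire the used links
    return buses
```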

B. IMPACT OF APPLICATION MAPPINGS ON THE EFFECTIVENESS OF SNOOPING BUSES

B.1 Mapping of parallel programs onto a CMP platform
In order to take advantage of the large number of cores available in a many-core system, programs are usually split into multiple threads that can be executed in parallel. The ideal mapping of an application's threads places them as close together as possible to exploit communication locality.

Figure 11: Two schemes of application mappings. (a) Mapping 1 of applications; (b) Mapping 2 of applications.

However, in reality, such mappings may not be possible. For example, consider a queue of parallel programs, each of which can be executed with a different number of threads (2, 4, 8, etc.). Initially, when most of the cores are available, we can map the threads of the same application to cores as close together as possible, resulting in mappings close to the ideal ones. Since the threads of different applications take different lengths of time to finish, holes (of available cores) are generated in the CMP over time. A similar scenario is dynamic memory allocation in operating systems, where holes of free memory appear when previously allocated memory is returned. In such cases, even if we still try to map an application's threads to cores next to each other, the resulting mapping will not be a close-knit group, but several local clusters scattered across the CMP.

As a result, if the bus configurations are fixed, as in [10] and [7], the effectiveness of the snooping will be compromised, since the number of possible sharers searched is not related to the bus length. In other words, even if we increase the length of the snooping buses, the number of sharers found may not increase accordingly. This motivated our work on building dynamic buses based on the locations of the cores sharing data. Because we intelligently build buses that connect sharers together, using the algorithms described in Section A, in our scheme longer buses are guaranteed to connect more data-sharing cores together.

B.2 Analysis of experimental results
We experimented with two application mappings in order to further validate our motivation. Figure 11 illustrates the two types of mappings. Mapping 1 depicts an ideal mapping, where all threads of an application are assigned to cores close to each other in each row. In Mapping 2, only a pair of threads of an application are placed next to each other per row. We use the two types of fixed buses illustrated in Figure 4(a) to snoop sharers.

Figures 12 and 13 plot the impact of application mappings on performance for both types of bus configurations. We observe that, when the application mapping is ideal, as in Mapping 1, the fixed-4 bus configuration achieves higher performance than fixed-8. In all of the benchmarks, the IPC of fixed-4 is higher than that of fixed-8. The reason is that, for ideal mappings, short buses are efficient enough to find data sharers and incur lower bus delays. The benchmark equake exhibits the most significant performance difference between these two bus configurations. Even though the longer buses still increase performance compared to the baseline directory, they are overkill for Mapping 1 and lead to higher snooping cost. However, if the mapping is not ideal, as in Mapping 2, longer buses bring more performance benefits than short buses, as shown in Figure 13. The performance of fixed-4 in some benchmarks is even worse than the baseline directory, such as in art, barnes and fma3d.

Figure 12: Normalized IPC with mapping 1.
Figure 13: Normalized IPC with mapping 2.
Figure 14: Load miss latency with mapping 1.
Figure 15: Load miss latency with mapping 2.

This is because the short buses are not able to find sharers, and so add extra delays on top of the directory indirections.

The impact of application mappings on load miss latency is shown in Figures 14 and 15. We observe that if the mapping is ideal, the fixed-4 buses reduce the load miss latency by about 20% on average compared to the fixed-8 configuration. However, in Mapping 2, the load miss latency of fixed-4 is 12% higher than that of the fixed-8 buses. Even though fixed-8 buses have longer delays in bus transactions, their overall miss latency is still lower than that of fixed-4 for Mapping 2.

Our results in this section show that the effectiveness of snooping buses not only depends on the lengths of the buses, but is also affected by application mappings. Fixed buses cannot adjust to flexible application mappings, and thus cannot bring guaranteed performance benefits. On the contrary, our proposed scheme takes the application mapping into consideration when constructing snooping buses. As a result, our scheme can improve performance and reduce snooping overhead at the same time.