An Efficient Hybrid Cache Coherence Protocol for Shared Memory Multiprocessors


1996 International Conference on Parallel Processing

An Efficient Hybrid Cache Coherence Protocol for Shared Memory Multiprocessors*

Yeimkuan Chang and Laxmi N. Bhuyan
Department of Computer Science
Texas A&M University
College Station, Texas 77843-3112
E-mail: {ychang, bhuyan}@cs.tamu.edu

Abstract - This paper presents a new tree-based cache coherence protocol which is a hybrid of the limited directory and the linked list schemes. By utilizing a limited number of pointers in the directory, the proposed protocol connects the nodes caching a shared block in a tree fashion. In addition to the low communication overhead, the proposed scheme also retains the advantages of the existing bit-map and tree-based linked list protocols, namely, scalable memory requirement and logarithmic invalidation latency. We evaluate the performance of our protocol by running four applications on an execution-driven simulator. Our simulation results show that the performance of the proposed protocol is very close to that of the full-map directory protocol.

1 Introduction

Several cache coherence schemes have been proposed to solve the cache consistency problem in shared memory multiprocessors [1]. Most of the popular cache coherence protocols are based on snooping on the bus that connects the processing elements to the memory modules [2]. But the obvious limitation to such schemes is the limited number of processors that can be supported by a single bus. The single bus becomes the bottleneck in the system. To make shared memory multiprocessors scalable with respect to a large number of processors, non-bus-based networks such as point-to-point networks and multistage interconnection networks are normally employed. Since the broadcast procedure generates a lot of traffic on such networks, non-broadcast directory protocols are used to implement cache coherence. Full-map and linked list schemes are the two categories of directory protocols [3, 4].

The full-map directory scheme maintains a bit map which contains the information about which nodes in the system have a shared copy of an associated block. When a read or write miss occurs, a request is sent to the home memory module as determined by the address of the requested data. Upon receiving the request, the home memory module sends a reply along with the data to the requesting node. Thus, it takes two messages to serve a read miss request.


*This research has been partly supported by NSF grant MIP-9301959.

However, the storage overhead necessary to maintain the directory is large, and becomes prohibitive as the size of the system grows. Also, the latency of cache transactions is usually larger, since these systems do not have a broadcasting medium like a shared bus to send invalidation signals. The limited directory approach [3, 5] limits the number of pointers associated with each block in order to keep the directory size manageable. However, this approach also limits the number of processors that can share a block. The existing schemes are discussed in more detail in Section 2 of this paper.

One way to reduce the storage overhead in the directory scheme is to use linked lists instead of a sparsely filled table to keep track of multiple copies of a block. The IEEE Scalable Coherent Interface (SCI) standard project [4] applies this approach to define a scalable cache coherence protocol. In this approach the storage overhead is minimal, but maintaining the linked list is complex and time consuming. The protocol is oblivious of the underlying interconnection network and therefore a request may be forwarded to a distant node although it could have been satisfied by a neighboring node. The major disadvantage is the sequential nature of the invalidation process for write misses. The scalable tree protocol (STP) [7] and the SCI tree extension protocol [8] were proposed to reduce the latency of write misses. The low latency of read misses is sacrificed in order to construct a balanced tree connecting all the shared copies of a cache block. The large number of messages generated for read misses, however, makes these protocols prohibitive for an application with a smaller degree of data sharing.

In this paper, we propose a new tree-based cache coherence scheme for shared memory multiprocessors. The proposed scheme aims at reducing the latency of both read and write misses. The main idea is to utilize the sharing information available from the limited number of pointers in the directory in forming an appropriate number of trees. It is a hybrid of the limited directory and the linked list protocols with only forward pointers. The proposed protocol has the advantages of the bit-map protocol and the tree-based linked list protocol, namely, small read miss latency (two messages), logarithmic write latency, and scalable directory memory requirement.



The rest of this paper is organized as follows. In Section 2, existing schemes are discussed. The detailed design of the proposed tree-based directory protocol is provided in Section 3. Performance comparisons between different protocols are given in Section 4, using an execution-driven simulation. Finally, concluding remarks are presented in Section 5.

2 Discussion on Existing Schemes

Existing directory schemes fall into two categories, namely bit-map and linked list protocols. A nomenclature, DiriX, was introduced in [3] for bit-map coherence protocols. The index i in DiriX represents the number of pointers for recording the owners of shared copies, and X is either B or NB depending on whether a broadcast is issued when the pointers overflow. We introduce a new notation, DiriTreek, for the linked list protocols that covers all the existing linked list protocols. The subscript i in Diri represents the number of pointers in the directory and the subscript k in Treek represents the number of pointers in the tree structure. For example, Stanford's singly linked list protocol [6] and SCI [4] belong to Dir1Tree1 because they have a single pointer in the directory pointing to the head of the list. Note that DiriTreek does not distinguish between singly linked list protocols (i.e., with only forward pointers) and doubly linked list protocols (i.e., with both forward and backward pointers); the index k of Treek counts only the forward pointers kept by the nodes having shared copies in their local caches. STP [7] belongs to Dir2Treek because it maintains a k-ary tree and keeps pointers to the root of the tree and the latest node joining the tree. Similarly, the SCI tree extension (P1596.2 [8]) belongs to Dir2Tree2 because it maintains a balanced binary tree and keeps two pointers, one to the root of the tree and the other to the head (the latest node joining the tree). Our tree-based protocol is a DiriTreek scheme with only forward pointers.

2.1 Bit-map Schemes

A. Full-Map (DirnNB)

In this scheme, n bits are associated with each memory block, one bit per node. If a copy of the shared block is contained in the local cache of a node, the presence bit corresponding to that node is set. The directory also has a dirty bit. If the dirty bit is set, only one node in the system has a copy of the corresponding shared block.

The advantage of this scheme lies in that only the nodes caching the block receive the invalidation messages. The disadvantage is the large directory size. The amount of directory memory in an n-node system is B·n² bits, where B is the number of shared blocks in each node.

B. Limited Directory Schemes

The main idea behind these schemes is based on the empirical result that in most applications, only a small number of processors share a memory block most of the time. Thus, a limited number of pointers in the directory will perform as well as the full-map scheme most of the time. The advantages of having a limited number of pointers are the scalable memory requirement and faster hardware support. If the pointers are not sufficient to record all the nodes having shared copies (i.e., pointer overflow), a mechanism must be employed to handle the overflow.

The memory requirement of a limited directory scheme is B·i·n·log n bits in an n-node system, where each node has B blocks of shared memory and i is the number of pointers in the directory.

Two limited directory schemes, DiriB and DiriNB, have been proposed in the literature [3].

The broadcast scheme DiriB employs an overflow bit to handle pointer overflow. If there is no pointer in the directory available for a subsequent request, the overflow bit is set. Invalidation messages are then broadcast to all the processors in the system to maintain cache coherence when a write miss occurs. This scheme performs poorly if the number of shared copies is just greater than the number of pointers. The non-broadcast scheme DiriNB avoids the broadcast used to solve the pointer overflow problem in DiriB by invalidating one of the processors pointed to by the pointers and replacing it with the current request. This scheme does not perform well when the number of shared copies is much greater than the number of pointers.

In LimitLESSi [5] and Dir1SW [9], the pointer overflow problem is solved by software. All the pointers that cannot fit into the limited hardware-supported directory space are stored in traditional memory by a software handler. The delay in calling the software handler is their major disadvantage.

2.2 Linked List Schemes

Singly Linked List Protocol

In this protocol [6], a list of pointers is kept in the processors' caches instead of main memory. Each shared block only keeps a pointer to a node which contains a valid copy of the block. The node pointed to by the home memory module, called the head, is the last one which accessed the corresponding shared block. The head in turn uses its pointer to point to another node which also has a valid copy. Continuing the above pointing process, a singly linked list is formed. The last node in the list, called the tail, points back to the home memory module.

On a read miss, a request is first sent to the home memory module. The memory module informs the head to supply the requested block to the requester. In the meantime, the memory updates its pointer to point to the requester. Upon receiving the block, the requester points to the supplier. The requester now becomes the head of the list.

On a write, the request is again sent to the home memory module. The memory module then follows the pointers on the linked list to invalidate all the valid copies in the system. Upon receiving the invalidation message, the head supplies the requested block to the requester. The tail sends an acknowledgment to the requester to indicate the completion of the invalidation process. The directory memory requirement for this protocol is (C+B)·n·log n bits, where B and C are the numbers of memory and cache blocks in each node.

Scalable Coherent Interface

The Scalable Coherent Interface (SCI) is an IEEE standard (P1596) [4]. It is based on a doubly linked list. On a read miss, the reading cache sends a request to the memory. If the list is empty, the memory points to the requester and supplies the data.


Otherwise, the old head of the list is returned to the requester. After receiving the reply from the home memory, the read requester sends a new request to the old head of the list. The old head returns the requested data and updates its predecessor pointer to the requester. The requester sets its successor pointer to the old head and becomes the new head of the list.

On a write miss, the requester puts itself at the head of the list as in the read miss situation. Then it sends an invalidation message to its successor and waits for an acknowledgment. After its successor is invalidated and taken out of the list, the requester updates its successor pointer to the successor of its old head and continues the same invalidation process. It takes 2P messages to invalidate a list of P cached copies. Adding the four messages for inserting itself as the new head, the requester takes 2P + 4 messages to get the write permission.

Scalable Tree Protocol

The Scalable Tree Protocol (STP) [7] uses a top-down approach to construct a balanced tree. Take a binary tree as an example. The first node issuing a read request to a specific memory block will be the root of the tree. The 2nd and 3rd nodes issuing read requests will be the children of the first one. Similarly, the 4th and 5th nodes making a read request will be the children of the 2nd node. Continuing the same procedure, a balanced tree is formed. The invalidation process follows the tree structure and can be done in logarithmic time.

This protocol attains a logarithmic invalidation process by constructing a balanced tree, but pays the price of generating too many messages for read misses. Since most of the requests in an application are read misses, the protocol performs poorly when the degree of data sharing or the frequency of write misses is low.

SCI Tree Extension

This scheme is proposed as an IEEE standard extension (P1596.2) of SCI [8]. It constructs a balanced tree by using an AVL tree algorithm. This scheme has a read miss overhead similar to STP. Thus, it does not perform well for applications with a low degree of data sharing and less frequent write misses.

We summarize the number of messages generated by a read or a write miss for the various protocols in Table 1. The pros and cons of each protocol are also given in Table 2. Dir4Tree2 is an example of the new protocol, proposed in the next section.


3 The New Cache Coherence Protocol

We propose a DiriTreek cache coherence protocol that combines a limited directory scheme with a tree-based scheme. The design of the protocol aims at minimizing the communication overhead for constructing the tree structure when a read miss occurs, and for invalidating the copies of the shared memory block when a write miss occurs. We begin by discussing the directory structures for cache and memory blocks. Then, coherence actions are described for read misses, write misses, and block replacements.

A. Directory Structure

The proposed scheme maintains many optimal or near-optimal trees for all shared cache blocks. We call it a DiriTreek scheme because i k-ary trees are maintained.

The indices i and k of DiriTreek indicate the number of pointers in each memory block and cache block, respectively. Thus, DiriTreek employs i pointers in a memory block and constructs k-ary trees pointed to by these i pointers. As an example, the organization of the trees with 14 shared copies constructed by the Dir4Tree2 scheme is shown in Figure 1, where the numbers in the circles denote the arrival sequence of the read requests. The construction of the trees is explained in detail later under read miss. The memory requirement is B·i·log n + C·k·log n bits per node in an n-node system, where B and C are the numbers of memory and cache blocks per node, respectively.
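As a rough illustration of these storage expressions, the C sketch below compares the total directory bits of the full-map, limited-directory, and proposed schemes, using the per-block costs quoted in Section 2 and above: n presence bits per memory block for full-map, i pointers of log2(n) bits per memory block for the limited directory, and i memory-block pointers plus k cache-block pointers for DiriTreek. The parameter values and function names are illustrative only, not taken from the paper.

#include <math.h>
#include <stdio.h>

/* Total directory storage (in bits) for an n-node system with B shared
 * memory blocks and C cache blocks per node. */
static double full_map_bits(int n, long B)           { return (double)B * n * n; }
static double limited_dir_bits(int n, long B, int i) { return (double)B * i * n * ceil(log2(n)); }
static double dir_tree_bits(int n, long B, long C, int i, int k)
{
    double ptr = ceil(log2(n));                      /* bits per node pointer */
    return (double)n * (B * i * ptr + C * k * ptr);  /* i ptrs per memory block, k per cache block */
}

int main(void)
{
    int  n = 1024;                 /* nodes (example value)            */
    long B = 8192, C = 2048;       /* blocks per node (example values) */
    printf("full map  : %.0f bits\n", full_map_bits(n, B));
    printf("Dir4NB    : %.0f bits\n", limited_dir_bits(n, B, 4));
    printf("Dir4Tree2 : %.0f bits\n", dir_tree_bits(n, B, C, 4, 2));
    return 0;
}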


Figure 1: The organization of trees constructed for Dir4Tree2.

The empirical results in [10] suggest that in many applications, the number of shared copies of a cache block is lower than four, regardless of the system size. Thus, we feel comfortable in using i = 4 and k = 2 to construct binary trees in this study. The write operation can be implemented by employing either an invalidation or an update protocol. We use an invalidation protocol with a strong consistency model in this paper. Figure 2 shows the structures of cache and memory blocks. The variable level in the memory block is used to record the height of the trees, and facilitates constructing near-optimal trees.

Figure 2: The structures of cache and memory blocks.

B. The Protocol and its Coherence Operations

The states of cache blocks are E (exclusive), V (valid), INV (invalid), RM_IP (Read Miss In Process), WM_IP (Write Miss In Process), and INV_IP (Invalidation In Process). The state transition diagram of cache blocks is shown in Figure 3. RM_IP, WM_IP, and INV_IP are transient states. In general, the coherence operations are similar to those in the full-map protocol.
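A minimal C sketch of how the block states and directory entries of Figure 2 might be declared, assuming i = 4 memory-block pointers and k = 2 cache-block pointers (binary trees); the field names, widths, and null encoding are illustrative assumptions, not taken from the paper.

#include <stdint.h>

#define DIR_PTRS   4          /* i: pointers per memory block (Dir4Tree2)       */
#define TREE_PTRS  2          /* k: forward pointers per cache block (binary)   */
#define NODE_NULL  0xFFFF     /* assumed encoding for "no child / empty slot"   */

/* Cache-block states from the text; RM_IP, WM_IP and INV_IP are transient. */
typedef enum { C_INVALID, C_VALID, C_EXCLUSIVE, C_RM_IP, C_WM_IP, C_INV_IP } cache_state_t;

/* Per-cache-block directory state: k forward pointers form the sharing tree. */
typedef struct {
    cache_state_t state;
    uint16_t      child[TREE_PTRS];   /* children in the k-ary sharing tree */
} cache_dir_entry_t;

/* Per-memory-block directory state: i root pointers, each with a tree level.
 * The memory-side protocol state machine is omitted here. */
typedef struct {
    uint8_t  dirty;                   /* set: exactly one (exclusive) copy   */
    uint16_t root[DIR_PTRS];          /* roots of up to i sharing trees      */
    uint8_t  level[DIR_PTRS];         /* height of each tree (for balancing) */
} mem_dir_entry_t;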


Table 1: Number of messages generated by a read or write miss for various schemes, where P is the number of processors that access the memory block under consideration.

Protocol             Pro                                                Con
Full map             Simple to implement; no replacement overhead;      High memory overhead; sequential
                     low read miss overhead                             invalidation process
DiriNB               Simple to implement; low memory overhead           Sequential invalidation; high
                                                                        invalidation overhead
LimitLESS            Low read miss overhead; low memory requirement     Sequential invalidation; slow
                     (hardware)                                         software handler
Single link chain    Moderate memory overhead                           Sequential invalidation
Double link chain    Moderate memory overhead                           Sequential invalidation
SCI tree extension   Logarithmic invalidation                           High read miss and replacement
                                                                        overhead
DiriTreek            Low read miss overhead; logarithmic
                     invalidation; low memory overhead

Table 2: Pros and cons of the various protocols.

Since we use a strong consistency model, the state of a cache block which sends invalidation messages to its children is changed to WM_IP while it waits for the acknowledgments. The transient state WM_IP for cache blocks does not exist in the full-map protocol. Two kinds of invalidation messages are shown in Figure 3. INV is used for the regular invalidation messages, as in the full-map protocol. Replace-INV is used for the coherence operations for cache replacements and will be explained in detail later.

The states of the memory blocks are the same as those in the full-map directory protocol. Figure 4 shows the state transition diagram of the memory blocks. The memory transient states are RM-WW (Read Miss Waiting for Writeback), WM-WW (Write Miss Waiting for Writeback), and WM_IP (Write Miss Invalidation In Process).

The major differences between DiriTreek and the full-map protocol lie in how the tree is constructed by using the limited number of pointers and in the actions taken for block replacements. As in the full-map directory protocol, the requested block is always provided by the home node. We discuss the read miss, write miss, and the coherence operations for cache replacements in detail below.

Read miss: A read request is said to be a miss if the cache controller finds that the requested data is not in any cache block, or if the cache block containing the requested data is in the invalid state. When a read miss occurs, a local cache block is first selected for replacement. The request is then passed over the network to the home memory module. The operations to serve a read miss are the same as in the limited directory scheme if a null pointer in the directory is available for the request. Otherwise, two pointers are selected and sent to the requesting node along with the requested data. The processors which were pointed to by the selected pointers become the children of the requesting processor. One of these two pointers is set to point to the requesting processor and the other is set to null. Figure 5 shows how a tree is constructed when the fifteenth request arrives at the home memory module in Figure 1. It can be seen that after the read miss is completed, processors 11 and 13 become the children of processor 15.

Figure 6 lists in detail the coherence operations for serving a read miss at the home memory module.



Figure 3: State transition diagram of the cache blocks.


Figure 4: State transition diagram of the memory blocks.

Four different situations are considered in Figure 6. First, it checks whether or not the processor has already been recorded. This situation might occur when the cached block in a processor was replaced and later that processor issues a read request again. The second situation considers the case when a processor has a read miss for the first time and there is an empty pointer available. The third and fourth parts consider the cases when there is no pointer available in the directory for the next incoming read request. If there are two pointers pointing to two trees with the same height, these two pointers are sent to the requesting processor and the processors pointed to by these two pointers become the children of the requesting processor. Then one of these two pointers is set to point to the requesting processor and the other is set to null. The last situation considers the case when there are no two pointers which point to trees with the same height. The pointer with the smallest level is selected and sent to the requesting processor. The processor pointed to by the selected pointer becomes the only child of the requesting processor.

Figure 5: Message movements for a read miss.

for (i = 0..3)
    if (p[i] == requester) { (data, null, null) -> requester; return; }

Figure 6: Cache coherence operations for a read miss.
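The following C sketch renders the four read-miss cases just described, operating on the memory-block entry type from the earlier sketch. It is one possible reading of Figure 6, not the figure itself; send_data(dest, child0, child1), which stands for the reply carrying the block plus up to two child pointers, and the level convention (a single cached copy counts as level 1) are assumptions.

extern void send_data(uint16_t dest, uint16_t child0, uint16_t child1);

void home_read_miss(mem_dir_entry_t *d, uint16_t req)
{
    /* Case 1: requester already recorded (e.g. it replaced the block earlier). */
    for (int i = 0; i < DIR_PTRS; i++)
        if (d->root[i] == req) { send_data(req, NODE_NULL, NODE_NULL); return; }

    /* Case 2: an empty pointer exists -- behave like a limited directory. */
    for (int i = 0; i < DIR_PTRS; i++)
        if (d->root[i] == NODE_NULL) {
            d->root[i] = req;  d->level[i] = 1;
            send_data(req, NODE_NULL, NODE_NULL);
            return;
        }

    /* Case 3: two trees of equal height -- both roots become children of the
     * requester; one pointer is redirected to the requester, the other cleared. */
    for (int i = 0; i < DIR_PTRS; i++)
        for (int j = i + 1; j < DIR_PTRS; j++)
            if (d->level[i] == d->level[j]) {
                send_data(req, d->root[i], d->root[j]);
                d->root[i] = req;        d->level[i]++;
                d->root[j] = NODE_NULL;  d->level[j] = 0;
                return;
            }

    /* Case 4: no two equal heights -- the lowest tree's root becomes the
     * requester's only child; the selected pointer now points to the
     * requester, one level higher. */
    int m = 0;
    for (int i = 1; i < DIR_PTRS; i++)
        if (d->level[i] < d->level[m]) m = i;
    send_data(req, d->root[m], NODE_NULL);
    d->root[m] = req;
    d->level[m]++;
}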

The selected pointer is then set to point to the requesting processor and the level of the pointer is incremented by one.

Note that Figure 6 only shows the high-level algorithm for dealing with a read miss. It is possible to implement an efficient hardware design for this operation. Unlike the limited directory, DiriTreek does not rely on broadcast or generate any unnecessary invalidation messages. DiriTreek also does not have the high overhead caused by the software trap used by the LimitLESS schemes.

Since there is only a limited number of pointers in the directory, the trees generated by DiriTreek are not balanced. In the following, we assume a fixed number of processors sharing a memory block and discuss how balanced the trees generated by the proposed scheme are.

Consider the Dir2Tree2 scheme first. Two pointers, P0 and P1, are in the directory. Let N1(j) and N2(j) be the number of processors in the j-level tree pointed to by P0 and P1, respectively. Table 3 shows the expressions of N1(j) and N2(j) derived from Figure 6. The expressions of N1(j) and N2(j) can be simplified as j and j(j+1)/2, respectively. A similar recurrence gives the expression of Ni(j) for DiriTree2. Table 4 lists the maximum number of processors caching a memory block versus the level of the trees for the proposed schemes Dir2Tree2 and Dir4Tree2, and for SCI or STP with binary trees.
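Using the simplified expressions quoted above, a few lines suffice to tabulate how many sharers the two Dir2Tree2 trees can hold at each level. This is only a sanity check on those closed forms, and it assumes that the corresponding Table 4 entry is the sum N1(j) + N2(j).

#include <stdio.h>

int main(void)
{
    for (int j = 1; j <= 8; j++) {
        int n1 = j;                 /* N1(j) = j        */
        int n2 = j * (j + 1) / 2;   /* N2(j) = j(j+1)/2 */
        printf("level %d: at most %d sharers under Dir2Tree2\n", j, n1 + n2);
    }
    return 0;
}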


Table 4: Maximum number of nodes constructed in Dir2Tree2 and Dir4Tree2 as a function of level.

We can easily check from the first row of the table that when there are 16 processors caching a memory block under the Dir4Tree2 scheme, pointers 0 and 1 each point to a tree with 7 nodes and pointers 2 and 3 each point to a single node. If a 1024-node system is built, the biggest tree maintained by the Dir4Tree2 scheme has 12 levels, which is only one level more than the balanced binary tree.

Write miss: When a write miss occurs, the write request is first sent to the home memory module. Invalidation messages are then sent out to the root nodes of the trees by following the pointers in the directory. The other nodes caching the data are invalidated by the messages originating from their corresponding roots. In order to speed up the invalidation process further, the nodes pointed to by odd-numbered pointers receive their invalidation message from the nodes pointed to by even-numbered pointers. The home memory module thus receives at most half the number of acknowledgments, which reduces the possibility of the home node becoming a bottleneck. An example of a write miss operation is shown in Figure 7, where 15 shared copies are in the system before the write miss occurs. The invalidation message to node 15 originates from node 9. The acknowledgments, which are omitted from the figure to preserve clarity, follow the reverse direction of the invalidation paths. It can be seen that Dir4Tree2 has a 3-level tree, which is shorter than the 4-level binary tree with ten nodes maintained by an STP protocol with binary trees or by the SCI tree extension.

Figure 7: Message movements for a write miss. (For clarity, acknowledgments are omitted.)
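A sketch of the write-miss invalidation fan-out just described, reusing the directory types from the earlier sketch. send_inv(dest, partner) is a hypothetical message primitive; its second argument carries the odd-numbered partner root that the even-numbered root must also invalidate.

extern void send_inv(uint16_t dest, uint16_t partner_root);

/* Home side: contact only the even-numbered roots; each of them relays an INV
 * to its odd-numbered partner root, so the home collects at most DIR_PTRS/2
 * acknowledgments. */
void home_write_miss_invalidate(const mem_dir_entry_t *d)
{
    for (int i = 0; i < DIR_PTRS; i += 2) {
        uint16_t even = d->root[i];
        uint16_t odd  = (i + 1 < DIR_PTRS) ? d->root[i + 1] : NODE_NULL;
        if (even != NODE_NULL)
            send_inv(even, odd);          /* even root also covers its partner */
        else if (odd != NODE_NULL)
            send_inv(odd, NODE_NULL);     /* no even partner to relay through  */
    }
}

/* Caching-node side: forward the INV to the partner root (if any) and down
 * the subtree, then invalidate the local copy.  Acknowledgments travel back
 * along the reverse of these paths. */
void node_handle_inv(cache_dir_entry_t *c, uint16_t partner_root)
{
    if (partner_root != NODE_NULL)
        send_inv(partner_root, NODE_NULL);
    for (int i = 0; i < TREE_PTRS; i++)
        if (c->child[i] != NODE_NULL)
            send_inv(c->child[i], NODE_NULL);
    c->state = C_INVALID;
}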

Replacement Operation: When a miss occurs, a cache block must be selected for storing the requested data before a request is sent to the home memory module for service. If the selected cache block currently holds a valid or exclusive copy of data with a different address, a replacement operation needs to be performed. We propose that when a valid or exclusive cached block is being replaced, the subtree rooted at the replaced cache block be invalidated without informing the home directory. The message type Replace-INV is used for the replacement operation to distinguish it from the INV generated by write misses, because no acknowledgment is needed for replacement. The rationale for doing this is as follows. First, as noted in [10], most of the time the number of shared copies of a memory block is less than four. Thus, our replacement operations will perform as well as the bit-map scheme because the replaced cache block has no child most of the time. Second, even when the trees grow bigger, most of the replaced cache blocks are positioned as leaf nodes of the trees. Third, replacements are infrequent if the set size of an associative cache memory increases. It is possible that one of the roots may be replaced and cause some communication traffic if one of its children issues a request later. However, the proposed replacement action is simple and easy to implement. It is worthwhile to note that the only possible communication overhead of the proposed scheme comes from the replacements.
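The replacement action above can be sketched in the same style: the evicted block tears down its subtree with Replace-INV messages, expects no acknowledgment, and does not notify the home directory. send_replace_inv() and the eviction entry point are hypothetical stand-ins.

extern void send_replace_inv(uint16_t dest);

void evict_block(cache_dir_entry_t *c)
{
    for (int i = 0; i < TREE_PTRS; i++)
        if (c->child[i] != NODE_NULL)
            send_replace_inv(c->child[i]);   /* each child invalidates its own subtree in turn */
    c->state = C_INVALID;                    /* no ack awaited, home not informed */
}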

4 Performance Evaluation

We use four real applications to compare the performance of the proposed DiriTreek coherence scheme with that of the full-map and the limited directory schemes. The applications comprise MP3D, LU decomposition, the Floyd-Warshall algorithm, and a Fast Fourier Transform program (FFT). We give a brief description of each program, indicating its purpose and the data structures employed, as follows.


Network bandwidth       8 bits
Data cache block size
Cache associativity     fully associative
Network type            binary n-cube
Switch/wire delay       1 cycle
Memory access latency   5 cycles
Cache access latency    1 cycle

Table 5: Simulation model.

Figure 8: Normalized execution time for MP3D.

4.1 Simulation Methodology

We ported the proposed coherence scheme to Proteus [11], which is an execution-driven simulator for shared memory multiprocessors. The simulator can be configured for either bus-based or k-ary n-cube networks. The networks use a wormhole routing technique. The specification of the simulated network and the cache memory is given in Table 5. We compare the normalized execution time of each application running under the various schemes mentioned above, where the normalized execution time is defined as the execution time relative to that of the full-map scheme. The examined schemes are DirnNB, DiriNB, and DiriTree2 for i = 1, 2, 4, 8. The results are plotted in Figs. 8 through 11 for the various applications. The full-map scheme is denoted by fm, the limited directory schemes by L8, L4, L2, L1, and the DiriTree2 schemes are represented by 8, 4, 2, and 1.

MP3D: The MP3D application is taken from the SPLASH parallel benchmark suite [12]. It is a 3-dimensional particle simulation program used in the study of rarefied fluid flow problems. MP3D is notorious for its low speedups. For our simulation, we used 3000 particles and ran the application for 10 steps. The results are given in Figure 8 for 8, 16, and 32 processors. Comparing the full-map and limited directory schemes in the 8 and 16-node systems, the performance of the full-map scheme is the best. It is shown that Dir4Tree2 is less than 5% slower than the full-map scheme and much faster than the limited directory schemes Dir4NB and Dir8NB. In the 32-node system, the performance of Dir2Tree2 and Dir4Tree2 is better than that of all the other schemes.


Figure 9: Normalized execution time for LU.

LU Decomposition: The LU application is also taken from the SPLASH parallel benchmark suite [12]. It is a parallel version of dense blocked LU factorization without pivoting. The data structure includes two-dimensional arrays in which the first dimension is the block to be operated on, and the second contains all the data points in that block. We use a 128x128 matrix in our simulation study. Figure 9 shows the performance results for LU. It can be seen that Dir1NB performs worst in all the cases. In the 8-node system, Dir1Tree2 performs better than all the other schemes. In the 16-node system, Dir4NB, Dir1Tree2, Dir2Tree2, and Dir4Tree2 perform as well as the full-map scheme. In the 32-node system, surprisingly, Dir4NB has the best performance. Dir2Tree2 and Dir4Tree2 also perform better than the full-map scheme.

Floyd-Warshall: Floyd-Warshall is a program that computes the shortest distance between every pair of nodes in a network. The network employed is a random graph of 32 nodes. The basic data structures in the Floyd-Warshall algorithm are 2-dimensional arrays representing the predecessor matrix and the distance matrix. An additional 2-dimensional array is also used for recording the computed path. Each processor is responsible for updating a few rows of the distance matrix. The entire matrix is declared as a shared array. Updating the distance matrix requires reading the entire shared array, which incurs a large degree of data sharing. Figure 10 shows the performance plot for the Floyd-Warshall program. Dir2Tree2 and Dir4Tree2 perform very closely to the full-map scheme. The performance difference between Dir4Tree2 and the full-map scheme is less than 2%.

FFT: Figure 11 gives the results for the FFT application. Except for Dir1Tree2, all the other schemes perform very well. However, the proposed schemes Dir4Tree2 and Dir8Tree2 perform better than the full-map and the limited directory schemes. The improvement in the case of the proposed schemes increases as the system becomes bigger. The improvement stems from the fact that not much communication overhead is caused by replacements.


Figure 10: Normalized execution time for Floyd-Warshall.

Figure 11: Normalized execution time for FFT.

5 Conclusion

In this paper, we proposed a new tree-based directory cache coherence protocol for shared memory multiprocessors. The proposed protocol combines the features of the limited directory schemes with tree protocols. It utilizes a limited number of pointers to construct trees, reducing both the directory size and the invalidation latency. Compared to the STP and the SCI tree extension schemes, the proposed scheme has lower read miss overhead, which is just two messages. At the same time, it retains the low invalidation latency of a tree protocol for large degrees of sharing. The trees constructed by the proposed scheme are nearly balanced. Execution-driven simulation shows that the proposed scheme is very close in performance to the full-map scheme. When the number of processors is large, the new scheme even performs better than the full-map scheme in some cases. At the same time, our scheme requires less directory space than the full-map scheme.

References

[1] D. J. Lilja, Cache Coherence in Large-Scale Shared-Memory Multiprocessors: Issues and Comparisons, ACM Computing Surveys, pp. 303-338, Sept. 1993.

[2] Q. Yang and L. N. Bhuyan, Analysis and Comparison of Cache Coherence Protocols for a Packet-Switched Multiprocessor, IEEE Transactions on Computers, vol. 38, no. 8, pp. 1143-1153, August 1989.

[3] D. Chaiken et al., Directory-Based Cache Coherence in Large-Scale Multiprocessors, Computer, vol. 23, no. 6, pp. 49-58, June 1990.

[4] IEEE, IEEE Std 1596-1992: IEEE Standard for Scalable Coherent Interface, IEEE, Inc., 345 East 47th Street, New York, NY 10017, USA, Aug. 1993.

[5] D. Chaiken, J. Kubiatowicz, and A. Agarwal, LimitLESS Directories: A Scalable Cache Coherence Scheme, ASPLOS-IV Proceedings, pp. 224-234, April 1991.

[6] M. Thapar, B. Delagi, and M. J. Flynn, Linked List Cache Coherence for Scalable Shared Memory Multiprocessors, In Proc. International Parallel Processing Symposium (IPPS), pp. 34-43, April 1993.

[7] H. Nilsson and P. Stenström, The Scalable Tree Protocol - A Cache Coherence Approach for Large-Scale Multiprocessors, In Proc. International Symposium on Parallel and Distributed Processing, pp. 498-506, December 1992.

[8] R. E. Johnson, Extending the Scalable Coherent Interface for Large-Scale Shared-Memory Multiprocessors, PhD thesis, University of Wisconsin-Madison, 1993.

[9] M. Hill et al., Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors, ASPLOS-V Proceedings, pp. 262-273, October 1992.

[10] W.-D. Weber and A. Gupta, Analysis of Cache Invalidation Patterns in Multiprocessors, ASPLOS-III Proceedings, pp. 243-256, 1989.

[11] E. A. Brewer, C. N. Dellarocas, A. Colbrook, and W. E. Weihl, PROTEUS: A High-Performance Parallel Architecture Simulator, Technical Report MIT/LCS/TR-516, MIT, 1991.

[12] J. P. Singh, W. D. Weber, and A. Gupta, SPLASH: Stanford Parallel Applications for Shared Memory, Technical Report CSL-TR-92-526, Stanford University, 1992.
