
A Survey of Hardware Solutions for Maintenance of Cache Coherence in Shared Memory Multiprocessors

Milo Tomašević
Department of Computer Engineering, Pupin Institute
POB 15, 11000 Belgrade, Yugoslavia
email: etomasev@yubgefi1.bitnet

Veljko Milutinović
School of Electrical Engineering, University of Belgrade
POB 816, 11000 Belgrade, Yugoslavia
email: milutino@pegasus...

Abstract

An appropriate solution to the well-known cache coherence problem in shared memory multiprocessors is one of the key issues in improving performance and scalability of these systems. Hardware methods are highly convenient because of their transparency for software. They also offer good performance since they deal with the problem fully dynamically. A great variety of schemes has been proposed; not many of them were implemented. Two major groups of hardware protocols can be recognized: directory and snoopy. This survey underlines their principles and summarizes a relatively large number of relevant representatives from both groups. The coherence problem in multilevel caches is also briefly considered. Special attention is devoted to cache coherence maintenance in large scalable shared memory multiprocessors.

(This research was partially sponsored by the NCR Corporation, Augsburg, Germany, and the NSF of Serbia, Belgrade, Yugoslavia. The ENDOT simulation tools were donated by the ZYCAD Corporation, Menlo Park, California, USA, and the TDT Corporation, Cleveland Heights, Ohio, USA. MOSIS-compatible design tools were provided by the Tanner Corporation, Pasadena, California, USA; additional design tools were provided by the ALTERA Corporation, California, USA. Published in the Proceedings of the Hawaii International Conference on System Sciences, Koloa, Hawaii, USA, January 5-8, 1993.)

1. Introduction

Multiprocessors are the most appropriate type of computer systems to meet the ever-increasing needs for more computing power. Among them, shared memory multiprocessors represent an especially popular and efficient class. Their increasing use is due to some significant advantages that they offer. The most important advantage is the simplest and the most general programming model, which allows easier development of parallel software, and supports efficient sharing of code and data. However, nothing comes for free. Shared memory multiprocessors suffer from potential problems in achieving high performance, because of the inherent contention in accessing shared resources, which is responsible for longer latencies in accessing shared memory.

Introduction of private caches attached to the processors helps greatly in reducing average latencies (Figure 1). Caches in multiprocessors are even more useful than in uniprocessors, since they also increase effective memory and communication

bandwidth. However, this solution imposes another serious problem. Multiple copies of the same data block can exist in different caches, and if processors are allowed to update their own copies freely, an inconsistent view of the memory is imminent, leading to program malfunction. This is the essence of the well-known cache coherence problem [Dubo88]. The reasons for coherence violation are: sharing of writable data, process migration, and I/O activities. A system of caches is said to be coherent if every read by any processor finds a value produced by the last previous write, no matter which processor performed it. Because of that, the system must incorporate some cache coherence maintenance mechanism which consists of a complete and consistent set of operations (for accessing shared memory), preserves a coherent view of the memory, and ensures program execution with correct data.

The cache coherence problem has attracted considerable attention through the last decade. A lot of research at prominent universities and companies has been devoted to that problem, resulting in a number of proposed solutions. The importance of the problem is emphasized by the fact that not only is a cache coherence solution necessary for correct program execution, but it can also have a very significant impact on system performance, and it is of utmost importance to employ cache coherence solutions as efficiently as possible. It is firmly proved that the efficiency of a cache coherence solution depends to a great extent on system architecture parameters and especially on parallel program characteristics [Egge89]. The choice of an appropriate solution is even more critical in large multiprocessors, where some inefficiencies in preserving the coherence maintenance are multiplied, which can seriously hurt the scalability.
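To make the inconsistency concrete, here is a minimal sketch in C (our own illustration, not taken from any of the surveyed systems; all names are hypothetical) of two processors that cache the same location X with no coherence mechanism at all: after processor 0 updates its private copy, processor 1 still observes the stale value.

```c
#include <stdio.h>

/* Hypothetical illustration: two private caches holding copies of the
 * same memory block, with no coherence mechanism whatsoever. */
typedef struct { int value; int valid; } CacheLine;

int main(void) {
    int memory_x = 5;                 /* shared location X in main memory */
    CacheLine c0 = { memory_x, 1 };   /* P0 reads X and caches it */
    CacheLine c1 = { memory_x, 1 };   /* P1 reads X and caches it */

    c0.value = 42;                    /* P0 writes X in its own cache only
                                         (write-back, no invalidation/update) */

    /* P1's next read hits in its cache and returns the stale value. */
    printf("P1 reads X = %d (memory holds %d, P0's cache holds %d)\n",
           c1.value, memory_x, c0.value);   /* prints 5, 5, 42: incoherent view */
    return 0;
}
```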

Figure 1: A shared memory multiprocessor system with private caches (M - memory module, C - private cache, P - processor, ICN - interconnection network).



2. Cache coherence solutions

Basically, all solutions to the problem can be classified into two large groups: software-based and hardware-based [Sten90]. This is a traditional classification that still holds, since the proposed solutions (so far) follow fully or predominantly one of the two approaches. Meanwhile, solutions using a combination of hardware and software means have become more frequent and promising, so it seems that this classification will be of less use in the future.

2.1. Software solutions

Software-based solutions generally rely on the actions of the programmer, compiler, or operating system in dealing with the coherence problem. The simplest but the most restrictive method is to declare pages of shared data noncacheable. More advanced methods allow the caching of shared data and accessing them only in critical sections, in a mutually exclusive way. Special cache managing instructions are often used for cache bypass, flush, and indiscriminate or selective invalidation, in order to maintain coherence. Decisions about coherence related actions are often made statically during the compiler analysis (which tries to detect conditions for coherence violation). There are also some dynamic methods based on the operating system actions.

Software schemes are generally less expensive than their hardware counterparts, although they may require considerable hardware support. It is also claimed that they are more convenient for large, scalable multiprocessors. On the other side, some disadvantages are evident, especially in static schemes, where inevitable inefficiencies are incurred since the compiler analysis is unable to predict the flow of program execution accurately and conservative assumptions have to be made. Software-based solutions will not be elaborated upon any further in this paper.

2.2. Hardware solutions

We have focused our attention on the hardware-based solutions, usually called cache coherence protocols. Although they require an increased hardware complexity, their cost is well justified by significant advantages of the hardware-based approach:

• Hardware schemes deal with the coherence problem by dynamic recognition of inconsistency conditions for shared data entirely at run-time. They promise better performance, especially for higher levels of data sharing, since the coherence overhead is generated only when actual sharing of data takes place.
• Being totally transparent to software, hardware protocols free the programmer and compiler from any responsibility about coherence maintenance, and impose no restrictions to any layer of software.
• Various proposed hardware schemes efficiently support the full range from small to large scale multiprocessors.
• Technology advances made their cost quite acceptable, compared to the system costs.

Due to the aforementioned reasons, hardware cache coherence protocols are much more investigated in the literature, and also much more frequently implemented in commercial multiprocessor systems.

Essential characteristics of existing hardware schemes are reflected in respect to the following criteria:

• Where the status information about data blocks is held and how it is organized,
• Who is responsible for preserving the coherence in the system,
• What kind of coherence actions are applied,
• Which write strategy is accepted,
• Which type of notification is used, etc.

According to the most widely accepted classification, based on the first and the second criterion, hardware cache coherence schemes can be principally divided into two large groups: directory and snoopy protocols. In the next two sections, basic principles are presented, and numerous solutions belonging to both groups are surveyed. After that, cache coherence in multilevel hierarchies is discussed. Finally, employing of previous methods in large scalable multiprocessors with cache-coherent architectures is considered.

This survey follows one specific organizational structure. Each approach to be considered here is briefly introduced. Then, different examples are presented. Whenever appropriate, the following points of view will be underlined:

• Research environment
• Essence of the approach
• Selected details of the approach
• Advantages
• Disadvantages
• Special comments (performance, etc.)

This survey tries to be as broad as possible; however, it is not exhaustive, because of space restrictions.

3. Directory protocols

The main characteristic that distinguishes this group of schemes is that the global system-wide status information relevant for coherence maintenance is stored in some kind of directory. The responsibility for coherence preserving is predominantly delegated to a centralized controller, which is usually a part of the main memory controller. Upon the individual requests of the local cache controllers, the centralized controller checks the directory, and issues necessary commands for data transfer between memory and caches, or between caches themselves. It is also responsible for keeping status information up-to-date, so every local action which can affect the global state of the block must be reported to the central controller. Besides the global directory maintained by the central controller, private caches store some local state information about cached blocks. Directory protocols are primarily suitable for multiprocessors with general interconnection networks.

Paper [Agar89] introduces one useful classification of directory schemes, denoting them as Dir_i X, where i is the number of pointers, and X is B or NB, for broadcast and no-broadcast schemes, respectively. The global directory can be organized in several ways. According to that issue, directory methods can be divided into three groups: full-map directory, limited directory, and chained directory schemes.

3.1. Full-map directory schemes

The main characteristic of these schemes is that the directory is stored in the main memory, and contains entries for each memory block. An entry points to the exact locations of every cached copy of a memory block, and keeps its status.


Using this information, coherence of data in private caches is maintained by directed messages to known locations, avoiding usually expensive broadcasts.

The first protocol from this class was developed in [Tang76] and later implemented in the IBM 3081. It uses an invalidation approach and allows the existence of multiple unmodified cached copies of the same block in the system, but only one modified copy. The memory directory is organized as a set of copies of all individual cache directories. Such an organization implies some serious disadvantages. Directory search is not easy, because duplicates of all cache directories must be checked when determining the status of a particular memory block. Also, the information in the memory directory and the cached directories has to be consistent all the time.

The classical full-map directory scheme proposed in [Cens78] applies the same coherence policy, but provides a much more efficient directory search. The directory contains entries for each memory block in the form of a bit vector (Dir_N NB scheme). A directory entry consists of N+1 bits: one presence bit per each of N processor-cache pairs, and one bit which denotes whether the block is modified in one of the caches (Figure 2a). Local cache directories store two bits per cache block (valid and modified bits).

A very similar directory scheme is described in [Yen85]. The memory directory is identically organized, but cache entries have an additional bit per block, which is set in the case of a clean exclusive copy. This eliminates the need for directory access for write hits to unmodified private blocks, but implies the burden of maintaining the correct value of this bit.

The main advantage of the full-map approach is that locating the necessary cached copies is easy, and only caches with valid copies are involved in coherence actions for a particular block. Because of that, these schemes deliver the best performance of all directory schemes. There are a couple of disadvantages, too. The centralized controller is inflexible for system expansion by adding new processors. Also, these schemes are not scalable for several reasons. Since all requests are directed to the central directory, it can become a performance bottleneck. The most serious problem is the significant memory overhead for full-map directory storage in multiprocessor systems with a large number of processors. One approach to alleviate this problem is presented in [OKra90]. The proposed sectored scheme reduces directory size by increasing the block size, while choosing the subblock as the coherency unit.
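As a rough illustration of the full-map organization just described, the following C sketch (our own simplified model with hypothetical names, not code from [Cens78] or any real memory controller) keeps one presence bit per cache plus a dirty bit for a block, and services a write request by sending directed invalidations only to the caches whose presence bits are set.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_CACHES 8   /* N processor-cache pairs (illustrative value) */

/* One directory entry per memory block: N presence bits plus one dirty bit,
 * as in a Dir_N NB organization. */
typedef struct {
    uint8_t presence;   /* bit i set => cache i holds a copy */
    uint8_t dirty;      /* set when exactly one cache holds a modified copy */
} DirEntry;

static void invalidate(int cache_id) {
    printf("  send INVALIDATE to cache %d\n", cache_id);
}

/* Cache 'writer' requests write permission for the block. */
static void handle_write_request(DirEntry *e, int writer) {
    for (int i = 0; i < NUM_CACHES; i++) {
        if (i != writer && (e->presence & (uint8_t)(1u << i)))
            invalidate(i);                  /* directed message, no broadcast */
    }
    e->presence = (uint8_t)(1u << writer);  /* writer is now the only holder */
    e->dirty = 1;
}

int main(void) {
    DirEntry e = { 0, 0 };
    e.presence |= 1u << 1;                  /* caches 1, 3 and 6 read the block */
    e.presence |= 1u << 3;
    e.presence |= 1u << 6;

    printf("cache 3 writes the block:\n");
    handle_write_request(&e, 3);            /* invalidates caches 1 and 6 only */
    printf("presence mask = 0x%02x, dirty = %d\n", (unsigned)e.presence, e.dirty);
    return 0;
}
```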

3.2. Limited directory schemes

Motivation to cope with the problem of directory overhead in full-map directory schemes led to centralized schemes with partial maps, or limited directories. They replace the presence bit vector with a small number of identifiers pointing to cached copies (Figure 2b). The condition for storage efficiency of such limited directories is given by i·log2N << N, where i is the number of pointers per entry and N is the number of caches.


When the number of cached copies exceeds the number of available pointers, such a scheme must either resort to broadcast or invalidate one of the existing copies, so the performance degradation for higher levels of sharing and a greater number of processors can be intolerable.

An interesting solution which combines the partial and full map approaches is given in [OKra90]. Tags of two different sizes are used. The small tag contains a limited number of pointers, while the large tag consists of the full-map bit vector. Tags are stored in two associative caches and allocated when needed. On the first reference, a small tag is allocated to the block, but when the small tag becomes insufficient, it is freed, and a large tag is allocated. When no tags are free, one of the used tags is selected, the corresponding block is invalidated, and the tag is reallocated.

The scalability of limited directory protocols, in terms of memory overhead for directory storage, is quite good; however, their performance heavily depends on the sharing characteristics of parallel applications.
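A corresponding sketch for a limited directory is given below (again a hypothetical C model, assuming a Dir_2 NB-style policy in which a pointer overflow is resolved by invalidating one existing copy). Each entry stores only i pointers of log2N bits instead of N presence bits, which is where the storage saving comes from.

```c
#include <stdio.h>

#define NUM_POINTERS 2      /* i pointers per entry (illustrative Dir_2 scheme) */

/* Limited-directory entry: i identifiers of sharers instead of N presence bits.
 * On pointer overflow this sketch follows a no-broadcast (Dir_i NB) policy:
 * one existing copy is invalidated to free a pointer. */
typedef struct {
    int sharer[NUM_POINTERS];
    int count;
} LimitedEntry;

static void invalidate(int cache_id) {
    printf("  send INVALIDATE to cache %d (pointer overflow)\n", cache_id);
}

static void add_sharer(LimitedEntry *e, int cache_id) {
    if (e->count == NUM_POINTERS) {               /* more sharers than pointers */
        invalidate(e->sharer[NUM_POINTERS - 1]);  /* evict one existing copy */
        e->count--;                               /* its pointer is now free */
    }
    e->sharer[e->count++] = cache_id;
}

int main(void) {
    LimitedEntry e = { {0}, 0 };
    add_sharer(&e, 1);   /* caches 1 and 4 read the block: both tracked exactly */
    add_sharer(&e, 4);
    add_sharer(&e, 7);   /* a third reader overflows the two pointers */
    printf("tracked sharers: %d and %d\n", e.sharer[0], e.sharer[1]);
    return 0;
}
```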

3.3. Chained directory schemes

Another way to ensure scalability of directory schemes, with respect to tag storage efficiency, is the introduction of chained, distributed directories. It is important that the approach does not limit the number of cached copies. Entries of such a directory are organized in the form of linked lists, where all caches sharing the same block are chained through pointers into one list (Figure 2c). Unlike the two previous directory approaches, chained directories are spread across the individual caches. The entry in main memory is used only to point to the head of the list and to keep the block status. Requests for the block are issued to the memory, and subsequent commands from the memory controller are usually forwarded through the list, using the pointers. Distributed, chained directories can be organized in the form of either singly or doubly linked lists.

Distributed directories implemented as singly linked lists are used in the Stanford Distributed-Directory Protocol [Thap90]. Main memory contains the pointer to the head of the list for a particular block, and the last member of the list contains the chain terminator. On a read miss, the requester is put on the head of the list and obtains the data from the previous head. On a write request, an invalidation signal is forwarded from the head through the intermediate members to the tail. Replacement of a chained copy in a cache is handled by invalidation of the lower part of the list. Completion of actions is acknowledged with reply messages.

A distributed directory with doubly linked lists is proposed in the SCI (Scalable Coherent Interface) project [Jame90]. Each list entry has a forward and a backward pointer. In this way, the replacement problem is alleviated, because an entry can easily be dropped by chaining its predecessor and its successor. An entry is added to the head of the list by updating the pointers of both the new entry and the old head entry. Some optimizations are incorporated into the coherence mechanism. One of the most important is the possibility of combining the requests for list insertions. The coherence protocol can also be bypassed for private data.

The main advantage of chained directory schemes is their scalability, while performance is almost as good as in full-map directory schemes [Chai90]. Because of better handling of the replacement situation, doubly linked lists perform slightly better compared to singly linked lists, at the expense of being more complex and using twice as much storage for pointers.
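The doubly linked organization can be sketched as follows (a simplified C model in the spirit of the SCI sharing lists, with our own hypothetical types; it is not the actual SCI protocol): memory keeps only the head pointer, a read miss inserts the requester at the head, and a replacement unlinks an entry by splicing its neighbours together.

```c
#include <stdio.h>
#include <stdlib.h>

/* Simplified model of a doubly linked sharing list: memory holds only a head
 * pointer; each cache entry holds forward and backward pointers. */
typedef struct Sharer {
    int cache_id;
    struct Sharer *fwd, *bwd;
} Sharer;

typedef struct { Sharer *head; } MemDir;

/* A read miss inserts the requester at the head of the list. */
static Sharer *insert_head(MemDir *dir, int cache_id) {
    Sharer *s = malloc(sizeof *s);
    s->cache_id = cache_id;
    s->fwd = dir->head;
    s->bwd = NULL;
    if (dir->head) dir->head->bwd = s;
    dir->head = s;
    return s;
}

/* Replacement: the doubly linked list lets a victim unlink itself by
 * splicing its predecessor and successor together. */
static void unlink_sharer(MemDir *dir, Sharer *s) {
    if (s->bwd) s->bwd->fwd = s->fwd; else dir->head = s->fwd;
    if (s->fwd) s->fwd->bwd = s->bwd;
    free(s);
}

int main(void) {
    MemDir dir = { NULL };
    insert_head(&dir, 2);
    Sharer *middle = insert_head(&dir, 5);
    insert_head(&dir, 7);                 /* list is now 7 -> 5 -> 2 */

    unlink_sharer(&dir, middle);          /* cache 5 replaces its copy */
    for (Sharer *s = dir.head; s; s = s->fwd)
        printf("sharer: cache %d\n", s->cache_id);   /* prints 7 then 2 */
    return 0;
}
```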

4. Snoopy protocols

In this group of hardware solutions, the centralized controller and the global state information are not employed. Coherence maintenance is based only on the actions of local cache controllers and distributed local state information. Because of that, all the actions for the currently shared block must be announced to all other caches, via a broadcast capability. Local cache controllers are able to snoop on the network, and to recognize the actions and conditions for coherence violation, which imposes some reactions (according to the utilized protocol), in order to preserve coherence. Snoopy protocols are ideally suited for multiprocessors which use a shared bus as the global interconnect, since the shared bus provides very inexpensive and fast broadcasts. They are also known as very cost-effective and flexible schemes. However, coherence actions on the shared bus additionally increase the bus traffic, and make the bus saturation more acute. Consequently, only systems with a small to medium number of processors can be supported by snoopy protocols. Two write policies are usually applied in snoopy protocols: write-invalidate and write-update (or write-broadcast).

4.1. Write-invalidate snoopy protocols

Write-invalidate protocols allow multiple readers, but only one writer at a time. Every write to a shared block must be preceded with the invalidation of all other copies of the same block, preventing the use of stale copies (Figure 3a). Once the block is made exclusive, cheap local writes can proceed until some other processor requires the same block.

The approach originated from the WTI protocol, where a write-through to memory on the system bus is followed by invalidation of all copies of the involved block. This simple mechanism can be found in some earlier machines. Such an unsophisticated coherence solution results in poor system performance because of a very high bus utilization [Arch86]. Only systems with a very small number of processors can use this scheme.

Another very simple scheme, the SCCDC protocol, prohibits caching of more than one copy of the block [Mend89]. Every read and write miss takes over the copy and makes it private. This mechanism is cost-effective for applications with prevalent sharing in the form of migratory data and synchronization variables.

The Write-Once protocol was one of the first write-back snoopy schemes. It was intended for single-board computers in the Multibus environment [Good83]. The action on the first write to a shared block is the same as in the WTI protocol, leaving the block in the specific reserved state. Subsequent write hits to that copy (in the dirty state) proceed locally, until some other processor requests the block. Another improvement refers to the ability that read misses can be serviced by the cache which holds the dirty block.

An ownership-based protocol was developed for the Synapse N+1, a fault tolerant multiprocessor system [Fran84]. A bit for each memory block determines whether memory or a cache is the owner of the block, and prevents race conditions. Cache ownership ensures cheap local accesses. When a cache is the block owner, a read miss is inefficiently handled, because memory has to be updated first, and the read request resubmitted then. Another disadvantage is that a write hit to shared data produces the same action as a write miss, since there is no invalidation signal.


The Berkeley protocol, implemented in the SPUR multiprocessor, also applies the ownership principle, but improves significantly over the previously described protocol [Katz85]. Efficient cache-to-cache transfers without memory update are enabled by the existence of the new shared dirty state. The main disadvantage of the protocol is its inability to recognize a possible exclusive use when fetching a missed block. A more sophisticated version is also proposed, which uses software hints to distinguish between loads of shared and non-shared data.

The problem of recognizing the sharing status on block load is solved in the Illinois protocol [Papa84]. Private data are better handled, since the protocol entirely avoids invalidations on write hits to unmodified non-shared blocks. To this end, the new exclusive unmodified state is introduced. On a read miss, every cache with a valid copy tries to respond and to supply the data. This requires correct arbitration on the shared bus. If the block is modified, only one cache responds. Simultaneously with the cache-to-cache transfer, a memory update takes place, which can slow down the action.

The CMU RB protocol presented in [Rudo84] tries to solve the main obstacle for performance improvement in write-invalidate protocols (invalidation misses) by introducing block validation. When some cache issues a bus read, all caches with invalidated blocks catch the data available on the bus, and update their own copies. This read broadcast cuts down the number of invalidation misses to one per invalidated block. Only three states are used to specify the block status. Blocks of one-word length are assumed, so space locality in programs cannot be exploited in a proper way. In spite of the general usefulness of the read-broadcast action, there is a negative side effect, which is the increase of cache interference reflected in high processor lockout from the bus [Egge89a].

Many of the previously mentioned useful features are included in the EIP (Efficient Invalidation Protocol) [Arch87]. Besides dirty owners, this protocol proposes another new feature: clean cache ownership. When no dirty block owner exists, a clean block owner is defined, as the last cache that experienced a read miss for the block, unless the block is replaced later. It makes handling of read and write misses to most unmodified blocks more efficient compared to protocols in which only memory can be the clean owner. The EIP protocol also applies the concept of block validation as in the RB protocol. Implementation of the EIP requires an increased hardware complexity.
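To summarize the write-invalidate behaviour discussed above, the sketch below gives a compressed, hypothetical state machine in the spirit of the four Illinois states (invalid, shared, exclusive unmodified, modified); it is our own simplification, not the exact protocol of [Papa84]. The exclusive unmodified state is what lets a later write hit proceed without any bus invalidation.

```c
#include <stdio.h>

/* Four local block states in the spirit of the Illinois protocol. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } State;

typedef enum { PR_READ, PR_WRITE } Op;

static State local_transition(State s, Op op, int other_copies_exist) {
    switch (op) {
    case PR_READ:
        if (s == INVALID)                       /* read miss */
            return other_copies_exist ? SHARED : EXCLUSIVE;
        return s;                               /* read hit: no change */
    case PR_WRITE:
        if (s == SHARED || s == INVALID)
            printf("  bus invalidation of other copies\n");
        /* EXCLUSIVE or MODIFIED: the write hit proceeds silently */
        return MODIFIED;
    }
    return s;
}

int main(void) {
    State s = INVALID;
    s = local_transition(s, PR_READ, 0);   /* miss, no other copy -> EXCLUSIVE */
    s = local_transition(s, PR_WRITE, 0);  /* silent upgrade -> MODIFIED */
    printf("final state = %d (3 = MODIFIED)\n", s);
    return 0;
}
```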

Figure 3: Write strategy in snoopy protocols: a) invalidation policy, b) update policy (M - shared memory, C - cache memory, P - processor, X - data block, X' - updated data block, V - valid bit; solid arrows denote the distributed write, dashed arrows the invalidation signal).


Instead of the usual full block invalidation, the WIP (Word Invalidation Protocol) employs a partial, word invalidation [Toma92]. It allows the existence of invalid words within a partially valid block, until the pollution point is reached. After the invalidation threshold is reached, the whole block is invalidated. This fine-grain invalidation approach avoids some unnecessary invalidations and achieves better data utilization. A read miss to a partially valid block is serviced by a shorter bus read-word operation, which gradually recovers the block. A write miss on the only invalid word in a partially valid block is also optimized.

Performance studies show that the write-invalidate protocols are a good choice for applications characterized with a sequential pattern of sharing, while fine-grain sharing can hurt their performance a lot [Egge89].

4.2. Write-update protocols

Write-update schemes follow a distributed write approach which allows the existence of multiple copies with write permission. The word to be written to a shared block is broadcast to all caches, and caches containing that block can update it (Figure 3b). Write-update protocols usually employ a special bus line for dynamic detection of the sharing status of a cache block. This line is activated on the occasion of a distributed write, whenever more than one cached copy of the block exists. When sharing of a block ceases, that block is marked as private and the distributed writes are no longer necessary. In this way, write-through is used for shared data, and write-back for private data.

A typical protocol from this group is the Firefly [Thac88], which is implemented in the DEC Firefly multiprocessor workstation. Only three states are sufficient to describe the block status, since the invalid state is not required. The memory copy is also updated on a word broadcast. No clean owner exists, and all caches with shared copies respond to a read miss, supplying the same data in a synchronized manner.

A very similar coherence maintenance is applied in the Dragon protocol [McCr84], which was developed for the Dragon multiprocessor workstation from Xerox PARC. The main difference with the previous protocol is in the memory update policy. Dragon does not update memory on a distributed write, resulting in the need for a clean owner (in respect to other cached copies) described by an additional state. The owner is the last cache that performed a write to a block. The only situation which requires a memory update is the eviction of the owned block from the cache. Avoiding frequent memory updates on shared write hits is the source of the performance improvement over the Firefly protocol, especially for higher rates of shared references [Arch86].

Serious performance degradation of write-update protocols can be caused by process migration. Allowing the processes to be switched among processors promotes false sharing, when logically private data become physically shared. In that situation, actual sharing increases and unnecessary coherency overhead is induced. A Dragon-like protocol is proposed in [Pret90] which tries to cope with this problem. It introduces one bit with each block to discriminate copies referenced by the current and the previous process, called used and unused copies. Comparing this identifier with the running process identifier, upon each processor operation or bus transaction, unused copies can be detected and eliminated. An additional bus line is also required for transferring the dirty ownership. In this way, some unnecessary write broadcasts can be avoided and the effects of cache interference reduced.

Contrary to write-invalidate protocols, this group is better suited for applications with tighter sharing. In case of sequential sharing and process migration, performance can be seriously hurt by frequent unnecessary write broadcasts.
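The role of the dedicated sharing line in write-update protocols can be sketched as follows (a hypothetical C model, loosely in the spirit of Firefly/Dragon, not their actual implementation): a write samples the line and performs a distributed write only while other copies exist, reverting to a cheap local write once the block becomes private.

```c
#include <stdio.h>

#define NUM_CACHES 4

/* Hypothetical sketch of the shared-line mechanism: on a write, the block's
 * sharing status (sampled from a dedicated bus line) decides between a
 * distributed write that updates other copies and a local write-back. */
static int present[NUM_CACHES];   /* which caches currently hold the block */
static int data[NUM_CACHES];      /* the cached word in each cache */

static int sharing_line(int writer) {          /* asserted if any other copy exists */
    for (int i = 0; i < NUM_CACHES; i++)
        if (i != writer && present[i]) return 1;
    return 0;
}

static void write_word(int writer, int value) {
    data[writer] = value;
    if (sharing_line(writer)) {                /* block is shared: broadcast word */
        printf("cache %d: distributed write of %d\n", writer, value);
        for (int i = 0; i < NUM_CACHES; i++)
            if (present[i]) data[i] = value;   /* every holder updates its copy */
    } else {
        printf("cache %d: private block, local write-back\n", writer);
    }
}

int main(void) {
    present[0] = present[2] = 1;               /* caches 0 and 2 hold the block */
    write_word(0, 10);                         /* shared -> cache 2 updated too */
    present[2] = 0;                            /* cache 2 evicts its copy */
    write_word(0, 11);                         /* now private -> local write */
    printf("cache 0 = %d, cache 2 = %d\n", data[0], data[2]);
    return 0;
}
```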

4.3. Adaptive protocols

Evidently, neither of the two approaches is able to deliver superior performance across all types of workloads. This is the reason why some protocols have been proposed that combine the invalidation and update policies in a suitable way. They start with write broadcasts, but when a longer sequence of local writes is encountered or predicted, the invalidation signal for the block is sent. These solutions can be regarded as adaptive, since they attempt to adapt the coherence mechanisms to the observed and predicted data use, in the effort to achieve the optimal performance.

One of the first examples of a combined policy can be found in the RWB protocol [Rudo84], which is an enhancement of the former RB protocol. The first write to a shared block is a write-through, which updates other cached copies and memory, as well. On the second successive write to a block by the same processor, an invalidation signal is sent onto the system bus, making the own copy exclusive. It seems that the invalidation threshold of this protocol is too low. Also, other copies need not be invalidated if used only for reading.

The combination of invalidate and update policy, according to the read/write access pattern, is the essence of the competitive snooping approach [Karl86]. Coherence maintenance starts with write-updates until their cost in cycles spent reaches the total cost of invalidation misses for all processors possessing the block, and then switches to write-invalidates. The coherence overhead stays within a factor of two, compared to the cost of the optimal off-line algorithm. Whether this approach brings a performance improvement or not is highly dependent on the sharing pattern of the particular parallel application. A couple of variants of the approach are proposed, as well as their implementation details.

A similar principle is followed in the EDWP protocol [Arch88]. A fixed criterion is set for switching from the update policy to the invalidate policy. After three distributed writes by a single processor, uninterrupted by a reference from any other processor, all remote cached copies will be invalidated. This criterion seems to be a reliable indicator of local block usage. Like the EIP, this protocol also allows a cache to obtain clean ownership of the block, improving the efficiency of miss service. Block status is defined with eight possible states. An additional bus line for determination of the existence of dirty owners is also needed.

Since a wide diversity of coherence protocols exists in this area, a strong need for a standardized cache coherence protocol has appeared. The MOESI class of compatible coherence protocols is introduced in [Swea86]. The MOESI model encompasses five possible block states, according to the validity, exclusiveness, and ownership statuses. Most of the existing protocols run in this class, either as originally specified or with some adaptations. This class is very flexible, because each cache in the system may employ a different coherence


and write policy from the class at the same time. A consistent memory view is still guaranteed. The IEEE Futurebus standard bus provides all necessary signals for the implementation of this model.
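An adaptive policy switch of the kind used by EDWP can be sketched with a simple per-block counter (a hypothetical C illustration assuming the threshold of three uninterrupted local writes mentioned above; it is not the actual protocol of [Arch88]).

```c
#include <stdio.h>

#define SWITCH_THRESHOLD 3   /* EDWP-style criterion: three uninterrupted writes */

/* Per-block adaptation counter: a hypothetical sketch of switching from the
 * update policy to the invalidate policy. */
typedef struct {
    int consecutive_local_writes;
    int others_invalidated;
} AdaptiveBlock;

static void local_write(AdaptiveBlock *b) {
    if (b->others_invalidated) return;             /* already exclusive: silent */
    if (++b->consecutive_local_writes < SWITCH_THRESHOLD) {
        printf("distributed write (update other copies)\n");
    } else {
        printf("threshold reached: invalidate remote copies\n");
        b->others_invalidated = 1;
    }
}

static void remote_reference(AdaptiveBlock *b) {
    b->consecutive_local_writes = 0;               /* interrupts the write run */
    b->others_invalidated = 0;                     /* sharing resumes */
}

int main(void) {
    AdaptiveBlock b = { 0, 0 };
    local_write(&b); local_write(&b);
    remote_reference(&b);                          /* another processor touches it */
    local_write(&b); local_write(&b); local_write(&b);  /* third in a row switches */
    local_write(&b);                               /* proceeds locally, no bus action */
    return 0;
}
```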

4.4. Lock-based protocols

Synchronization and mutual exclusion in accessing shared data is a topic very closely related to coherence. Most of the snoopy coherence schemes enforce a strict consistency model and do not address these issues explicitly. The common way to achieve exclusive access during a write to shared data is using the system bus as a semaphore. Meanwhile, some approaches find it useful to directly support synchronization primitives along with the mechanism for preserving cache coherence. We will mention here two of them.

The protocol proposed in [Bita86] is a typical example of a lock-based protocol. It supports efficient busy-wait locking, waiting and unlocking, without use of the test-and-set operation. Two additional states dedicated to lock handling are introduced: one for lock possession and the other for lock waiting. This mechanism allows the lock-waiter to work while waiting. Unlock is broadcast on the system bus, so the snooping busy-wait register of a waiter can catch it upon match. Then, it obtains the lock after prioritized bus arbitration, and interrupts the processor to use the locked data. In this way, unsuccessful retries for lock acquiring are completely removed from the bus, which is a critical system resource. The main disadvantage of this scheme is that it does not differentiate between read lock and write lock requests.

Lock primitives are also supported in the snoopy protocol described in [Lee90], which combines the coherence strategy with synchronization. Waiting for a lock is organized through distributed, hardware implemented FIFO queues. Caches which try to obtain a lock are chained in a waiting list. The tag field of each cache entry, besides state, contains a pointer to the next waiting cache and the count of waiters in the peer group. The advantage over previous protocols is that shared and exclusive locks can be distinguished. Shared locks allow simultaneous access for read lock requesters, which improves performance. This protocol has 13 possible block states, and incurs a significant additional hardware complexity.
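The queueing idea of [Lee90] can be illustrated roughly as follows (a hypothetical C sketch of a FIFO of waiting caches; the real protocol keeps the chain in the cache tags and distinguishes read and write locks, which this sketch omits). An unlock hands the lock directly to the next waiter instead of letting every cache retry on the bus.

```c
#include <stdio.h>

#define MAX_WAITERS 8

/* Hypothetical hardware lock queue: caches that miss on a lock are chained
 * into a FIFO; release passes the lock to the first waiter. */
typedef struct {
    int holder;                    /* -1 if the lock is free */
    int waiters[MAX_WAITERS];
    int head, tail, count;
} HwLockQueue;

static void acquire(HwLockQueue *q, int cache_id) {
    if (q->holder < 0) {
        q->holder = cache_id;
        printf("cache %d acquires the lock\n", cache_id);
    } else {
        q->waiters[q->tail] = cache_id;           /* chained waiter, no retries */
        q->tail = (q->tail + 1) % MAX_WAITERS;
        q->count++;
        printf("cache %d queued as waiter\n", cache_id);
    }
}

static void release(HwLockQueue *q) {
    if (q->count > 0) {
        q->holder = q->waiters[q->head];          /* pass lock to first waiter */
        q->head = (q->head + 1) % MAX_WAITERS;
        q->count--;
        printf("lock passed to cache %d\n", q->holder);
    } else {
        q->holder = -1;                           /* no waiters: lock becomes free */
    }
}

int main(void) {
    HwLockQueue q = { -1, {0}, 0, 0, 0 };
    acquire(&q, 2);
    acquire(&q, 5);       /* waits */
    acquire(&q, 1);       /* waits */
    release(&q);          /* lock passes to cache 5 */
    release(&q);          /* then to cache 1 */
    return 0;
}
```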

Figure 4: Multilevel cache organizations: a) private, b) multiport shared, c) bus-based shared (M - memory module, C1 - first level cache, C2 - second level cache, P - processor, SB - system bus, CB - cluster bus).

5. Coherence in multilevel caches

Having in mind the growing disparity between processor and memory speed, cache memories become more and more important. Meanwhile, single level caches are unable to successfully fulfill the two usual requirements: to be fast and large enough. A multilevel cache hierarchy seems to be the unavoidable solution to the problem. Lower levels of such a hierarchy are smaller but faster caches. Their task is to reduce miss latency. Upper level caches are slower but much larger, in order to attain a higher hit ratio and reduce the traffic on the interconnection network. Specific extensions of existing protocols are needed to ensure coherence in multilevel caches.

Among many successful organizations of multilevel caches, three of them appear to be the most important [Baer88]. The first organization is an extension of a single level cache, where every processor has its private hierarchy of caches (Figure 4a). Besides private first level caches, the other two organizations introduce shared caches on upper levels, common for several processors, but with a different access structure. In the second scheme, the upper level cache is multiported between the lower level caches (Figure 4b), while in the third organization, the upper level cache is accessed through a shared bus from the lower level caches (Figure 4c).

Since coherence maintenance is aggravated in multilevel caches, it is necessary to follow the principle of inclusion, to make it more efficient. This principle implies that the upper level cache is a superset of all caches in the hierarchy below it. In this way, coherence actions towards lower levels are filtered, and reduced only to the really necessary ones, lowering the cache interference. This gain justifies some space inefficiency incurred by inclusion. Applying inclusion to these types of cache hierarchies shows that the conditions for simple coherence maintenance can easily be met in the first type of organization, and that the third organization can also be an attractive solution, because of its cost-effectiveness.

A specific two-level cache hierarchy is proposed in [Wang89]. Every processor has a virtually addressed first level cache, and a physically addressed second level cache, with write-back on both levels. Virtual addressing on the first level makes it faster because of avoiding address translation in the case of a hit, although new problems, like the handling of synonyms, are introduced.



This problem is solved by maintaining pointers between copies in the first and the second level caches, as a part of an extended invalidation protocol for preserving coherence. The inclusion property is also applied. In order to increase memory bandwidth, splitting the first level cache into an instruction and a data part is advocated.
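The filtering effect of inclusion can be sketched as follows (a hypothetical C model of a second level cache that records, per line, whether the block is also present in the first level cache; the field names are our own): an incoming bus invalidation is forwarded below only when that bit is set.

```c
#include <stdio.h>

#define L2_LINES 8

/* Sketch of invalidation filtering under the inclusion property: the second
 * level cache keeps, per line, a note of whether the lower level also holds
 * the block, and forwards bus invalidations only in that case. */
typedef struct {
    int tag;
    int valid;
    int in_l1;        /* inclusion bookkeeping: copy also present in L1 */
} L2Line;

static L2Line l2[L2_LINES];

static void bus_invalidation(int tag) {
    for (int i = 0; i < L2_LINES; i++) {
        if (l2[i].valid && l2[i].tag == tag) {
            l2[i].valid = 0;
            if (l2[i].in_l1)
                printf("forward invalidation of tag %d to L1\n", tag);
            else
                printf("invalidation of tag %d filtered at L2\n", tag);
            return;
        }
    }
    printf("tag %d not cached here: no action below L2\n", tag);
}

int main(void) {
    l2[0] = (L2Line){ .tag = 100, .valid = 1, .in_l1 = 1 };
    l2[1] = (L2Line){ .tag = 200, .valid = 1, .in_l1 = 0 };
    bus_invalidation(100);   /* must reach L1 */
    bus_invalidation(200);   /* absorbed by L2 */
    bus_invalidation(300);   /* not present at all */
    return 0;
}
```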

6. Cache coherence in large shared memory multiprocessors

One of the most required characteristics of hardware cache coherence solutions is scalability, which depicts the ability to support efficiently large-scale shared memory multiprocessor systems. In the strive for more processing power, a broad range of various architectures has been proposed for those systems [Thak90]. Some of them try to retain the advantages of common bus systems and overcome scalability limitations by introducing multiple-bus architectures. Appropriate snoopy protocol extensions are the key issue for these systems. Other systems are oriented towards general interconnection networks and directory-based schemes. A frequent solution is to organize systems into bus-based clusters or subsystems which are connected with some kind of network. Consequently, the schemes that combine the principles and advantages of both snoopy and directory protocols on different levels seem to be highly effective for these systems.

The Wisconsin Multicube is a highly symmetric bus-based shared memory multiprocessor which supports up to 1024 processors [Good88]. Processing nodes are placed at the intersections of busses and memory modules are tied to column busses. Each node includes dedicated two-level caches. A small first level cache reduces miss latency, while a very large second level cache is intended to reduce bus traffic. It is connected to both row and column busses and is also responsible for coherence maintenance by snooping on both busses. The invalidation protocol issues at most twice as many bus operations on multiple busses in order to acquire blocks, compared to single bus systems.

The same topology is applied in the Aquarius multiprocessor [Carl90], but it differs from the Wisconsin Multicube in the memory distribution and the cache coherence protocol. Memory is completely distributed across nodes which also contain large caches. The coherence solution combines features of

Figure 5: Hierarchical organization of a large-scale shared memory multiprocessor (M - global memory module, ICN - global interconnection network, CC - cluster controller, CB - cluster bus, LM - local memory, C - cache memory, P - processor).

snooping and directory schemes. The snooping principle maintains coherence of caches on individual busses, while the directory method is used to preserve coherence among busses. Directories for memory blocks reside in large local memory modules.

A quite different multiple-bus architecture is proposed in [Wils87]. Caches and busses are hierarchically organized in a multilevel tree-like structure. Lower level caches are private and connected to local busses, while caches on higher levels are shared. An appropriate invalidation snooping solution (an extended and modified Write-Once protocol) is employed for coherence maintenance. Following the inclusion property, it can filter the invalidation actions to lower level caches and restrict coherency actions only to the sections where they are necessary. Significant traffic reduction can be attained in this way.

A cluster architecture with memory modules distributed among clusters is also proposed. This approach is applied in the Encore GigaMax project. For the same type of organization, a somewhat different coherence solution, which accounts for the type of sharing on various levels, is proposed in [DeWa90]. Since process allocation which results in tighter sharing on lower levels and looser sharing on higher levels is naturally expected, a snoopy write-update (modified Dragon) protocol is used inside the subsystems, and a write-invalidate (modified Berkeley) protocol is used on the global level.

Another protocol proposed for a hierarchical two-level bus topology with clusters can be found in [Arch88]. The protocol incorporates selected distributed features of snoopy protocols and consists of a local and a global portion.

A hierarchical cluster architecture similar to the previous one also characterizes the Data Diffusion Machine [Hari89]. Very large processor caches implement virtual shared memory. A hierarchical write-invalidate protocol is used as a cache coherence solution which fully supports data migration. Clusters are connected to higher levels through data controllers. That is where a set-associative directory for the data blocks on the lower level resides. The controller also monitors the activities on the busses in a snooping manner.

Stanford's DASH distributed shared memory multiprocessor system is composed of common-bus multiprocessor nodes linked by a scalable interconnect of general type [Leno90]. It is one of the first operational machines with a scalable cache coherent architecture. Each node contains private caches and employs a snoopy scheme for cache coherence. System-wide coherence is maintained using a directory-based protocol, which is independent of the network type. There is no global directory, since it is partitioned and distributed across the nodes, as well as the memory. Among other means for memory access optimization, the DASH supports a more relaxed release consistency model in hardware.

So far, hardware and software solutions have mostly been developed independently of each other, each approach trying to solve the problem only in its own domain. However, we strongly believe that a very promising approach in attaining the optimal performance is the complementary use of hardware and software means. We expect that this is a direction where real breakthroughs are likely to happen.

The MIT Alewife multiprocessor uses the LimitLESS protocol, which represents a hardware-based cache coherence method supported with a software mechanism [Chai91]. Directory entries implement only a limited number of hardware pointers, in order to be storage efficient, counting that this is sufficient in a vast majority of cases.


Exceptional circumstances, when more pointers are needed, are handled in software. In those infrequent cases, an interrupt is generated, and a full-map directory for the block is emulated. A fast trap mechanism provides support for this feature.
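The division of labour in a LimitLESS-style directory can be sketched as follows (our own simplified C illustration, not the Alewife implementation): a few hardware pointers cover the common case, and a simulated trap handler takes over with a full software-maintained bit vector when they overflow.

```c
#include <stdio.h>
#include <stdint.h>

#define HW_POINTERS 2

/* LimitLESS-flavoured sketch: a few hardware pointers handle the common case;
 * on overflow a software handler keeps a full bit vector instead. */
typedef struct {
    int hw_sharer[HW_POINTERS];
    int hw_count;
    int overflowed;          /* set once software has taken over this entry */
    uint32_t sw_bitmap;      /* full map maintained by the trap handler */
} LimitlessEntry;

static void software_trap(LimitlessEntry *e, int cache_id) {
    printf("  trap: emulating full-map entry in software\n");
    if (!e->overflowed) {
        for (int i = 0; i < e->hw_count; i++)        /* migrate hw pointers */
            e->sw_bitmap |= 1u << e->hw_sharer[i];
        e->overflowed = 1;
    }
    e->sw_bitmap |= 1u << cache_id;
}

static void add_sharer(LimitlessEntry *e, int cache_id) {
    if (!e->overflowed && e->hw_count < HW_POINTERS)
        e->hw_sharer[e->hw_count++] = cache_id;      /* common case: hardware only */
    else
        software_trap(e, cache_id);                  /* infrequent case: software */
}

int main(void) {
    LimitlessEntry e = { {0}, 0, 0, 0 };
    add_sharer(&e, 1);
    add_sharer(&e, 4);
    add_sharer(&e, 9);       /* third sharer exceeds the hardware pointers */
    printf("software bitmap = 0x%08x\n", e.sw_bitmap);
    return 0;
}
```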

7. Conclusion

This survey tries to give a comprehensive overview of hardware-based solutions to the cache coherence problem in shared memory multiprocessors. Doubtlessly, this topic deserves much more time and space, so we decided to provide a list of papers for additional reading, on a number of different subjects related to the general topic of cache coherence in multiprocessor systems. The great importance of the problem and the strong impact of the applied solution on system performance put an obligation on the system architects and designers to carefully consider the aforementioned issues in their pursuit towards more powerful and more efficient shared memory multiprocessors. Despite the considerable advancement of the field, it still represents a very active research area. Much of the work in developing, implementing, and evaluating the solutions is still to be done, with significant prospective benefits.

8. References

[Agar89] Agarwal A., Simoni R., Hennessy J., Horowitz M., "An Evaluation of Directory Schemes for Cache Coherence," Proceedings of the 16th ISCA, 1989, pp. 280-289.
[Arch85] Archibald J., Baer J. L., "An Economical Solution to the Cache Coherence Problem," Proceedings of the 12th ISCA, 1985, pp. 355-362.
[Arch86] Archibald J., Baer J. L., "Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model," ACM Transactions on Computer Systems, Vol. 4, No. 4, November 1986, pp. 273-298.
[Arch87] Archibald J., "The Cache Coherence Problem in Shared-Memory Multiprocessors," PhD Thesis, University of Washington, February 1987.
[Arch88] Archibald J., "A Cache Coherence Approach For Large Multiprocessor Systems," Proceedings of the Supercomputing Conference, 1988, pp. 337-345.
[Baer88] Baer J. L., Wang W. H., "On the Inclusion Properties for Multi-level Cache Hierarchies," Proceedings of the 15th ISCA, 1988, pp. 414-423.
[Bita86] Bitar P., Despain A., "Multiprocessor Cache Synchronization: Issues, Innovations, Evolution," Proceedings of the 13th ISCA, June 1986, pp. 424-442.
[Carl90] Carlton M., Despain A., "Aquarius Project," IEEE Computer, Vol. 23, No. 6, June 1990, pp. 80-83.
[Cens78] Censier L., Feautrier P., "A New Solution to Coherence Problems in Multicache Systems," IEEE Transactions on Computers, Vol. C-27, No. 12, December 1978, pp. 1112-1118.
[Chai90] Chaiken D., Fields C., Kurihara K., Agarwal A., "Directory-Based Cache Coherence in Large-Scale Multiprocessors," IEEE Computer, Vol. 23, No. 6, June 1990, pp. 49-58.
[Chai91] Chaiken D., Kubiatowicz J., Agarwal A., "LimitLESS Directories: A Scalable Cache Coherence Scheme," Proceedings of the 18th ISCA, 1991, pp. 224-234.
[Dubo88] Dubois M., Scheurich C., Briggs F., "Synchronization, Coherence, and Event Ordering in Multiprocessors," IEEE Computer, Vol. 21, No. 2, February 1988, pp. 9-21.
[Egge89] Eggers S. J., "Simulation Analysis of Data Sharing in Shared Memory Multiprocessors," Report No. UCB/CSD 89/501, University of California, Berkeley, April 1989.
[Egge89a] Eggers S., Katz R., "Evaluating the Performance of Four Snooping Cache Coherency Protocols," Proceedings of the 16th ISCA, 1989, pp. 2-14.
[Fran84] Frank S. J., "Tightly Coupled Multiprocessor System Speeds Memory-access Times," Electronics, 57, 1, January 1984, pp. 164-169.
[Good83] Goodman J., "Using Cache Memory To Reduce Processor-Memory Traffic," Proceedings of the 10th ISCA, 1983, pp. 124-131.
[Good88] Goodman J., Woest P., "The Wisconsin Multicube: A New Large-Scale Cache-Coherent Multiprocessor," Proceedings of the 15th ISCA, May 1988, pp. 422-431.
[Hari89] Haridi S., Hagersten E., "The Cache Coherence Protocol of the Data Diffusion Machine," Proceedings of the PARLE 89, Vol. 1, 1989.
[Jame90] James D. V., Laundrie A., Gjessing S., Sohi G. S., "Scalable Coherent Interface," IEEE Computer, June 1990, pp. 74-77.
[Karl86] Karlin A., Manasse M., Rudolph L., Sleator D., "Competitive Snoopy Caching," Proceedings of the 27th Annual Symposium on Foundations of Computer Science, 1986, pp. 244-254.
[Katz85] Katz R., Eggers S., Wood D., Perkins C., Sheldon R., "Implementing a Cache Consistency Protocol," Proceedings of the 12th ISCA, 1985, pp. 276-283.
[Lee90] Lee J., Ramachandran U., "Synchronization with Multiprocessor Caches," Proceedings of the 17th ISCA, 1990, pp. 27-37.
[Leno90] Lenoski D., Laudon J., Gharachorloo K., Gupta A., Hennessy J., "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor," Proceedings of the 17th ISCA, 1990, pp. 148-159.
[McCr84] McCreight E. M., "The Dragon Computer System, an Early Overview," Proceedings of the NATO Advanced Study Institute on Microarchitecture of VLSI Computers, Urbino, Italy, July 1984.
[Mend89] Mendelson A., Pradhan D. K., Singh A. D., "A Single Copy Data Coherence Scheme for Multiprocessor Systems," Computer Architecture News, 1989, pp. 36-49.
[OKra90] O'Krafka B. W., Newton A. R., "An Empirical Evaluation of Two Memory-Efficient Directory Methods," Proceedings of the 17th ISCA, 1990, pp. 138-147.
[Papa84] Papamarcos M., Patel J., "A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories," Proceedings of the 11th ISCA, 1984, pp. 348-354.
[Pret90] Prete C. A., "A New Solution of Coherence Protocol for Tightly Coupled Multiprocessor Systems," Microprocessing and Microprogramming, 30, 1990, pp. 207-214.
[Rudo84] Rudolph L., Segall Z., "Dynamic Decentralized Cache Schemes for MIMD Parallel Processors," Proceedings of the 11th ISCA, 1984, pp. 340-347.
[Sten90] Stenstrom P., "A Survey of Cache Coherence Schemes for Multiprocessors," IEEE Computer, Vol. 23, No. 6, June 1990, pp. 12-24.
[Swea86] Sweazey P., Smith A. J., "A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus," Proceedings of the 13th ISCA, 1986, pp. 414-423.
[Tang76] Tang C. K., "Cache System Design in the Tightly Coupled Multiprocessor System," Proceedings of the National Computer Conference, 1976, pp. 749-753.
[Thac88] Thacker C., Stewart L., Satterthwaite E., "Firefly: A Multiprocessor Workstation," IEEE Micro, February 1988.


[Thak90] Thakkar S., Dubois M., Laundrie A., Sohi G., "Scalable Shared-Memory Multiprocessor Architectures," IEEE Computer, Vol. 23, No. 6, June 1990, pp. 78-80.
[Toma92] Tomašević M., Milutinović V., "A Simulation Study of Snoopy Cache Coherence Protocols," Proceedings of the 25th HICSS, January 1992, pp. 427-436.
[Wang89] Wang W. H., Baer J. L., Levy H., "Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy," Proceedings of the 16th ISCA, 1989, pp. 140-148.
[Webe89] Weber W. D., Gupta A., "Analysis of Cache Invalidation Patterns in Multiprocessors," Proceedings of the 3rd ASPLOS, April 1989, pp. 243-256.
[Wils87] Wilson A., "Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors," Proceedings of the 14th ISCA, 1987, pp. 244-252.
[Yen85] Yen W. C., Yen D. W. L., Fu K. S., "Data Coherence Problem in a Multicache System," IEEE Transactions on Computers, C-34, January 1985, pp. 56-65.

9. Suggestions for further reading

Here is provided an additional list of references (not mentioned in the paper because of space restrictions). It gives a deeper insight into the cache coherence problem in shared memory multiprocessors. The papers cover a broad range of topics in the general area of cache coherence: parallel program characteristics, analytic and simulation models of protocols, performance evaluation and comparison, etc.

[Adve91] Adve S., Adve V., Hill M., Vernon M., "Comparison of Hardware and Software Cache Coherence Schemes," Proceedings of the 18th ISCA, 1991, pp. 298-308.
[Agar89a] Agarwal A., Gupta A., "Memory Reference Locality in Multiprocessors," Technical Report, June 1989.
[Baer89] Baer J. L., Wang W. H., "Multilevel Cache Hierarchies: Organizations, Protocols and Performance," Journal of Parallel and Distributed Computing, 6, 1989, pp. 451-476.
[Cheo89] Cheong H., Veidenbaum A., "A Version Control Approach to Cache Coherence," Proceedings of the International Conference on Supercomputing '89, June 1989, pp. 322-330.
[Cher91] Cheriton D., Goosen H., Boyle P., "Paradigm: A Highly Scalable Shared-Memory Multicomputer Architecture," IEEE Computer, Vol. 24, No. 2, February 1991, pp. 33-46.
[Cytr88] Cytron R., Karlovsky S., McAuliffe K., "Automatic Management of Programmable Caches," Proceedings of the 1988 ICPP, August 1988, pp. 229-238.
[Dubo82] Dubois M., Briggs F., "Effects of Cache Coherency in Multiprocessors," IEEE Transactions on Computers, Vol. C-31, No. 11, November 1982, pp. 1083-1099.
[Egge88] Eggers S., Katz R., "A Characterization of Sharing in Parallel Programs and its Application to Coherency Protocol Evaluation," Proceedings of the 15th ISCA, May 1988, pp. 373-383.
[Egge89b] Eggers S., Katz R., "The Effect of Sharing on the Cache and Bus Performance of Parallel Programs," Proceedings of the 3rd ASPLOS, April 1989, pp. 257-270.
[Good87] Goodman J., "Coherency for Multiprocessor Virtual Address Caches," Proceedings of the 2nd ASPLOS, 1987, pp. 72-81.
[Min89] Min S. L., Baer J. L., "A Timestamp-Based Cache Coherence Scheme," Proceedings of the 1989 ICPP, pp. 123-132.
[Owic89] Owicki S., Agarwal A., "Evaluating the Performance of Software Cache Coherence," Proceedings of the 3rd ASPLOS, April 1989, pp. 230-242.
[Sche87] Scheurich C., Dubois M., "Correct Memory Operation of Cache-Based Multiprocessors," Proceedings of the 14th ISCA, June 1987.
[Site88] Sites R., Agarwal A., "Multiprocessor Cache Analysis Using ATUM," Proceedings of the 15th ISCA, May 1988, pp. 186-195.
[Smit82] Smith A. J., "Cache Memories," Computing Surveys, 14, 3, September 1982, pp. 473-530.
[Smit87] Smith A. J., "CPU Cache Consistency with Software Support and Using One-Time Identifiers," Proceedings of the Pacific Computer Communications Symposium, Seoul, October 1987, pp. 153-161.
[Sten89] Stenstrom P., "A Cache Coherence Protocol for Multiprocessors with Multistage Networks," Proceedings of the 16th ISCA, May 1989, pp. 407-415.
[Tart92] Tartalja I., Milutinović V., "An Approach to Dynamic Software Cache Consistency Maintenance Based on Conditional Invalidation," Proceedings of the 25th HICSS, January 1992, pp. 457-466.
[Thap90] Thapar M., Delagi B., "Cache Coherence for Shared Memory Multiprocessors," Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, July 1990, pp. 155-160.
[Vern86] Vernon M., Holliday M., "Performance Analysis of Multiprocessor Cache Consistency Protocols Using Generalized Timed Petri Nets," Proceedings of the Performance '86 and ACM Sigmetrics 1986, May 1986, pp. 9-17.
[Vern88] Vernon M., Lazowska E., Zahorjan J., "An Accurate and Efficient Performance Analysis Technique for Multiprocessor Snooping Cache-Consistency Protocols," Proceedings of the 15th ISCA, 1988.
[Wood89] Woodbury P., Wilson A., Shein B., Gertner I., Chen P. Y., Bartlett J., Aral Z., "Shared Memory Multiprocessors: The Right Approach to Parallel Processing," Proceedings of the Spring COMPCON 89, 1989, pp. 72-80.
[Yeh83] Yeh C., Patel J., Davidson E., "Shared Cache for Multiple-Stream Computer Systems," IEEE Transactions on Computers, C-32(1), January 1983, pp. 38-47.
