
Moving Address Translation Closer to Memory in Distributed Shared-Memory Multiprocessors

Xiaogang Qiu and Michel Dubois, Fellow, IEEE

Abstract—To support a global virtual memory space, an architecture must translate virtual addresses dynamically. In current processors, the translation is done in a TLB (Translation Lookaside Buffer), before or in parallel with the first-level cache access. As processor technology improves at a rapid pace and the working sets of new applications grow insatiably, the latency and bandwidth demands on the TLB are difficult to meet, especially in multiprocessor systems, which run larger applications and are plagued by the TLB consistency problem. We describe and compare five options for virtual address translation in the context of Distributed Shared Memory (DSM) multiprocessors, including CC-NUMAs (Cache-Coherent Non-Uniform Memory Access Architectures) and COMAs (Cache Only Memory Access Architectures). In CC-NUMAs, moving the TLB to shared memory is a bad idea because page placement, migration, and replication are all constrained by the virtual page address, which greatly affects processor node access locality. In the context of COMAs, the allocation of pages to processor nodes is not as critical because memory blocks can dynamically migrate and replicate freely among nodes. As the address translation is done deeper in the memory hierarchy, the frequency of translations drops because of the filtering effect. We also observe that the TLB is very effective when it is merged with the shared memory, because of the sharing and prefetching effects and because there is no need to maintain TLB consistency. Even if the effectiveness of the TLB merged with the shared memory is very high, we also show that the TLB can be removed in a system with address translation done in memory because the frequency of translations is very low.

Index Terms—Multiprocessors, distributed shared memory, virtual memory, simulations, dynamic address translation, virtual-address caches.

1 INTRODUCTION

In modern processors, page tables and address translation caches, commonly called TLBs (Translation Lookaside Buffers), support the dynamic translation from virtual to physical addresses. The TLB reduces the penalty of the translation. Currently, the TLB is part of the processor core and is accessed before or in parallel with the L1-cache. Current technology trends and the growing working-set demands of applications put more and more pressure on TLBs. Previous studies have shown that the TLB service time can consume up to 50 percent of the user execution time in some workloads [30], [31], [32].

The increasing pressure on the TLB comes from speed requirements and scalability issues. Being on the critical path of every instruction and data access, the TLB latency and bandwidth requirements increase with the clock rate and instruction-level parallelism [2]. The memory system of modern processors such as superscalar [39] or VLIW [13] processors must satisfy multiple memory and TLB accesses in every cycle. At the same time, the working sets of applications keep growing and changing. The issue of TLB scalability comes from the fact that the TLB is on chip and its size is fixed. It becomes a critical issue when the processor is integrated in a system where the size of other components, especially main memory, can vary: the same TLB must be used whatever the system configuration is. In a multiprocessor, the effective amount of TLB does not increase as fast as the number of processors because TLB entries are replicated. Moreover, TLB consistency must be maintained and is costly in large-scale multiprocessors [33].

Virtual-address L1-caches relieve the latency and bandwidth requirements of TLBs [5], [6], [12], [16], [29]. When the L1-cache is virtually indexed and tagged, most memory accesses are completed without TLB involvement. Scalability issues can be alleviated when address translation is done at various locations in the memory hierarchy [36]. The removal of the TLB altogether is also a possibility, which has been proposed before [29], [16].

In this paper, we explore design options to reduce dynamic address translation overhead and make it more scalable in large-scale distributed shared-memory multiprocessor systems. The basic idea is to move the address translation closer to memory.1 As we move address translation closer to shared memory, there is a point where the TLBs are shared, do not have consistency problems, and can scale well with both the memory size and the number of processors. We also explore the elimination of the TLB altogether as a possible alternative.

The basic competing architectures for large-scale distributed shared memory are CC-NUMA and COMA [35], although many optimizations of these models have been proposed. In CC-NUMAs (Cache-Coherent Non-Uniform Memory Access Architectures), coherence is maintained at the cache level.


X. Qiu is with Sun Microsystems Inc., 901 San Antonio Road, Palo Alto, CA 94303-4900. E-mail: Xiaogang@[email protected].

M. Dubois is with the Department of Electrical Engineering Systems, University of Southern California, Los Angeles, CA 90089-2562. E-mail: [email protected].

Manuscript received 20 July 2003; revised 31 May 2004; accepted 10 Sept. 2004; published online 20 May 2005. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TPDS-0117-0703.

1. Throughout this paper, we use the convention that memories close to the processor are at the top of the hierarchy. Thus, the first-level cache is at the highest level.


While there is no logical obstacle to moving the TLB to the shared memory in a CC-NUMA, translating addresses at the shared memory is a bad idea because it precludes the migration and replication of pages between nodes and constrains the static placement of pages in physical memory: the node where a page must reside is designated by its virtual page number. Since processor node access locality is critical in distributed shared-memory multiprocessors because of the nonuniform memory accesses, placing the TLB at the memory is not a viable option in CC-NUMAs.

By contrast, the main memories (called Attraction Memories) in the nodes of a COMA (Cache Only Memory Access Architecture) act as a global cache-coherent system, enabling free migration and replication of memory blocks. The TLB can be placed at the shared memory without compromising processor node access locality.

Because of this added flexibility afforded by the COMA organization, we focus in this paper on COMA architectures, although some configurations evaluated in the paper are applicable to CC-NUMAs as well. We show that, in a COMA, the concept of physical address may be eliminated altogether. All memory access transactions can be done with virtual addresses throughout the system, and address translation becomes part of the cache coherence protocol. Our simulations show that the address translation overhead is dramatically reduced and scales well as the translation point is moved down the hierarchy from the L1-cache to the home node (the node where the directory entry of a block is located).

The remainder of the paper is structured as follows: After some technical background in Section 2, the five design options for dynamic address translation are introduced in Section 3. Our evaluations of the various schemes using execution-driven simulation are presented in Section 4. We provide an overview of related research in Section 5. Finally, Section 6 concludes the paper.

2 BACKGROUND

2.1 Issues Related to Virtual Address Caches

Since most of the systems introduced and evaluated in this paper rely on caches (and memories) accessed with virtual addresses, we first address a series of issues that arise in any system with virtual-address caches. We briefly review the relevant issues and give the solutions adopted throughout this paper. These problems and their possible solutions are extensively discussed in [5] and [6].

Virtual-address caches have problems with synonyms, address mapping changes, and access-right control. Synonyms happen when multiple virtual addresses map to the same physical address and may cause inconsistencies in a virtual cache. Virtual-to-physical address mapping changes due to deallocation and reallocation of page frames also trigger cache flushes. Although there are many solutions to these problems, we adopt throughout this paper a segmented architecture such as in the 32-bit PowerPC [22], in which sharing is implemented at the segment level. The problem of maintaining access rights (such as read, write, and execute) in a virtual-address cache (instead of in the TLB) is also solved in a segmented system since access rights are checked at the segment granularity [16], [22] by storing access-right bits in the segment registers accessed before the caches.
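To make this concrete, here is a minimal Python sketch (our own illustration, not code from the paper; all names are invented) of access-right checking at segment granularity, performed on the segment registers before any cache or TLB is consulted:

```python
READ, WRITE, EXECUTE = 1, 2, 4

class SegmentRegister:
    """One of the 16 segment registers of a 32-bit segmented architecture."""
    def __init__(self, segment_id, rights):
        self.segment_id = segment_id  # selects a global (system-wide) segment
        self.rights = rights          # bitmask of permitted access types

def check_and_extend(segment_regs, vaddr, access, seg_bits=28):
    """Check rights in the segment register selected by the top address bits.

    No TLB is consulted: protection lives at segment granularity. On
    success, return the global virtual address used to index the caches.
    """
    seg = segment_regs[(vaddr >> seg_bits) & 0xF]
    if not (seg.rights & access):
        raise PermissionError(f"segment {seg.segment_id}: access denied")
    return (seg.segment_id << seg_bits) | (vaddr & ((1 << seg_bits) - 1))

# Example: all segments mapped read-only.
regs = [SegmentRegister(0x100 + i, READ) for i in range(16)]
print(hex(check_and_extend(regs, 0x10000040, READ)))
```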

Another issue is whether we should maintain inclusion between a TLB and the virtual caches or memories above it. Inclusion between a TLB and the cache above it is good because it avoids TLB misses on write-backs from the virtual-address cache. However, inclusion is expensive in large-scale distributed shared-memory multiprocessors. Whereas a large cache may cut the number of capacity misses dramatically, bigger caches cannot filter out coherence misses, and the longer address translation latency impacts coherence operations. Additionally, the TLB size needed to maintain inclusion grows with the cache size, leading to higher cost and longer latency.

It is not really mandatory to maintain inclusion. A simple solution to the write-back problem is to keep physical pointers in the virtual-address cache to avoid accesses to the TLB on a write-back. Of course, it is always possible to take a TLB miss on a write-back.
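The write-back shortcut can be sketched as follows (our names, not the paper's): each line of the virtual-address cache carries a physical pointer recorded at fill time, so a dirty eviction needs no TLB access:

```python
from dataclasses import dataclass

@dataclass
class VLine:
    vtag: int          # virtual tag used for lookups
    phys_ptr: int      # physical block address recorded when the line was filled
    dirty: bool = False

def evict(line, write_block):
    """Write a dirty line back using its stored physical pointer."""
    if line.dirty:
        write_block(line.phys_ptr)   # no TLB lookup on the eviction path
        line.dirty = False

evict(VLine(vtag=0x42, phys_ptr=0x1F80, dirty=True),
      write_block=lambda pa: print(f"write back to {hex(pa)}"))
```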

In this paper, we do not enforce inclusion between a TLB and the virtual caches above it.

2.2 Late Detection of Memory Faults

After a virtual address is translated into a physical address, the processor assumes the memory access will complete without fault (except for fatal errors) because the TLB contains all the information that could trigger an exception in the memory hierarchy. The traditional TLB associated with the L1-cache thus provides a "safe-access subset" such that any memory access passing the TLB does not generate any exception in normal execution. However, this is a conservative strategy, in which every memory access outside the "subset" stalls the processor even though it may not lead to a memory fault.

Moving the address translation point after a virtual cache (or memory) postpones the detection of a memory fault. The virtual cache and the TLB that follows it expand the "safe-access subset," which means that it takes longer to detect a page fault or a TLB miss than in a system with physical caches. This is a potential issue for the architectures proposed in this paper. However, it has been shown that, in modern processors with prefetching and release consistency, the late detection of memory traps is mostly hidden [26]. Moreover, helper threads can handle such exceptions at very low cost in future multithreaded processors [10], [40].

2.3 COMA Protocol

In COMA machines, the main memory in each node acts as a cache called "Attraction Memory," and inclusion [36] is enforced between the local attraction memory and the caches above it. In general, back-pointers to higher-level caches are maintained so that, in case of a replacement or invalidation in the attraction memory, the caches above it are invalidated. This rule applies throughout the cache hierarchy.

Throughout the paper, we adopt the write-invalidate protocol of COMA-F [18].


Each block in the attraction memory can be in one of four stable states: Shared, Master-Shared, Exclusive, and Invalid. Replacement in the attraction memory is a problem because there is no dedicated location to keep the last copy of a block. The replacement policy must make sure that the last copy of a block is not lost. A Master-Shared or Exclusive copy is responsible for holding the last copy. The replacement of such a block copy sends the block to the home node, where the directory resides. The home node accepts the injection only if it has Invalid blocks in the same set. If not, the home node forwards the block copy to a random node. The selected node accepts the injection if it has an Invalid or Shared block available. If it does not, it forwards the request to another node. We have adopted this protocol in all evaluations in this paper.
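The injection policy just described can be sketched as follows (a simplified Python illustration under our own naming; the real protocol involves messages and transient states, and the sketch assumes some node eventually accepts):

```python
import random

INVALID, SHARED, MASTER_SHARED, EXCLUSIVE = "I", "S", "MS", "E"

class Slot:
    def __init__(self):
        self.state, self.block = INVALID, None

class AttractionMemory:
    def __init__(self, node_id, num_sets=8, assoc=4):
        self.node_id = node_id
        self.sets = [[Slot() for _ in range(assoc)] for _ in range(num_sets)]

    def set_for(self, block_addr):
        return self.sets[block_addr % len(self.sets)]

    def try_accept(self, block_addr, acceptable_states):
        for slot in self.set_for(block_addr):
            if slot.state in acceptable_states:
                slot.state, slot.block = MASTER_SHARED, block_addr
                return True
        return False

def inject_last_copy(block_addr, nodes, home_id):
    """Replacement of a Master-Shared/Exclusive copy: never lose the block."""
    if nodes[home_id].try_accept(block_addr, {INVALID}):
        return home_id                      # home had an Invalid slot in the set
    others = [n for n in nodes if n.node_id != home_id]
    while True:                             # bounce among random nodes until one
        node = random.choice(others)        # has an Invalid or Shared slot free
        if node.try_accept(block_addr, {INVALID, SHARED}):
            return node.node_id

nodes = [AttractionMemory(i) for i in range(4)]
for slot in nodes[2].set_for(17):           # fill the home set so the injection bounces
    slot.state = MASTER_SHARED
print("block 17 landed at node", inject_last_copy(17, nodes, home_id=2))
```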

3 DYNAMIC ADDRESS TRANSLATION IN DISTRIBUTED SHARED-MEMORY MULTIPROCESSORS

3.1 Dynamic Address Translation in CC-NUMAs

Address translation can be done at different levels of the memory hierarchy in a traditional CC-NUMA architecture, as shown in Fig. 1. For each location, the memories to the left of the arrow (above the TLB) are accessed with virtual addresses, while memories to the right of the arrow (below the TLB) are accessed with physical addresses. Most processors translate virtual addresses in a TLB before or in parallel with the first-level cache (L0-TLB). However, provided the caches are virtually indexed and tagged, the TLB could be placed between the first- and second-level caches (L1-TLB) or after the second-level cache (L2-TLB). In all these cases, the TLBs are private to each processor and their consistency must be maintained.

Alternatively, the TLB could be associated with the home node (SHARED-TLB in Fig. 1). In this case, the TLBs are shared, map the local memory only, and do not cause coherence problems. However, because the location of a page is determined by its virtual address, the operating system has no control over page location and page migration. Page placement is constrained by the virtual address, and migration and replication of pages are therefore impossible. Because page placement cannot be optimized for locality either statically or dynamically, many cache capacity misses will be remote, resulting in poor performance for applications whose significant working set does not fit in the second-level cache but still fits in memory. Because of this, SHARED-TLB cannot compete with COMAs and is not a practically useful system to evaluate.

3.2 Dynamic Address Translation in COMA

Fig. 2 points to possible locations for the TLB in a COMA. Besides L0-TLB, L1-TLB, and L2-TLB (as in the CC-NUMA architecture), two new positions for the TLB are practical in COMAs. The TLB can be placed after the local (attraction) memory, since the attraction memory acts as a cache and can be accessed with virtual addresses like any other cache. We call this placement "AM-TLB." Note that, in a COMA, the blocks in the attraction memories can migrate and replicate freely, removing the need for the smart static page placement, dynamic page migration, and static/dynamic page replication needed in CC-NUMAs. In AM-TLB, the coherence protocol is handled with virtual addresses.

An additional possibility in COMA is to merge the TLB with the directory at the home node, which eliminates the notion of physical addresses altogether. As illustrated in Fig. 2, we call this placement of the TLB "HOME-TLB."

3.3 Virtualizing the COMA

3.3.1 Physical COMAs

In L0-TLB, L1-TLB, and L2-TLB, the attraction memory is accessed with physical addresses.


Fig. 1. Possible locations for the TLB in CC-NUMA.

Fig. 2. Possible locations for the TLB in COMA.


We thus refer to these systems as physical COMAs. This is the classical organization for COMAs, and we now review some of its properties.

In a physical COMA, the attraction memory in every node is divided into an equal number of sets indexed with physical addresses. A global set is made of all the sets with the same number in all attraction memories, as illustrated in Fig. 3, where each attraction memory is 4-way set-associative. The size of a global set is the product of the number of processor nodes and the set size in each attraction memory. So, the larger the multiprocessor system is, the larger the global set is. A given block with a given physical address is restricted to reside in the global set indexed by its physical address. The number of possible slots for a given memory block is limited by the size of a global set.

Assume that the number of attraction memory sets per node is $S = 2^s$, the associativity is $K = 2^k$ blocks in each node, the block size is $B = 2^b$ bytes, the number of processor nodes is $P = 2^p$, and the page size is $N = 2^n$ bytes. From Fig. 4, we see that the total physical memory available is $2^{p+k+s+b}$ bytes and the size of the tags in the attraction memory is $p + k$ bits.
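As a worked instance of these identities, the following sketch (ours, using the simulated configuration of Section 4: 32 nodes, 4 MB per node, 4-way associativity, 128-byte blocks, 4 KB pages) checks the arithmetic:

```python
# P = 2^p nodes, K = 2^k ways, B = 2^b byte blocks, N = 2^n byte pages.
p, k, b, n = 5, 2, 7, 12
node_bytes = 4 * 2**20                               # 4 MB of attraction memory per node
s = (node_bytes // (2**k * 2**b)).bit_length() - 1   # S = 2^s sets per node (power of two)

assert 2**(p + k + s + b) == 32 * node_bytes         # total physical memory = 128 MB
print(f"sets per node: 2^{s} = {2**s}")
print(f"global set size: P*K = {2**p * 2**k} block slots")
print(f"attraction-memory tag: p + k = {p + k} bits")
```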

Each physical page is mapped to a processor node (called the home node), which, in a COMA, simply means that the directory entries of all its blocks are stored at that node. The home node is pointed to by $p$ bits of the physical page number.

The consecutive blocks of a physical page are mapped to consecutive global sets in the attraction memories. The blocks of a page occupy slots in consecutive global sets, so that we can speak of the slot of a page. We can also speak of the global page set, which is made of all the contiguous global sets in which the blocks of a page are mapped. The number of physical page slots in a global page set is equal to the degree of associativity in a global set ($P \times K$). Because of the replication and migration of memory blocks in a global set, only a fraction of the physical page slots may be allocated in a global page set at any given time. Memory pressure is defined as the number of slots occupied in a global set divided by the size of the global set. When memory pressure approaches 100 percent in a global set, pages cannot replicate in that set. If the pressure in a global page set is too high, a page slot must be deallocated in that set.
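The memory pressure definition reduces to a one-line helper (an illustrative function of our own, using the 32-node, 4-way configuration as the default):

```python
def memory_pressure(occupied_slots, nodes=32, assoc=4):
    """Fraction of a global set's P*K slots currently holding a block copy."""
    return occupied_slots / (nodes * assoc)

print(f"{memory_pressure(51):.0%}")   # e.g., 51 of 128 slots occupied -> 40%
```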

To summarize, in a COMA with physically accessed attraction memory, the allocation of physical pages to virtual pages remains totally flexible. However, two factors must guide this allocation for performance reasons. The first one is the allocation of pages to home nodes. Since the directory entries of all the blocks in a page reside at the home node, home nodes should be distributed evenly across processor nodes to avoid imbalances in memory traffic. The second factor is the mapping of pages to global page sets. Pages should be mapped uniformly to global page sets so that the pressure is evenly distributed across the sets.

3.3.2 AM-TLB

In a COMA, the main memory acts as a cache and may also be virtually addressed. In this case, address translation is postponed until after a miss in the local node, as shown in Fig. 2. The physical address then points to the home node, which contains the directory information. Coherence is enforced with virtual addresses, although the home node and the directory are accessed with physical addresses.

Because the local attraction memory is now accessed with virtual addresses, a block and a page are restricted to reside in the global set indexed by their virtual address. Thus, the physical page number of a virtual page must map into the same global set as the virtual page number. This restriction on the page allocation strategy is equivalent to page coloring applied to the attraction memory [21]. Fig. 5 shows the mapping of addresses when the attraction memory of a COMA is virtually accessed.


Fig. 3. Set and Global Set (4-way set-associative memory).

Fig. 4. Address mapping for attraction memory accessed with physical addresses.


The virtual and the physical addresses have the same color, so that the global set number accessed with the virtual address is the same as if the physical address were used. Of course, this constraint becomes less stringent as the number of processors increases.
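The coloring constraint can be stated as a simple bit check, sketched below (our own helper, with parameter names following Section 3.3.1): a page-sized stretch of the address supplies $n - b$ of the $s$ set-index bits, so the remaining $s - (n - b)$ bits, the color, must come out the same in the virtual and physical page numbers.

```python
def color(page_number, s, n, b):
    """Color = the low s - (n - b) bits of the page number."""
    color_bits = s - (n - b)
    return page_number & ((1 << color_bits) - 1)

def legal_mapping(vpn, ppn, s=13, n=12, b=7):
    """A physical page may back a virtual page only if the colors match."""
    return color(vpn, s, n, b) == color(ppn, s, n, b)

assert legal_mapping(vpn=0x12345, ppn=0x00045)       # same low 8 bits: legal
assert not legal_mapping(vpn=0x12345, ppn=0x00046)   # colors differ: illegal
```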

In AM-TLB, a TLB is still private to a processor node, which means that TLB entries are replicated across nodes and TLB consistency must be maintained. Memory management is impacted: the dynamic allocation of page slots to pages is constrained by page coloring (a virtual page address must fall in the same global page set as the page slot allocated to it). However, the mapping of pages to home nodes is not restricted by the virtual address.

3.4 Home-TLB

We now move the address translation into the home node and integrate it completely with the cache coherence protocol. In this new design, which we call Home-TLB, the support for address translation is located at the home node. Attraction memories are accessed with virtual addresses as in AM-TLB but, additionally, the home node and the directory at the home node are also accessed with virtual addresses instead of physical addresses. Because attraction memory accesses, directory accesses, and coherence messages all use virtual addresses, we can bypass physical addresses altogether. Instead, we translate virtual addresses directly into directory entry addresses at the home node.

The directory memory in Home-TLB is organized in directory pages. A directory page has as many entries as there are blocks in a memory page. The directory memory is allocated and reclaimed in directory-page units by the virtual memory system. Due to the set-associative nature of the attraction memory, the mapping of a virtual page to a directory page is also set-associative.

Fig. 6 illustrates the mapping and translation of a virtual address into a directory entry at the home node. The $p$ least significant bits of the virtual page number point to the home node in order to interleave the directory pages across processor nodes. In this allocation, all directory pages in a global page set are located at the same home node, and the number of directory pages in a directory page set is the size of the global page set. This set size can be very large, as it is the product of the number of processor nodes and the associativity of each attraction memory ($P \times K$).

Let’s now trace a memory access in Home-TLB. A miss in the cache hierarchy in the local node accesses the local attraction memory with the virtual address, as in AM-TLB. If the local attraction memory misses, a message is sent to the home node pointed to by the virtual address. At the home node (see Fig. 6), the virtual attraction memory tag concatenated with the directory set index ($s - p - n + b$ bits) in the virtual page number is used to access the TLB. On a hit, the TLB yields the base address of the directory page, and the bits of the directory page index are added to it to access the directory entry. If the TLB misses, the directory set index points to the base address of the directory pages in the directory set. The virtual address tag is then matched to the tags in the set. Since the global set can be very large, we must use hashing or hierarchical translation. In Fig. 6, we show the translation using an Inverse Directory Page Table (IDPT).
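A hedged sketch of this lookup follows (Python, with names such as home_tlb, idpt, and dir_page_index of our own invention; the bit widths use the simulated configuration). The flow mirrors the trace above: consult the per-home TLB first, fall back to the IDPT on a miss.

```python
def directory_entry_addr(vaddr, home_tlb, idpt, p=5, s=13, n=12, b=7):
    """Translate a virtual address into a directory entry address at the home.

    Bit layout of the virtual page number (low to high): p home bits,
    s - p - n + b directory set index bits, then the attraction memory tag.
    """
    vpn = vaddr >> n
    home = vpn & ((1 << p) - 1)                           # selects the home node
    set_index = (vpn >> p) & ((1 << (s - p - n + b)) - 1)
    tag = vpn >> (s - n + b)                              # virtual AM tag
    dir_page_index = (vaddr >> b) & ((1 << (n - b)) - 1)  # block within the page

    key = (tag, set_index)
    base = home_tlb.get(key)              # shared per-home TLB: no coherence needed
    if base is None:
        base = idpt[key]                  # hashed IDPT lookup on a TLB miss
        home_tlb[key] = base
    return home, base + dir_page_index

# One resident directory page at base 0x8000 (illustrative numbers).
idpt, home_tlb = {(0x1, 0x2): 0x8000}, {}
vaddr = (((0x1 << 3 | 0x2) << 5 | 0x3) << 12) | (5 << 7)  # tag 1, set 2, home 3, block 5
print(directory_entry_addr(vaddr, home_tlb, idpt))        # (3, 32773): node 3, entry 0x8005
```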

From the point of view of virtual memory management, the directory page corresponds to the page frame in a classical system. On a page fault, a directory page is requested from the page’s home node. An entry in the inverted page table is filled with the necessary virtual address bits as well as other information. This action allocates a directory page to the new page.

The allocation of physical pages is constrained. A page slot allocated to a page must reside in the same global page set as the page, as in AM-TLB. A resident page may be swapped out by the paging daemon if the memory pressure of the page’s global set is higher than a threshold. Additionally, the home node is also bound by the virtual address. If the load on a particular home node becomes too heavy, pages may have to be swapped out as well.

4 EXPERIMENTAL EVALUATION

4.1 Methodology

We have run execution-driven simulations to compare the five options for dynamic address translation in COMAs. While L0-TLB, L1-TLB, and L2-TLB are possible placements for the TLB in both CC-NUMAs and COMAs, AM-TLB and HOME-TLB are applicable only to COMAs.

We simulate only shared data accesses. The parameters of the six SPLASH-2 benchmarks [38] used in our evaluations are shown in Table 1. The important working sets always fit in our simulated attraction memory, but sometimes do not fit in the caches.

The data set sizes in the benchmarks are orders of magnitude smaller than the data set sizes of applications running on an actual system. To offset this, we have to scale down the sizes of memories, caches, and TLBs.

Our simulated architecture has 32 nodes. Each node contains 4 MB of memory (for a system total of 128 MB), a 16 KB L1-cache, and a 64 KB L2-cache. The L1-cache is direct-mapped and write-through with a block size of 32 bytes. The L2-cache is 4-way set-associative and write-back with a block size of 64 bytes. The attraction memories are also 4-way set-associative, and their block size is 128 bytes. The page size is 4 KB for all simulations.

Fig. 5. Page mapping and coloring in AM-TLB and Home-TLB.

The average memory pressure is very low (from 4 percent to 40 percent). With such low memory pressures, we make sure that replacement activity is kept low. Additionally, the data sets easily fit in main memory and are preloaded so that we do not have to simulate paging activity, as our simulator does not simulate operating system activity.

does not simulate operating system activity.Table 2 compares the total TLB reach for various TLB

sizes with the total memory footprint of each benchmark.

The total TLB reach is the product of the TLB size, the page

size, and the number of processors.
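Spelled out for this configuration (an illustrative helper of our own, not from the paper):

```python
def total_tlb_reach(entries, page_size=4 * 1024, processors=32):
    """Total reach = TLB entries x page size x number of processors."""
    return entries * page_size * processors

for entries in (8, 32, 128):
    mb = total_tlb_reach(entries) / 2**20
    print(f"{entries:4d} entries -> total reach {mb:5.1f} MB")
# 8 -> 1.0 MB, 32 -> 4.0 MB, 128 -> 16.0 MB, to be compared with each
# benchmark's memory footprint in Table 2.
```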

For a TLB of 128 entries, the total TLB reach is comparable to or greater than the memory footprint of all benchmarks. For 32 entries, the reach is also comparable in size to the memory footprint. To be able to observe enough TLB misses and evaluate the effect on the execution times with our benchmarks, we scaled down the TLB sizes all the way to eight entries. This brings the ratio between memory footprint and total TLB reach in line with the ratio for the full-scale applications expected to run on an actual system.


Fig. 6. Access to the directory in Home-TLB.

TABLE 1. Benchmarks


An L1 hit has no latency charge, and an L2 hit takes six cycles. A hit in local memory takes 74 cycles. The network is an 8-bit-wide crossbar clocked at half the clock rate of the processors. An 8-byte request takes 16 cycles, and a message containing a block takes 272 cycles.

TLB misses can be served by trapping the main processor as in [16], by the protocol engine, or by a cache controller through some programmable logic. In our simulations, we assume that address translation occupies 40 cycles of the cache controller (for L0-TLB, L1-TLB, and L2-TLB) or of the protocol engine (for AM-TLB and Home-TLB). This lower bound on translation overhead assumes that page table entries are cached and that accesses to them hit in all caches. Since cache misses are more costly for higher-level caches than for translation in memory, the simplification of charging a fixed cost of 40 cycles for a TLB miss—wherever the TLB is—is biased in favor of systems where the TLB is closer to the processor and is more realistic for systems where the translation is done in memory.

4.2 Address Translation Misses

Fig. 7 shows the number of address translation misses per node as a function of the TLB size. One obvious observation is that the number of address translation misses consistently decreases as the TLB is moved away from the processor. This is due to a filtering effect by the caches: the number of misses in a noninclusive TLB cannot be larger than the number of misses in the cache above it. This effect is especially large when the TLB reach is less than the working set.

The case of RADIX stands out. The curves show no clear significant working set for any TLB organization or size until the size reaches 512 entries. RADIX has a disproportionately large number of writes, and these write accesses cause coherence transactions that are not filtered by the caches or the attraction memory. Except for RADIX, the TLB-miss curve for AM-TLB is much flatter than for L2-TLB.

In Home-TLB, the number of TLB misses is negligible for all benchmarks, even for very small TLB sizes. This is due to a sharing effect. TLB entries are shared and are not replicated. Thus, the effective number of TLB entries increases proportionally with the number of processors. This is not true for the other systems. This effect can be huge, as in RADIX. In each pass of RADIX, a key is written into a large output array shared and distributed among all nodes. The number of TLB misses in RADIX in Home-TLB is consistently lower than in an AM-TLB system with 32 times more TLB entries (recall that we simulate 32 processors). All other benchmarks show similar trends, albeit not as pronounced because their access patterns are more complex.


TABLE 2. Ratio between Memory Footprints and Total TLB Reach

Fig. 7. Number of address translation misses versus TLB size (fully associative TLBs).


The fact that, in the case of RADIX, the Home-TLB miss rate is lower than the miss rate of an AM-TLB 32 times larger suggests that another effect is at play besides the sharing effect. For example, a 16-entry TLB in Home-TLB has even fewer misses than a 512-entry TLB in AM-TLB. This comes from a prefetching effect, a consequence of the sharing of the TLB. For example, if processor 1 writes to a shared page which is then read by processor 2, the TLB miss taken by processor 1 prevents a TLB miss by processor 2. The impact of this prefetching effect is significant for cold misses when the whole working set fits in the TLB. In this case, every page table entry is loaded only once in the whole system in Home-TLB, instead of once per node in systems with private TLBs.

Table 3 shows the TLB miss rates (misses per processor reference). In L0-TLB, the miss rates are comparable to second-level cache (SLC) miss rates when the TLB has 8 or 32 entries. Thus, the TLB effects cannot be ignored. The situation improves somewhat in L2-TLB and AM-TLB. Home-TLB is the only case where we could neglect address translation misses as compared to cache misses.

4.3 Relative Miss Rate

As the TLB is moved down the memory hierarchy, the TLB miss ratio per processor reference decreases. However, as we have seen, this is mostly due to the filtering effect of the caches above the TLBs. The actual effectiveness of a TLB goes down when it is moved further away from the processor because the caches above it have absorbed the locality of the memory reference stream.

The effectiveness of a TLB can be measured by its relative miss rate, i.e., the number of TLB misses divided by the number of TLB accesses. If a TLB is ineffective, then it could be removed because it does not affect execution time much, especially if the frequency of translations is low. Table 4 shows the relative TLB miss rate for fully associative TLBs of sizes 8, 32, and 128. TLB effectiveness drops rapidly as the translation is moved from L0 to AM. However, due to the sharing and prefetching effects, we observe that TLB effectiveness improves dramatically when the translation is made at the home, even if the TLB only has eight entries. Additionally, the relative miss rate of Home-TLB drops much faster with the TLB size than that of AM-TLB, although both TLBs are accessed after the attraction memory. With a 32-entry or a 128-entry fully associative TLB, the relative miss rate of the TLB is lower in all cases, but the drop in the relative miss rate is very large for HOME-TLB. This suggests that a TLB is by far more effective in Home-TLB than in any other system, especially if the overhead of TLB consistency is factored in.
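The two metrics of Tables 3 and 4 differ only in their denominators. In our notation (not the paper's):

```latex
\[
\text{miss rate per reference} = \frac{\text{TLB misses}}{\text{processor references}},
\qquad
\text{relative miss rate} = \frac{\text{TLB misses}}{\text{TLB accesses}} .
\]
% Moving the TLB down the hierarchy shrinks the numerator of the first
% ratio (filtering), but it also shrinks the denominator of the second,
% so a TLB can have a low miss rate per reference and still be ineffective.
```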

4.4 TLB-Less Systems

An interesting design choice, which has been proposed before in the context of uniprocessors [17], is to remove the TLB altogether and translate addresses directly through page tables, in hardware or software. Due to the cache filtering effect and the poor effectiveness of TLBs, the TLB after a large virtual cache might be eliminated without affecting system performance significantly. Removing the TLB simplifies the hardware and eliminates the problem of TLB consistency, which plagues all systems except Home-TLB and which does not scale well in large-scale multiprocessors.

The TLB effectiveness in L0-TLB and L1-TLB is fairly good for these applications and, since the frequency of translations is high, removing the TLB in these systems is probably a bad idea. However, translating addresses after L2 without the help of a TLB is a good option, since both the frequency of translations and the TLB effectiveness are low. These observations have been made before [17]. Our results summarized in Table 4 additionally show that removing the TLB after the attraction memory would also be a very good design decision, as the TLB effectiveness is even lower than in L2-TLB.


TABLE 3. TLB Miss Rates per Processor Reference (%) (Fully Associative TLBs)

TABLE 4. Relative TLB Miss Rates (%) (Fully Associative TLBs)


Because the effectiveness of the TLB is so high in Home-TLB as compared to the other systems, it may not be as advantageous to remove it.

4.5 Execution Times

We have also run execution-driven simulations to estimate execution times. In Table 5, we show the average TLB overhead divided by the average processor stall time on local and remote memory accesses, for small TLBs of sizes 8 and 16.

The data in Table 5 show that address translation is a significant part of the memory penalty in L0-TLB. Memory penalties due to address translation become negligible when addresses are translated at the home node. Remember that we do not simulate TLB consistency maintenance and its impact, which favors L0-TLB.

The effects on the execution time are shown in Fig. 8. Busy is the time spent executing instructions in each processor, sync is the synchronization time, loc-stall counts the time spent on local cache misses, and rem-stall refers to the service time for attraction memory misses. We show two sets of results for each benchmark: the leftmost set is for TLB-less systems, and the rightmost set is for fully associative TLBs of size 8. More results can be found in [27].

Let’s first look at the execution times of the systems with TLBs (rightmost set). In OCEAN, we see a clear performance advantage for moving the TLB to memory. BARNES and RADIX show a similar trend, although it is less pronounced.

In RAYTRACE, we observe that the execution time (excluding the TLB overhead) of AM-TLB and Home-TLB is larger than in L0-TLB. The cause is not the TLB effectiveness, which is better than in the other systems. It is the definition of raystruct (the private stack for the ray tree of each node), in which padding is used to avoid false sharing. The padding is aligned on multiples of 32 KB in the virtual address space, which creates uneven conflicts in AM-TLB and in Home-TLB, leading to an increase of the synchronization time. We have observed [25] that, by simply aligning the padding to one page size (4 KB), the synchronization time is reduced significantly. Once this optimization is applied to the code, RAYTRACE behaves just as the other benchmarks do [25]. This example shows that simple virtual address layout optimizations done by the programmer or the compiler can improve the performance of AM-TLB and Home-TLB.

Let’s now focus on the systems with no TLB (TLB of size 0). Clearly, as expected, removing the TLB is not a good idea for L0-TLB and L1-TLB. For L2-TLB, AM-TLB, and Home-TLB, the performance impact of removing the TLB is not significant, even though every coherence miss triggers address translations. The address translation overhead is very small compared to the long latency of coherence activities. Although Home-TLB has much better effectiveness than L2-TLB and AM-TLB, removing the TLB at the home does not seem to hurt Home-TLB because of the very low access frequency to the TLB. Of course, TLB-less systems have no TLB consistency problems, which makes them desirable for CC-NUMA machines in particular.

5 RELATED WORK

DDM [14] and KSR-1 [4] provided the seminal idea and the first implementation concepts for COMA. Because the hierarchical directory in these architectures increases the remote access latency, a “flat” COMA (COMA-F) [18] was later proposed. The key idea was to decouple the access methods to data and directory. We have used the same protocol as COMA-F in our comparisons.

Virtual-address caches have been the topic of many papers [1], [12], [21], [36], [29]. A survey of the issues in uniprocessors and multiprocessors has been published [5], [6]. Lynch [21] has evaluated page coloring issues. His simulations indicate that the page fault rate does not noticeably increase with the number of colors. He concludes that the use of coloring has no deleterious performance effects on paging activity. He also indicates that physical cache performance varies from run to run, depending on the allocation of pages from the free list in the operating system. On the other hand, the performance of virtual caches is not sensitive to these implementation decisions.

Jacob et al. [16], [17] proposed a software-managed address translation scheme where the hardware TLB is eliminated. A big virtually indexed, virtually tagged SLC drastically cuts the frequency of address translations. This scheme can be considered a 0-entry L2-TLB. Ritchie first proposed an in-cache translation scheme [29]. Although it had a single level of cache, we can categorize it as an L2-TLB scheme because there is no physically indexed cache after the address translation mechanism.

Wang et al. [36] proposed the idea of a two-level virtual-real cache hierarchy where the TLB is placed after the FLC (first-level cache). We have called this system L1-TLB. They proposed to store pointers in the two caches to solve the synonym and write-back problems and to enforce inclusion.

Austin and Sohi [2] showed the bandwidth requirement on the TLB in L0-TLB for multiple-issue processors. Instead of brute-force multiported TLBs, they evaluated several methods to expand TLB bandwidth, such as interleaved TLBs, multilevel TLBs, piggyback ports, which send a completed translation to simultaneously arriving requests, and pretranslation, which allows a single translation request to be used for multiple memory accesses.


TABLE 5. Address Translation Time/Total Stall Time (%)


Talluri et al. [31], [32] and Romer et al. [30] use superpages to increase the TLB reach without enlarging the TLB. In [31], two page sizes, 4 KB and 64 KB, are supported with page reservation, which restricts the allocation of physical memory, and a subblock TLB, analogous to a subblock cache. Romer et al. [30] proposed online promotion: TLB misses are counted and, when the miss count reaches a threshold, a superpage is constructed by copying and reconstructing the physical memory layout.

Finally, Teller [33] proposed an in-memory TLB scheme for UMA (Uniform Memory Access) architectures to solve the TLB consistency problem. Extended to distributed shared-memory systems, this scheme would be equivalent to mapping pages to nodes based on the virtual address in a CC-NUMA (“SHARED-TLB”). We have not evaluated this scheme because the only way to improve node locality in CC-NUMA is to statically map pages to nodes and to replicate and migrate pages dynamically in a way that favors node access locality. This is impossible to accomplish if the home node is designated by the virtual address.

6 CONCLUSION

In this paper, we have argued that moving the address translation point deeper in the memory hierarchy relieves the processor core from handling translations, scales better with memory sizes and multiprocessor configurations, eliminates the TLB consistency problem, and reduces the overhead of translations. Thus, we have explored ways to move the TLB closer to the shared memory in distributed shared-memory multiprocessors.

In CC-NUMAs, we can move the TLB down the cache hierarchy, but placing the TLB at the shared memory makes any form of intelligent allocation of pages to processing nodes—either statically or dynamically—practically impossible. This is a huge handicap for CC-NUMAs, as there would be no processor node access locality and most cache capacity misses would be remote. We did not pursue this possibility in CC-NUMAs because we could not find a way to solve this problem.


Fig. 8. Execution times without TLBs and with TLBs of size 8.


Thus, we have pursued COMA architectures, in which processor node access locality is guaranteed by the hardware, independent of page placement.

We have evaluated five options for virtual address translation. Three options (L0-TLB, L1-TLB, and L2-TLB) perform the translation in the cache hierarchy and are applicable to CC-NUMAs as well as COMAs. The other two options (AM-TLB and Home-TLB) perform the translation in memory and are applicable only to COMA architectures. One major contribution of this paper is to introduce and provide design details for in-memory address translation in distributed shared-memory multiprocessors. We have also identified the three effects that make address translation effective at the home node: the filtering, sharing, and prefetching effects.

The TLB consistency problem is an issue that we have not considered, but it creates added overhead and complexity in large-scale systems for all configurations with a TLB, except for Home-TLB. TLB-less systems eliminate TLB consistency altogether and, provided TLB effectiveness and translation frequency are low, are just as efficient. Removing the TLB is a good design choice when the translation is done after the L2-cache or after the attraction memory because of the poor TLB effectiveness. We have also shown that, although TLB effectiveness is very high in Home-TLB, the TLB can be removed as well because of the very low frequency of translations.

Although our evaluations show that in-memory address translation has better properties than address translation in the cache hierarchy, they do not establish an overwhelming superiority of COMA over CC-NUMA. The location of address translation is one of the parameters to consider when designing a new DSM system. In the future, it would be desirable to conduct experiments with data-intensive applications and realistic applications with much larger data-set sizes than the ones considered here. It is very possible that the effect of address translation on overall performance would be much more pronounced. Also, the overhead of TLB consistency should be included in future work.

More ideas on how to exploit a machine running on virtual addresses, such as Home-TLB, can be found in [27].

ACKNOWLEDGMENTS

This work was supported by the US National Science Foundation under Grant No. MIP-9633542.

REFERENCES

[1] A. Agarwal, Analysis of Cache Performance for Operating System and Multiprogramming. Boston: Kluwer Academic Publishers, 1989.
[2] T. Austin and G. Sohi, “High-Bandwidth Address Translation for Multiple-Issue Processors,” Proc. 22nd Ann. Int’l Symp. Computer Architecture (ISCA), pp. 158-167, 1996.
[3] E. Bugnion, J.M. Anderson, T.C. Mowry, M. Rosenblum, and M.S. Lam, “Compiler-Directed Page Coloring for Multiprocessors,” Proc. Seventh Conf. Architecture Support for Programming Languages and Operating Systems (ASPLOS), Oct. 1996.
[4] H. Burkhardt III et al., “Overview of the KSR-1 Computer System,” Technical Report KSR-TR-9202001, Kendall Square Research, Feb. 1992.
[5] M. Cekleov and M. Dubois, “Virtual-Address Caches, Part 1: Problems and Solutions in Uniprocessors,” IEEE Micro, pp. 64-71, Sept./Oct. 1997.
[6] M. Cekleov and M. Dubois, “Virtual-Address Caches, Part 2: Multiprocessor Issues,” IEEE Micro, Nov./Dec. 1997.
[7] J. Chase, H. Levy, and M. Feeley, “Sharing and Protection in a Single-Address-Space Operating System,” ACM Trans. Computer Systems, pp. 271-307, Nov. 1994.
[8] J.B. Chen and A. Borg, “A Simulation Based Study of TLB Performance,” Proc. 19th Ann. Int’l Symp. Computer Architecture (ISCA), pp. 114-123, May 1992.
[9] D.W. Clark and J.S. Emer, “Performance of the VAX-11/780 Translation Buffer: Simulation and Measurement,” ACM Trans. Computer Systems, vol. 3, no. 1, Feb. 1985.
[10] M. Dubois, “Fighting the Memory Wall with Assisted Execution,” Proc. 2004 Computing Frontiers Conf., pp. 168-180, Apr. 2004.
[11] K. Gharachorloo, A. Gupta, and J. Hennessy, “Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors,” Proc. Fourth Conf. Architecture Support for Programming Languages and Operating Systems (ASPLOS), pp. 245-257, 1991.
[12] J.R. Goodman, “Coherency for Multiprocessor Virtual Address Caches,” Proc. Second Conf. Architecture Support for Programming Languages and Operating Systems (ASPLOS), 1987.
[13] L. Gwennap, “Design Concepts for Merced, Forecasting the Inner Workings of the Decade’s Most Anticipated Processor,” Microprocessor Report, vol. 11, no. 3, pp. 9-11, Mar. 1997.
[14] E. Hagersten, A. Landin, and S. Haridi, “DDM-A Cache-Only Memory Architecture,” Computer, vol. 25, no. 9, pp. 44-54, Sept. 1992.
[15] J. Huck and J. Hays, “Architecture Support for Translation Table Management in Large Address Space Machines,” Proc. 20th Ann. Int’l Symp. Computer Architecture (ISCA), pp. 39-50, 1993.
[16] B. Jacob and T. Mudge, “Software-Managed Address Translation,” Proc. Third Int’l Symp. High Performance Computer Architecture (HPCA), Feb. 1997.
[17] B. Jacob and T. Mudge, “Uniprocessor Virtual Memory without TLBs,” IEEE Trans. Computers, vol. 50, no. 5, pp. 482-499, May 2001.
[18] T. Joe, “COMA-F: A Non-Hierarchical Cache Only Memory Architecture,” PhD thesis, Stanford Univ., 1995.
[19] E.J. Koldinger, J.S. Chase, and S.J. Eggers, “Architecture Support for Single Address Space Operating System,” Proc. Fifth Conf. Architecture Support for Programming Languages and Operating Systems (ASPLOS), pp. 175-186, Oct. 1992.
[20] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy, “The Stanford FLASH Multiprocessor,” Proc. 21st Ann. Int’l Symp. Computer Architecture (ISCA), pp. 302-313, 1994.
[21] W. Lynch, “The Interaction of Virtual Memory and Cache Memory,” Technical Report CSL-TR-93-587, PhD thesis, Stanford Univ., 1993.
[22] The PowerPC Architecture: A Specification for a New Family of RISC Processors, C. May, E. Silha, R. Simpson, and H. Warren, eds. San Francisco: Morgan Kaufmann Publishers, 1994.
[23] A. Moga, A. Gefflaut, and M. Dubois, “Hardware vs. Software Implementation of COMA,” Proc. 1997 Int’l Conf. Parallel Processing, pp. 248-256, Aug. 1997.
[24] D. Nagle, R. Uhlig, T. Stanley, S. Sechrest, T. Mudge, and R. Brown, “Design Tradeoffs for Software-Managed TLBs,” Proc. 20th Ann. Int’l Symp. Computer Architecture (ISCA), pp. 27-38, 1993.
[25] X. Qiu and M. Dubois, “Options for Dynamic Address Translation for COMAs,” Proc. 25th Ann. Int’l Symp. Computer Architecture (ISCA), pp. 214-225, 1998.
[26] X. Qiu and M. Dubois, “Tolerating Late Memory Traps for ILP Processors,” Proc. 26th Ann. Int’l Symp. Computer Architecture (ISCA), pp. 76-87, 1999.
[27] X. Qiu, “Towards Virtually-Addressed Memory Hierarchies,” PhD thesis, Dept. of Electrical Eng. Systems, Univ. of Southern California, Aug. 2000.
[28] X. Qiu and M. Dubois, “Towards Virtually-Addressed Memory Hierarchies,” Proc. Seventh Int’l Symp. High Performance Computer Architecture (HPCA), pp. 51-62, Jan. 2001.
[29] S. Ritchie, “TLB for Free: In-Cache Address Translation for a Multiprocessor Workstation,” Technical Report UCB/CSD 85/233, Univ. of California at Berkeley, May 1985.


[30] T.H. Romer, W.H. Ohlrich, and A.R. Karlin, “Reducing TLB and Memory Overhead Using Online Promotion,” Proc. 22nd Ann. Int’l Symp. Computer Architecture (ISCA), pp. 176-187, 1995.
[31] M. Talluri and M.D. Hill, “Surpassing the TLB Performance of Superpages with Less Operating System Support,” Proc. Sixth Conf. Architecture Support for Programming Languages and Operating Systems (ASPLOS), 1994.
[32] M. Talluri, S. Kong, M.D. Hill, and D.A. Patterson, “Tradeoffs in Supporting Two Page Sizes,” Proc. 19th Ann. Int’l Symp. Computer Architecture (ISCA), pp. 415-424, May 1992.
[33] P. Teller and A. Gottlieb, “Locating Multiprocessor TLBs at Memory,” Proc. 27th Ann. Hawaii Int’l Conf. System Science, pp. 554-563, 1994.
[34] M. Tremblay and J.M. O’Connor, “UltraSPARC I: A Four-Issue Processor Supporting Multimedia,” IEEE Micro, pp. 42-50, Apr. 1996.
[35] P. Stenstrom, T. Joe, and A. Gupta, “Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures,” Proc. 19th Ann. Int’l Symp. Computer Architecture (ISCA), pp. 80-91, May 1992.
[36] W.H. Wang, J.-L. Baer, and H.M. Levy, “Organization and Performance of a Two-Level Virtual-Real Cache Hierarchy,” Proc. 16th Ann. Int’l Symp. Computer Architecture (ISCA), pp. 140-148, June 1989.
[37] H. Wang, T. Sun, and Q. Yang, “CAT—Caching Address Tags, A Technique for Reducing Area Cost of On-Chip Caches,” Proc. 22nd Ann. Int’l Symp. Computer Architecture (ISCA), pp. 381-390, 1995.
[38] S.C. Woo, M. Ohara, and E. Torrie, “The SPLASH-2 Programs: Characterization and Methodological Considerations,” Proc. 22nd Ann. Int’l Symp. Computer Architecture (ISCA), pp. 24-36, 1995.
[39] K.C. Yeager, “The MIPS R10000 Superscalar Microprocessor,” IEEE Micro, pp. 28-40, Apr. 1996.
[40] C. Zilles, J. Emer, and G. Sohi, “The Use of Multithreading for Exception Handling,” Proc. 32nd Ann. Int’l Symp. Microarchitecture (MICRO-32), 1999.

Xiaogang Qiu received the PhD degree in computer engineering from the University of Southern California. He is a staff engineer at Sun Microsystems, Inc. His research interests include computer architecture, microprocessor design and verification, and multiprocessor systems.

Michel Dubois received the PhD degree from Purdue University, the MS degree from the University of Minnesota, and an engineering degree from the Faculte Polytechnique de Mons in Belgium, all in electrical engineering. He is a professor in the Department of Electrical Engineering of the University of Southern California. Before joining USC in 1984, he was a research engineer at the Central Research Laboratory of Thomson-CSF in Orsay, France. His main interests are computer architecture and parallel processing, with a focus on multiprocessor architecture, performance, and algorithms. He has published more than 150 papers in technical journals and leading conferences on these topics. He is a member of the ACM and a fellow of the IEEE and the IEEE Computer Society.
