
PreTrans: Reducing TLB CAM-Search via Page Number Prediction and Speculative Pre-Translation

Jiachen Xue and Mithuna Thottethodi
School of Electrical and Computer Engineering

Purdue University
West Lafayette, IN 47907-2035

Email: {xuej, mithuna}@purdue.edu

Abstract—The need for fast address translation within tight time constraints (before the L1 tag check but after effective address computation) imposes many design constraints. Freedom from such constraints can potentially lead to lower TLB energy costs. In this paper, we observe that (1) data accesses commonly use base-displacement addressing modes in which the effective address is computed as the sum of a base and a displacement, and (2) the effective page numbers are predictable once the base address is known. Further, it is easy to cache address translations alongside the predicted page numbers, thus enabling speculative address translation that can filter accesses to the TLB. The two observations enable our PreTrans design, in which (a) a speculative translation is available based solely on the base address, and (b) the translation is available simultaneously with the effective (virtual) address. PreTrans replaces most of the energy-expensive CAM lookups for TLB access with RAM lookups, which translates to significant power improvements in the TLB.

Keywords—Power, TLB, Speculation, Prediction

I. INTRODUCTION

Processors for netbook/tablet class systems such as ARM's Cortex A-15 and A-9 [1], [2] and Intel's Atom [3] line must achieve both low power and high performance, goals that are often in conflict. Targeting power in such processors is challenging because modern processor microarchitectures do not have a single component that dominates the core's power consumption. Rather, while some entities consume more power than others (e.g., issue queue, ALUs), many structures consume non-negligible fractions of power (e.g., register files and fully-associative TLBs). As such, the power cost of individual components must be individually targeted to reduce overall power.

In this paper, we focus on the energy cost of searching CAMs (content-addressable memories) in fully-associative TLBs such as those in the ARM A15 [1]¹. The performance goals of processors require fast address translation with a very high hit ratio (95+%) in the TLBs. However, fully-associative CAM-based TLB structures, which are accessed on every instruction fetch and every data memory access, can impose a high energy/power cost. The goal of this paper is to reduce the CAM-search access energy without negatively affecting performance.

¹To further confirm the value of fully-associative TLBs, we measured miss rates for lower-associativity TLB organizations. Indeed, some benchmarks saw a degradation of as much as 8% in TLB miss rates with lower-associativity TLBs.

Our approach is to use prediction and speculation for translation energy optimization. Several existing techniques use speculation for performance (e.g., prefetching and value prediction; discussed in Section VIII). In contrast, our technique uses speculation for energy, which involves (1) speculatively performing a low-energy operation that is predicted to replace a high-energy operation, and (2) falling back on the high-energy operation if the speculative low-energy operation fails.

One may think that well-understood principles of hierarchy can be applied to reduce TLB access energy. Even though there is no prediction/speculation involved, the higher levels of a hierarchy typically consume less power and can obviate the need for lower-level access in the common case. For example, the addition of a small L0 TLB could filter accesses to the L1 TLB, thus saving access energy/power. However, such a design is not ideal because misses in the small L0 TLB can be fairly disruptive, especially in modern out-of-order processors, where many instructions may have to be squashed and restarted whenever there is an L0 TLB miss. (Out-of-order processors may issue dependent instructions that assume a TLB hit. Such instructions would have to be squashed and reissued.) We later confirm in Section VI that even under optimistic assumptions, an L0 TLB can significantly degrade performance. Alternately, one could use an L0 TLB but force the scheduler to conservatively schedule instructions assuming that TLB accesses will miss. However, such scheduling is likely to significantly reduce performance. (Back-to-back dependent operations with one memory access will have a bubble in between.)

If it were possible to look up the L0 TLB far enough in advance of the L1 TLB access that an L0 miss caused no further delay, that would help. Unfortunately, the timing constraints of translation offer little flexibility for both an L0 TLB lookup and an L1 TLB lookup to occur before data cache access (specifically, the tag match). On the one hand, translation may not begin earlier than effective address computation. On the other hand, translation is needed before the cache tags may be compared. In fact, as described above, even the little delay slack that exists for translation is a consequence of using virtually-indexed, physically-tagged cache designs, in which translation may partially overlap with cache indexing.

Because PreTrans uses prediction, it is not bound by the same timing constraints as non-speculative translation. As such, PreTrans is able to speculatively pretranslate addresses (in the common case) even before the effective (virtual) address is computed. PreTrans achieves fairly high prediction accuracy (99% for the ITLB on both the ARM and x86 ISAs, and 75% for the DTLB on ARM and 52% on x86).

Further, PreTrans validates its speculation without a TLB lookup. Thus, in the common case, PreTrans can completely avoid energy-expensive TLB CAM lookups.

The design of PreTrans is based on the following key observation: in the common case, it is possible to predict the virtual and physical page numbers based solely on the base address in base-displacement addressing mode. This predictability arises from a combination of underlying causes. In some cases, there is a one-to-one mapping from base addresses to effective addresses that lasts for the duration of the program. In other cases, even when a base address may map to multiple effective addresses over the execution of the program, there are extended periods during which the mapping remains unchanged.

Ordinarily this observation would have no value, as both the prediction and the true effective address become available at approximately the same time. (The base address becomes available, either from the register file or from the bypass network, one cycle before the effective address is computed.) However, the second component of PreTrans pretranslates the address by caching VA/PA pairs in the prediction mechanism. Put together, PreTrans can (a) predict the physical address at the same time that the effective address becomes ready, and (b) validate whether the prediction is correct without going to the TLB (by comparing the predicted effective address with the true effective address).

Full-system simulations with PARSEC and SPEC2006 CPU benchmarks reveal that PreTrans can achieve an 84% to 91% reduction in power with no degradation in performance.

In summary, the key contributions of this paper are:

• We observe that the common base-displacement addressing mode creates an opportunity for early availability of the page number. Specifically, we observe that it is possible to predict the page (and hence the translation) from the base address.

• We design a simple pretranslation predictor that is correct 75% of the time for the ARM architecture and 52% of the time for the x86 architecture on data translation, and more than 99% of the time on instruction translation for both architectures.

• The predictor enables a significant reduction in the number of accesses to highly associative TLBs, thus saving 90% and 85% of TLB energy, on average, for ARM and x86 respectively.

The rest of the paper is organized as follows. Section II provides a brief background on the terms and naming conventions used in the remainder of this paper. Section III describes the intuition behind the predictability of virtual addresses. Section IV describes the hardware design of PreTrans. Sections V and VI present experimental results to validate our claims about PreTrans. Section VII outlines possible future work. Section VIII relates this paper to prior literature on TLB power savings. Finally, we conclude in Section IX.

II. BACKGROUND

Figure 1(a) shows the key stages that are relevant to address translation in general and PreTrans in particular. Consider a single load instruction ld r2, 4(r3) with the base-displacement addressing mode typical of the MIPS ISA (with corresponding analogues in other ISAs). The effective address is the sum of the content of register r3 (the base address) and the immediate operand 4 (the displacement). As such, the base address is available (either from the register file or from the bypass network) before the ALU operation that computes the effective (virtual) address. The effective address is then translated to the corresponding physical address to access the cache. We assume physically-tagged L1 caches. This solves the synonym problem and addresses some practical compatibility concerns (e.g., x86 page walkers require that cached data be accessible via physical addresses). However, one common optimization uses virtually-indexed, physically-tagged L1 caches, which enables the indexing part of the cache access to proceed without waiting for translation; only the tag check needs the translation to be complete. We include this optimization in our base case.
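To make the addressing concrete, the following minimal C sketch (ours, assuming 4 KB pages and therefore a 12-bit page offset) shows how the effective address and the virtual page number that must be translated are derived from the base and displacement:

    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12   /* assumes 4 KB pages; larger pages shift further */

    int main(void) {
        uint64_t base = 0x7f0012345670ULL;  /* hypothetical content of r3 */
        int64_t  disp = 4;                  /* the immediate displacement */

        uint64_t ea  = base + (uint64_t)disp;  /* effective (virtual) address */
        uint64_t vpn = ea >> PAGE_SHIFT;       /* page number to be translated */

        /* The base is known one cycle before the add completes; that one-cycle
         * head start is the slack PreTrans exploits. */
        printf("EA = 0x%" PRIx64 ", VPN = 0x%" PRIx64 "\n", ea, vpn);
        return 0;
    }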

Finally, the organization of the TLB matters. While server CPUs have moved to set-associative TLB organizations, CPUs that target lower-end clients (e.g., tablets) use small, fully-associative TLBs. For example, Intel's Atom [3] and ARM's Cortex A-15 [1] both use 32-entry, fully-associative TLBs. Fully-associative TLBs also enjoy the advantage that they support multiple page sizes without TLB duplication.

While the above discussion mainly centers on the data TLB (DTLB), the instruction TLB (ITLB) has similar constraints, except that the effective address is the program counter itself rather than the result of the base-displacement computation required for data access.

III. PREDICTABILITY OF EFFECTIVE PAGE NUMBERS

The key observation that we make in this paper is that the base address in base-displacement addressing is predictive of the final effective page number because of one or more of the following three reasons.

1) A significant fraction of static instructions use base addresses that map to exactly one effective page number (i.e., a one-to-one mapping) over the entire duration of the run. Note that this does not necessarily mean that the effective address is always on the same page as the base address. For such cases, using the previously-seen page number as a prediction will always be correct.

2) Though the static fraction of instructions whose base addresses enjoy one-to-one mappings to pages may not be high, the dynamic fraction of occurrences of such base addresses may be high. Again, the simple technique of using the prior page number as the prediction will be accurate.

3) Finally, even in cases where dynamic occurrences of base addresses with one-to-many mappings² are common, the prediction will be effective if there are significant run-lengths of the same base-address-to-page-number mapping.
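As a hypothetical illustration of reasons (1) and (3), consider a loop that walks an array of records through a base register; the function below is ours, not the paper's, and assumes 4 KB pages:

    #include <stdint.h>
    #include <stddef.h>

    struct record { int64_t key; int64_t value; };

    /* A compiler typically keeps 'recs + i' in a base register and accesses
     * 'value' as base + 8. With 16-byte records and 4 KB pages, the effective
     * page changes only once every 256 iterations, so "predict the last page"
     * is correct for long runs (and for the whole run if the array fits in
     * one page). */
    int64_t sum_values(const struct record *recs, size_t n) {
        int64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += recs[i].value;   /* base-displacement access: ld value, 8(base) */
        return sum;
    }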

Our measurements based on full-system simulation of SPEC and PARSEC reveal that, for the DTLB, page number predictability comes mostly from reason (2) and to a much lesser extent from reason (3).

²The one-to-many mapping is over the duration of the execution. A given dynamic instance of a base address maps to exactly one page number.

Fig. 1. Organization and timing of PreTrans: (a) baseline, (b) PreTrans.

On average (across PARSEC and SPEC2006), we found that only 38% of static base addresses had one-to-one mappings; thus, reason (1) was not the primary driver. However, when we examined the dynamic numbers, we found that over 85% of dynamic instances were those with one-to-one mappings (reason (2)). The final prediction accuracy (Section VI) was further enhanced by limited runs of stable base-to-page-number mappings (reason (3)).

IV. PRETRANS DESIGN

Given the above observation that page numbers are predictable, we design a simple table-based predictor that is accessed using the base address. Each PreTrans table (PTT) entry includes a tag (to prevent base-address aliasing) and the predicted page number. In addition, the entry includes a cached copy of the TLB entry (which includes the physical address and other permission bits) for the predicted page, which enables pre-translation in the common case.
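A minimal C sketch of one PTT entry, with field widths taken from Section V; the field names are ours, not the paper's:

    #include <stdint.h>
    #include <stdbool.h>

    /* One PTT entry: 8 B tag + 8 B predicted VPN + 8 B cached TLB payload = 24 B,
     * so a 32-entry PTT totals the 768 B quoted in Section V (the valid bit is
     * bookkeeping, shown separately for clarity). */
    struct ptt_entry {
        uint64_t base_tag;        /* tag from the base address; prevents aliasing */
        uint64_t predicted_vpn;   /* last-seen virtual page number for this base */
        uint64_t tlb_payload;     /* cached translation: 48-bit PPN plus permissions */
        bool     valid;
    };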

Without the physical address in the PTT entry, page prediction would not be useful, because the prediction becomes available at the same time as the computed effective address. However, with pre-translation, PreTrans offers two key benefits. First, if the prediction is correct (i.e., the predicted page and the page of the computed effective address are the same), we use the predicted translation for cache access without the need for a power-hungry DTLB access. Second, if the prediction is incorrect (i.e., the predicted and computed effective addresses differ), we are no worse off than before; we use the computed effective address to access the DTLB for translation.

Operation of PreTrans: Figure 1(b) illustrates the inclusion of the PTT in the baseline architecture. To validate the prediction, PreTrans compares the page number of the predicted effective address (VA') to the page number of the true virtual address (VA). The comparison involves a simple comparator in parallel with register bypass. Because the ALU operation with self-bypass is a single-stage pipeline loop, the clock cycle will accommodate the bypass delay as well. (An architecture that does not accommodate the bypass-network delay in a single clock cycle may not issue back-to-back dependent instructions, which can significantly degrade performance.)

It is instructive to compare speculative pretranslation to an L0 TLB, as both effectively cache TLB entries and filter accesses to the TLB. However, the PTT is different in that it achieves translation using a speculative page number. Vanilla L0 TLBs offer no such advantage, as they have to wait for effective address computation anyway.

Because PreTrans uses the previous base-to-page-number mapping as the prediction for the next, our predictor is effectively a cache. Like conventional caches, the PTT is managed in hardware in a demand-fault mode. Accesses that miss in the PTT are brought into the PTT after the translation has been obtained, either via the TLB or via page-fault handling. Accesses that hit but are mispredicted are updated with the most recent prediction and translation as well. Replacements are also handled as in caches; we use LRU replacement.

Because the PTT is both a cache and a prediction mechanism, it may fail to offer an accurate translation in one of two ways. First, PreTrans may predict the wrong virtual address (i.e., a misprediction), which is detected by a simple comparator after the true effective address is computed. Second, the PTT may not hold a prediction for a given base address (i.e., a cache miss), in which case we make no prediction or pre-translation. In either case, we update the PTT with the new prediction and translation.
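Putting the pieces together, the following hedged C sketch summarizes the lookup/validate/update flow described above; the direct-mapped indexing, the function names, and the stub tlb_lookup() are our simplifications, not the paper's implementation:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define PAGE_SHIFT  12          /* assumes 4 KB pages */
    #define PTT_ENTRIES 32

    struct ptt_entry {
        uint64_t base_tag;
        uint64_t predicted_vpn;
        uint64_t tlb_payload;
        bool     valid;
    };

    static struct ptt_entry ptt[PTT_ENTRIES];

    /* Stand-in for the energy-expensive CAM-based TLB search (fallback path). */
    static uint64_t tlb_lookup(uint64_t vpn) {
        return (vpn << 4) | 0x7;    /* fake PPN + permission bits, for illustration */
    }

    static size_t ptt_index(uint64_t base) {
        return (size_t)((base >> PAGE_SHIFT) % PTT_ENTRIES);  /* 1-way indexing */
    }

    /* Translate EA = base + disp. The PTT is read as soon as the base is known;
     * comparing the predicted page to the computed EA's page is the validation. */
    uint64_t pretrans_translate(uint64_t base, int64_t disp) {
        struct ptt_entry *e = &ptt[ptt_index(base)];
        uint64_t ea  = base + (uint64_t)disp;
        uint64_t vpn = ea >> PAGE_SHIFT;

        if (e->valid && e->base_tag == base && e->predicted_vpn == vpn)
            return e->tlb_payload;            /* prediction correct: no CAM search */

        uint64_t payload = tlb_lookup(vpn);   /* PTT miss or mispredict: fall back */
        e->base_tag = base;                   /* refill with the latest mapping */
        e->predicted_vpn = vpn;
        e->tlb_payload = payload;
        e->valid = true;
        return payload;
    }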

Handling instruction fetch: Unlike the base-displacement addressing mode of data accesses, the PC-direct addressing of instruction fetches provides no equivalent of the base address. Consequently, we rely on basic page-level locality and use the previous PC to provide a prediction for the next PC's page. Because instruction footprints are modest for our benchmarks, this simple prediction scheme achieves near-perfect prediction and pre-translation. Other applications with larger instruction footprints may see worse behavior. (Indeed, there are known classes of applications with large instruction footprints [14]. However, those workloads are not expected to run on client tablet/netbook platforms, which are our domain of interest.)
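A minimal sketch of this PC-page scheme under the same assumptions (the names and the stub itlb_lookup() are ours):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 12           /* assumes 4 KB pages */

    static uint64_t last_vpn;
    static uint64_t last_payload;
    static bool     last_valid;

    /* Stand-in for the CAM-based ITLB search (fallback path). */
    static uint64_t itlb_lookup(uint64_t vpn) {
        return (vpn << 4) | 0x5;    /* fake translation, for illustration */
    }

    /* Predict that the next fetch stays on the previous PC's page; consult the
     * ITLB only when the fetch leaves that page. */
    uint64_t fetch_translate(uint64_t pc) {
        uint64_t vpn = pc >> PAGE_SHIFT;
        if (last_valid && vpn == last_vpn)
            return last_payload;    /* common case: no CAM search */
        last_payload = itlb_lookup(vpn);
        last_vpn = vpn;
        last_valid = true;
        return last_payload;
    }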

Predictor Coherence: The operation of PreTrans requires that the PTT entries be kept coherent with the TLB. We achieve this by enforcing inclusion of the PTT with respect to the L1 TLB. This implies that any replacement or shootdown from the L1 TLB requires the deletion of the PTT entries corresponding to that TLB entry. This poses two challenges. First, we must map from a virtual address to the base addresses that hold that virtual address as their prediction. Second, because the PTT is indexed by base addresses, it is possible to have multiple entries in the PTT with the same predicted virtual address (and hence the same translation). In our experiments, we used a heavy-handed approach that locates all such entries via a CAM lookup for eviction, and we account for the power overheads of such CAM lookups. One may think that such associative search renders the PTT as power-hungry as the TLB. However, the PTT performs broadcast/CAM searches only on TLB misses (which cause evictions) rather than on every TLB access as in traditional TLBs. Moreover, we do this as a way to conservatively account for the PTT's power overheads; alternative implementations that use a RAM-based reverse-lookup table would use less power.
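A sketch of the inclusion-enforcing invalidation described above; modeling the paper's CAM search as a 32-entry scan is our simplification:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define PTT_ENTRIES 32

    struct ptt_entry { uint64_t base_tag, predicted_vpn, tlb_payload; bool valid; };
    static struct ptt_entry ptt[PTT_ENTRIES];   /* the table from the earlier sketch */

    /* On an L1 TLB eviction or shootdown of 'evicted_vpn', delete every PTT entry
     * whose cached translation covers that page, preserving PTT-in-TLB inclusion.
     * The paper charges this as a CAM search over the PTT; the scan below is the
     * functional equivalent. */
    void ptt_invalidate_page(uint64_t evicted_vpn) {
        for (size_t i = 0; i < PTT_ENTRIES; i++)
            if (ptt[i].valid && ptt[i].predicted_vpn == evicted_vpn)
                ptt[i].valid = false;
    }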

V. METHODOLOGY

We evaluate PreTrans using the gem5 full-system simulator [9]. We use the McPAT infrastructure for power/energy modeling [20].

We simulate both the ARM and x86 ISAs (the two ISAs with a significant presence in the tablet space), and for each ISA we have two system configurations: one-core and 4-core. Detailed system configurations can be found in Table I. Each core has one level of split instruction and data TLBs; each TLB is 32-entry and fully associative, modeled after the Cortex A-15 cores [1]³. The cache hierarchy has split L1 instruction and data caches associated with each core and a unified L2 cache. In the 4-core configuration, the unified L2 cache is shared among all 4 cores.

For comparison, we also include configurations with an L0 TLB of 4 or 8 entries. An L0 TLB miss can be served by the regular L1 TLB. We show performance comparisons assuming in-order processors, wherein we model the impact of an L0 TLB miss as a one-cycle delay before cache access. Because L0 misses are more expensive in out-of-order processors (see Section I), the in-order performance degradation serves as a lower bound on the performance degradation in out-of-order processors with L0 TLBs.

We simulate PARSEC [8] benchmarks as our multicore workloads and SPEC2006 CPU benchmarks [17] as our sequential workloads. We used the "simlarge" data sets for PARSEC and the reference inputs for SPEC2006 CPU benchmarks.

We evaluate two PTT configurations: a 32-entry, 1-way (direct-mapped) PTT and a 32-entry, 2-way set-associative PTT. Each PTT entry consists of three fields: a base-address tag, the predicted virtual page number, and its translation. Both the base-address tag and the predicted virtual page number fields are 8 bytes, corresponding to a 64-bit virtual address space. The translation field stores a copy of the translation from the backing TLB and is configured to hold the full TLB payload of 8 bytes (assuming 48 bits of physical address space and 16 bits for other information, including permissions). Therefore, for both ISAs, the total size of the PTT is (8 + 8 + 8) × 32 = 768 bytes.

VI. EXPERIMENTAL RESULTS

The key highlights from our results are as follows:

• PreTrans is effective in reducing TLB lookups for both the ARM and x86 ISAs. At identical performance, PreTrans reduces DTLB power by more than 71% and 48%, on average, for the ARM and x86 ISAs, respectively. ITLB power is almost wholly eliminated (i.e., PTT energy dominates total translation energy).

• Compared with smaller L0 TLBs, the PTT avoids the performance penalty by using speculative pre-translation (whereas L0 TLBs have to wait for address computation to complete).

³Even though the x86-based Atom has a 32-entry fully-associative ITLB and a 16-entry-per-thread fully-associative DTLB, we use uniform 32-entry fully-associative TLBs in our experimental configuration for simplicity.

A. Energy/Performance tradeoffs

Figure 2 plots normalized translation energy (x-axis) versus mean performance (y-axis) for the various configurations on the PARSEC and SPEC2006 CPU benchmarks for each of the two ISAs. Note that translation energy includes PTT access energy as well. Both energy and performance are normalized to the base case.

Broadly, we observe significant power savings with PreTrans in all four cases while maintaining the same performance as the original baseline. On average, we observe 91% savings for ARM PARSEC and 89% for ARM SPEC; for x86, the savings are 86% and 84% for PARSEC and SPEC, respectively. Further, the L0 configurations are strictly worse than our PreTrans configurations: they are worse in both energy savings and performance. Finally, from the clustering of the L0 points and the PTT points in the energy-performance space, we see that while the organization of the PTT and the size of the L0 have some impact on energy and performance, the overall trend (that PreTrans is better than an L0 TLB) is unchanged.

Figures 3(a) and 3(b) plot the PTT hit ratio for both ISAs under different configurations. Similarly, Figures 3(c) and 3(d) illustrate the energy consumption breakdown (averaged across all benchmarks). While higher associativity achieves a marginally better hit ratio (and marginally worse energy consumption) as expected, the difference is overshadowed by the effect of CAM-access elimination. Under the same ISA and benchmark suite, the instruction PTT has a higher prediction accuracy than the data PTT (on average 20% better for ARM and 50% better for x86). This also contributes to greater energy savings from the instruction PTT than from the data PTT. Due to limited space, we show the standard deviation of the hit ratio rather than each individual benchmark from PARSEC and SPEC2006 CPU.

Understanding prediction accuracy: Recall that a TLB lookup for a given base address may occur under two different types of failure: (1) a miss in the PTT, which implies that no prediction is available, and (2) a misprediction, which implies that the predicted effective address differs from the actual effective address. We found that changing the associativity of the PTT from 1 to 2 had little impact on the PTT miss rate, and that the dominant cause of prediction failure was misprediction (i.e., using the last page number caused the misprediction). Due to lack of space, we do not present details (graphs) of this result.

VII. DISCUSSION

PreTrans makes several simplifying assumptions in its implementation. As such, its prediction accuracy and resultant power savings are a lower bound on what may be achieved via speculative pre-translation. Some of the open questions include: (1) Can the use of the program counter (PC) in addition to the base address result in more accurate predictions? (2) Can a more sophisticated predictor (rather than the naive "predict the last outcome") be developed to increase accuracy? (3) Can PreTrans completely replace the TLB? We leave these issues for future work.

CPU        unicore                                                      4-core
L1 TLB     Private, split data and instruction, 32 entries,             Split data and instruction, 32 entries,
           fully associative                                            fully associative
L1 Cache   Private, split data and instruction, 32 KB,                  Private, split data and instruction, 32 KB,
           8-way set associative                                        8-way set associative
L2 Cache   Private, unified, 1 MB, 8-way set associative                Shared, unified, 4 MB, 8-way set associative

TABLE I. BASELINE SYSTEM CONFIGURATIONS

Fig. 2. Energy-Performance Tradeoffs: normalized translation energy (x-axis) versus normalized performance (y-axis) for (a) ARM PARSEC, (b) ARM SPEC, (c) x86 PARSEC, and (d) x86 SPEC, each comparing Basecase, PTT 1-way, PTT 2-way, L0-4, and L0-8.

Fig. 3. Sensitivity to Predictor organization: (a) PTT hit ratio (PARSEC) and (b) PTT hit ratio (SPEC2006 CPU) for the ARM and x86 instruction and data PTTs (PTT 1-way, PTT 2-way, Basecase); (c) and (d) normalized TLB energy breakdown (instruction versus data translation energy) for PARSEC and SPEC2006 CPU.


VIII. RELATED WORK

Many speculative techniques focused on performance use some form of address and/or value prediction. However, because we focus on speculation for energy reduction, there are some key differences, as described below.

Prefetching fundamentally relies on address prediction [13], [12], [10]. However, such prediction is across instructions (i.e., one needs to predict addresses that will be accessed by future instructions). More importantly, PreTrans requires only coarse-grain (i.e., page-grain) prediction, which is enough for translation. In contrast, prefetching requires cache-block-grain prediction.

Value prediction [21], [15] enables early prediction of values rather than addresses. Even if values can be predicted in the common case, the actual translation and cache access must still occur to validate the speculation. Thus there is no reduction in energy costs, which is the focus of PreTrans.

There exist research proposals for zero-cycle loads as a mechanism to achieve fast data loads [5], [4]. While such techniques do move the cache access earlier in the pipeline, such early accesses need fast translation as well. Without filtering TLB accesses, CAM accesses will still be used to achieve translation.

SpecTLB proposes speculative translation by using address interpolation on TLB misses in systems that allow superpage promotion [6]. In contrast, PreTrans speculates on the TLB hit path. Thus, PreTrans and SpecTLB can coexist, as they target independent opportunities.

Recent work [7] has observed that selective virtual caching at the L1 level can eliminate the need for translation in the common case, thus reducing TLB access energy. While we go after the same opportunity (i.e., TLB energy savings), we do so in a software-transparent way. In contrast, opportunistic virtual caching requires software (OS) changes, which may pose commercial barriers, as chip vendors and OS vendors must cooperate for such changes to be widely adopted.

There is a wide spectrum of hardware/circuit mechanisms that optimize the energy of TLB accesses, such as [18], [19]. Such TLB energy optimizations are orthogonal to our technique, which reduces the number of TLB accesses.

Finally, the problem of translation goes away with virtual caches. However, other practical problems, such as homonyms and synonyms, arise in multi-address-space systems. Various techniques have been proposed to address these problems [11], [16], [22].

IX. CONCLUSION

This paper uses prediction and speculation to reduce the energy/power consumed by CAM searches in TLBs without negatively impacting performance. To that end, we design PreTrans, which leverages the facts that (1) page numbers are predictable based on base addresses in base-displacement addressing mode, and (2) predicted pages may be pre-translated by caching translations. In the common case, our speculative pretranslation, which uses ordinary (RAM) table lookups, eliminates the need for energy-expensive CAM lookups.

Compared to an L0 TLB, PreTrans achieves lower power and better performance because (1) unlike L0 TLBs, it is accessed speculatively before the true effective address is computed, and (2) PreTrans filters a higher fraction of energy-expensive L1 DTLB accesses, and almost all ITLB accesses, while using lower-associativity lookups than typical L0 TLBs. Simulations reveal that PreTrans achieves TLB energy/power savings of 90% (85%) on average for ARM (x86).

REFERENCES

[1] Cortex-A15 MPCore processors. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0438g/index.html.

[2] Cortex-A9 series processors. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/Chddijbd.html.

[3] Intel 64 and IA-32 Architectures Software Developer's Manual. 3A:399.

[4] T. Austin, D. Pnevmatikatos, and G. Sohi. Streamlining data cache access with fast address calculation. In Proceedings of the 22nd Annual International Symposium on Computer Architecture (ISCA), pages 369-380, 1995.

[5] T. Austin and G. Sohi. Zero-cycle loads: microarchitecture support for reducing load latency. In Proceedings of the 28th Annual International Symposium on Microarchitecture (MICRO), pages 82-92, 1995.

[6] T. W. Barr, A. L. Cox, and S. Rixner. SpecTLB: a mechanism for speculative address translation. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11), pages 307-318, 2011.

[7] A. Basu, M. D. Hill, and M. M. Swift. Reducing memory reference energy with opportunistic virtual caching. In Proceedings of the 39th International Symposium on Computer Architecture (ISCA '12), pages 297-308, 2012.

[8] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The PARSEC benchmark suite: characterization and architectural implications. Technical Report TR-811-08, Princeton University, January 2008.

[9] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1-7, Aug. 2011.

[10] D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), pages 40-52, 1991.

[11] M. Cekleov and M. Dubois. Virtual-address caches, part 1: problems and solutions in uniprocessors. IEEE Micro, 17(5):64-71, Sep./Oct. 1997.

[12] T.-F. Chen and J.-L. Baer. A performance study of software and hardware data prefetching schemes. In Proceedings of the 21st Annual International Symposium on Computer Architecture (ISCA), pages 223-232, 1994.

[13] W. Y. Chen, S. A. Mahlke, P. P. Chang, and W.-m. W. Hwu. Data access microarchitectures for superscalar processors with compiler-assisted data prefetching. In Proceedings of the 24th Annual International Symposium on Microarchitecture (MICRO 24), pages 69-73, 1991.

[14] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the clouds: a study of emerging scale-out workloads on modern hardware. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVII), pages 37-48, New York, NY, USA, 2012. ACM.

[15] F. Gabbay and A. Mendelson. Using value prediction to increase the power of speculative execution hardware. ACM Transactions on Computer Systems, 16:234-270, 1998.

[16] J. R. Goodman. Coherency for multiprocessor virtual address caches. SIGOPS Oper. Syst. Rev., 21(4):72-81, Oct. 1987.

[17] J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4):1-17, Sept. 2006.

[18] T. Juan, T. Lang, and J. J. Navarro. Reducing TLB power requirements. In Proceedings of the 1997 International Symposium on Low Power Electronics and Design (ISLPED '97), pages 196-201, 1997.

[19] I. Kadayif, A. Sivasubramaniam, M. Kandemir, G. Kandiraju, and G. Chen. Generating physical addresses directly for saving instruction TLB energy. In Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35), pages 185-196, 2002.

[20] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42), pages 469-480, 2009.

[21] M. H. Lipasti and J. P. Shen. Exceeding the dataflow limit via value prediction. In Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO 29), pages 226-237, Washington, DC, USA, 1996. IEEE Computer Society.

[22] W. H. Wang, J.-L. Baer, and H. M. Levy. Organization and performance of a two-level virtual-real cache hierarchy. In Proceedings of the 16th Annual International Symposium on Computer Architecture (ISCA '89), pages 140-148, 1989.