
Accurate Memory Data Flow Modeling in Statistical Simulation

Davy Genbrugge   Lieven Eeckhout   Koen De Bosschere
ELIS, Ghent University, Sint-Pietersnieuwstraat 41, B-9000 Gent, Belgium

{dgenbrug,leeckhou,kdb}@elis.UGent.be

ABSTRACT

Microprocessor design is a very complex and time-consuming activity. One of the primary reasons is the huge design space that needs to be explored in order to identify the optimal design given a number of constraints. Simulations are usually used to explore these huge design spaces; however, they are fairly slow. Several hundreds of billions of instructions need to be simulated per benchmark, and this needs to be done for every design point of interest.

Recently, statistical simulation was proposed to efficiently cull a huge design space. The basic idea of statistical simulation is to collect a number of important program characteristics and to generate a synthetic trace from them. Simulating this synthetic trace is extremely fast as it contains only a million instructions.

This paper improves the statistical simulation methodology by proposing accurate memory data flow models. We model (i) load forwarding, (ii) delayed cache hits, and (iii) correlation between cache misses based on path info. Our experiments using the SPEC CPU2000 benchmarks show a substantial improvement upon current state-of-the-art statistical simulation methods. For example, for our baseline configuration we reduce the average IPC prediction error from 10.7% to 2.3%. In addition, we show that performance trends are predicted very accurately, making statistical simulation enhanced with accurate data flow models a useful tool for efficient and accurate microprocessor design space explorations.

Categories and Subject Descriptors
C.4 [Performance of Systems]: Modeling Techniques

General Terms
Experimentation, Measurement, Performance

Keywords
Performance Modeling, Statistical Simulation, Memory Data Flow Modeling

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICS'06, June 28-30, Cairns, Queensland, Australia
Copyright 2006 ACM 1-59593-282-8/06/0006 ...$5.00.

1. INTRODUCTION

Designing a microprocessor is extremely time-consuming (up to seven years [18]). Computer designers and architects heavily rely on simulation tools for exploring the huge design space. These simulation tools are at least three or four orders of magnitude slower than real hardware execution. In addition, architects and designers use long-running benchmarks that are built from real-life applications; today's benchmarks have several hundreds of billions of dynamically executed instructions. The end result is that simulating a single benchmark can take days to weeks, and this is to simulate just a single microarchitectural configuration. As a result, exploring a huge design space in order to find an optimal trade-off between performance, cost, power consumption, etc., is impossible through detailed simulation.

This is a well recognized problem and several researchers have proposed solutions to it, such as reduced input sets [16], statistical sampling [5, 25], targeted sampling [23] and analytical modeling [15]. In this paper we address another approach, namely statistical simulation [4, 6, 7, 8, 19, 20, 21]. The basic idea of statistical simulation is as follows. A number of program characteristics are measured from a real program execution in a so-called statistical profile. A statistical profile contains various program characteristics such as the statistical control flow graph, instruction mix distribution, inter-operation dependency distribution, cache miss info, branch miss info, etc. A synthetic trace is then generated from this statistical profile, which is then simulated on a simple trace-driven statistical simulator. The main advantage of statistical simulation is that the synthetic trace is very small: only one million instructions at most. By consequence, simulating a synthetic trace is done very fast. This property makes statistical simulation an excellent technique to complement the other tools a computer designer has at his disposal when designing a microprocessor [8]. For example, statistical simulation could be used to efficiently cull a huge design space, after which more detailed simulations evaluate a much smaller region of interest [9]. Note that the goal of statistical simulation is not to replace detailed simulation but to make quick performance estimates early in the design process with little development time.

Previous work on statistical simulation, however, considers simple memory data flow modeling. The statistics considered in previously proposed statistical simulation approaches show three important shortcomings. First, statistical simulation typically assigns hits and misses to loads. When a load hit is simulated in the statistical simulator, the load gets assigned the access latency of the given cache level. When a load miss is simulated, the load gets assigned the access latency of the next cache level. By consequence, delayed hits are not modeled. A delayed hit is a load hit referencing the same cache line as a pending memory reference; the load hit then sees a part of the latency of the pending cache line. Second, none of the previously proposed statistical simulation approaches adequately models load bypassing and load forwarding, i.e., it is assumed that loads never alias with preceding stores. Third, none of this prior work models the correlation that may exist between cache misses. Cache miss correlation results in specific cache miss patterns that have an important impact on overall performance. Not modeling these cache miss patterns can lead to inaccurate performance estimates. In this paper, we address all three shortcomings by proposing how to extend statistical simulation for dealing with (i) delayed hits, (ii) load forwarding and (iii) cache miss correlation. We show that it is important to accurately model memory data flow and we provide techniques for achieving that. Our experimental results using the SPEC CPU2000 benchmarks show that accurate memory data flow modeling reduces the average performance prediction error from 10.7% down to 2.3%. The maximum prediction error observed with memory data flow modeling is 12.7%, whereas no memory data flow modeling could result in prediction errors up to 68%. Next to improving performance prediction accuracy in a single design point, we also show that memory data flow modeling improves performance trend prediction, which is extremely important for design space explorations.

This paper is organized as follows. We first revisit prior work on statistical simulation. We then describe the statistical simulation methodology in detail in section 3. Section 4 then describes the contributions made in this paper, i.e., it discusses how to accurately model memory data flow. Section 5 details the experimental setup. We then evaluate how memory data flow modeling improves statistical simulation in section 6. Finally, we conclude in section 7.

2. PRIOR WORK

Statistical simulation has received increased interest over recent years. Noonburg and Shen [19] proposed to model a program execution as a Markov chain in which the states are determined by the microarchitecture and the transition probabilities by the program. Extending this approach to large-resource out-of-order architectures is infeasible because of the exploding complexity of the Markov chain.

More recently, a number of papers have been published on the statistical simulation framework as we envision it in this paper. The idea is to collect a number of important program characteristics and to generate a synthetic trace from them that is then simulated on a simple trace-driven statistical simulator. The initial models proposed along this approach were fairly simple [4, 7] in the sense that mostly aggregate statistics were used to model the program execution; these approaches did not model characteristics at the basic block level. Oskin et al. [21] proposed the notion of a graph with transition probabilities between the basic blocks. The graph however was built up from aggregate statistics. They also showed how statistical simulation can be used to easily explore the workload space by varying the statistical profile. Nussbaum and Smith [20] correlated various program characteristics to the basic block size. Eeckhout et al. [6] proposed the notion of the statistical flow graph (SFG), which models the control flow in a statistical manner. The various program characteristics are then correlated to the SFG.

Iyengar et al. [12, 13] take a different approach in SMART by generating synthetic program traces using the notion of a fully qualified instruction. A fully qualified instruction is an instruction along with its context. The context of a fully qualified instruction consists of its n preceding singly qualified instructions. A singly qualified instruction is an instruction along with its instruction type, I-cache behavior, TLB behavior, and, if applicable, its branching behavior and D-cache behavior. SMART thus makes a distinction between two fully qualified instructions that have the same history of preceding instructions but differ in a singly qualified instruction; that singly qualified instruction can be a cache miss in one case while being a hit in another case. Modeling a program execution using fully qualified instructions obviously requires a lot of memory space to collect the statistical profile. The authors also report that for some benchmarks, information needed to be erased from the statistical profile in order not to exceed the amount of memory available in the machine they did their experiments on. Another important distinction we would like to make concerning this prior work is that Iyengar et al. generate synthetic address streams rather than cache hits and misses. This makes our synthetic traces shorter than theirs, because warming up large caches with memory references requires more instructions than the one million instructions we consider in our synthetic traces.

Recent work also focused on generating synthetic benchmarks rather than synthetic traces. Hsieh and Pedram [11] generate a fully functional program from a statistical profile. However, the statistical profile only contains microarchitecture-dependent characteristics, which makes this technique useless for design space explorations. Bell and John [1] generate short synthetic benchmarks using a collection of microarchitecture-independent and microarchitecture-dependent characteristics similar to what is done in statistical simulation. Their goal is performance model validation using small but representative synthetic benchmarks.

None of this prior work considered the accurate statistical modeling of the memory data flow. All of these approaches use simple cache hit/miss probabilities to model cache behavior. In this paper we show that accurate modeling of the memory data flow requires more advanced techniques. Before detailing how we improve memory data flow modeling, we first describe the statistical simulation method.

3. STATISTICAL SIMULATION

Statistical simulation consists of three steps, as shown in Figure 1. We first measure a statistical profile, which is a collection of important program execution characteristics such as instruction mix, inter-instruction dependencies, branch miss behavior, cache miss behavior, etc. Subsequently, this statistical profile is used to generate a synthetic trace consisting of only one million instructions. In the final step, this synthetic trace is simulated on a statistical simulator, which yields performance metrics such as IPC. In the following subsections, we discuss all three steps.

3.1 Statistical profiling

In statistical profiling we make a distinction between microarchitecture-independent characteristics and microarchitecture-dependent characteristics. The microarchitecture-independent characteristics can be used across microarchitectures during design space exploration. The microarchitecture-dependent characteristics, on the other hand, are particular to specific (subparts of) a microarchitecture.

[Figure 1: benchmark → microarchitecture-independent profiling tool and specialized simulation of locality events → (1) statistical profile containing: statistical flow graph, instruction types, number of operands per instruction, dependency distance per operand, branch characteristics, cache characteristics → (2) synthetic trace generation → synthetic trace → (3) synthetic trace simulation → performance metrics]
Figure 1: Statistical simulation: general framework.

3.1.1 Statistical flow graph

The key structure in the statistical profile is the statistical flow graph (SFG) [6], which represents the control flow in a statistical manner. In an SFG, the nodes are the basic blocks along with their basic block history, i.e., the basic blocks being executed prior to the given basic block. The order of the SFG is defined as the length of the basic block history, i.e., the number of predecessors to a basic block in each node of the SFG. The order of an SFG will be denoted with the symbol k throughout the paper. For example, consider the following basic block sequence 'ABBAABAABBA'. The 4th-order SFG then makes a distinction between basic block 'A' given its basic block history 'ABBA', 'BAAB', 'AABA' and 'AABB'; this SFG will thus contain the following nodes: 'A|ABBA', 'A|BAAB', 'A|AABA' and 'A|AABB'. The edges in the SFG interconnecting the nodes represent transition probabilities between the nodes. The idea behind the SFG is to model all the other program characteristics along the nodes of the SFG. This allows for modeling program characteristics that are correlated with path behavior. This means that for a given basic block, different statistics are computed for different basic block histories. For example, in a 4th-order SFG, in case the basic block history for basic block 'A' is 'ABBA', the probability for a cache miss might be different from the case where the basic block history for 'A' is 'BAAB'. Such cases can be modeled in a 4th-order SFG. On the other hand, in case a correlation between program characteristics spans a number of basic blocks that is larger than the SFG's order, it will be impossible to model such correlations within the SFG, unless the order of the SFG is increased.
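To make the SFG construction concrete, the following minimal sketch (our illustration in Python, not the authors' profiling tool; all names are hypothetical) builds the nodes and transition counts of a k-th order SFG from a dynamic basic block sequence:

    # build_sfg.py: minimal k-th order SFG sketch (hypothetical helper)
    from collections import defaultdict

    def build_sfg(bb_trace, k):
        """Return {node: {successor: count}}; a node is a basic block
        plus its k-deep basic block history, e.g. 'A|ABBA'."""
        edges = defaultdict(lambda: defaultdict(int))
        for i in range(k, len(bb_trace) - 1):
            node = bb_trace[i] + '|' + ''.join(bb_trace[i - k:i])
            succ = bb_trace[i + 1] + '|' + ''.join(bb_trace[i - k + 1:i + 1])
            edges[node][succ] += 1
        return edges

    # the paper's example sequence; a 4th-order SFG distinguishes
    # 'A|ABBA', 'A|BAAB', 'A|AABA' and 'A|AABB'
    sfg = build_sfg(list('ABBAABAABBA'), k=4)
    for node, succs in sfg.items():
        total = sum(succs.values())
        for succ, count in succs.items():
            print(node, '->', succ, 'p=%.2f' % (count / total))

Normalizing the per-node counts, as in the print loop, yields the transition probabilities that label the SFG's edges.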

3.1.2 Microarchitecture-independent characteristics

The first microarchitecture-independent characteristic is the instruction mix. We classify the instruction types into 12 classes according to their semantics: load, store, integer conditional branch, floating-point conditional branch, indirect branch, integer alu, integer multiply, integer divide, floating-point alu, floating-point multiply, floating-point divide and floating-point square root. For each instruction we also record the number of source operands. Note that some instruction types, although classified within the same instruction class, may have a different number of source operands.

For each operand we also record the dependency distance, which is the number of dynamically executed instructions between the production of a register value (register write) and the consumption of it (register read). We only consider read-after-write (RAW) dependencies since our focus is on out-of-order architectures, in which write-after-write (WAW) and write-after-read (WAR) dependencies are dynamically removed through register renaming as long as enough physical registers are available. Note that recording the dependency distance requires storing a distribution, since multiple dynamic versions of the same static instruction could result in multiple dependency distances. Although very large dependency distances can occur in real program traces, we can limit this dependency distribution for our purposes to the maximum reorder buffer size we want to consider during statistical simulation. This, however, limits the number of in-flight instructions that can be modeled. In our study, we limit the dependency distribution to 512, which still allows the modeling of a wide range of current and near-future microprocessors.
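As an illustration of how such a distribution can be collected (a sketch under our own assumptions, not the authors' tool), one can track per register the dynamic instruction index of its last write:

    # raw_distance.py: RAW dependency distance profiling sketch
    from collections import Counter

    MAX_DIST = 512  # clamp at the largest ROB size considered

    def raw_distance_distribution(trace):
        """trace: list of (dest_regs, src_regs) tuples, one per dynamic
        instruction, in program order. Returns Counter: distance -> count."""
        last_writer = {}  # register name -> dynamic instruction index
        distances = Counter()
        for i, (dests, srcs) in enumerate(trace):
            for reg in srcs:
                if reg in last_writer:
                    distances[min(i - last_writer[reg], MAX_DIST)] += 1
            for reg in dests:
                last_writer[reg] = i
        return distances

    # r3 = r1 + r2; r4 = r3 + r1; r5 = r4 + r3  (r1, r2 live on entry)
    print(raw_distance_distribution([(['r3'], ['r1', 'r2']),
                                     (['r4'], ['r3', 'r1']),
                                     (['r5'], ['r4', 'r3'])]))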

3.1.3 Microarchitecture-dependent characteristics

In addition to these microarchitecture-independent characteristics, we also measure a number of microarchitecture-dependent characteristics that are related to locality events. The reason for choosing to model these events in a microarchitecture-dependent way is that locality events are hard to model using microarchitecture-independent metrics. We therefore take a pragmatic approach and collect cache miss and branch miss info for particular cache configurations and branch predictors.

For the branch statistics we consider (i) the probability for a taken branch, (ii) the probability for a fetch redirection (target misprediction in conjunction with a correct taken/not-taken prediction for conditional branches), and (iii) the probability for a branch mispredict. When measuring the branch statistics we consider a FIFO buffer as described in [6] in order to model delayed branch predictor update.

The cache statistics consist of the following six probabilities: (i) the L1 I-cache miss rate, (ii) the L2 cache miss rate due to instructions only1, (iii) the L1 D-cache miss rate, (iv) the L2 cache miss rate due to data accesses only, (v) the I-TLB miss rate and (vi) the D-TLB miss rate.

We want to re-emphasize that all the program characteristics discussed above, both the microarchitecture-dependent and the microarchitecture-independent characteristics, are measured in the context of an SFG. This means that separate statistics are kept for different basic block histories or execution paths.

3.2 Synthetic trace generation

The second step in the statistical simulation methodology is to generate a synthetic trace from the statistical profile. The synthetic trace generator takes as input the statistical profile and outputs a synthetic trace that is fed into the statistical simulator. Synthetic trace generation uses random number generation for generating a number in [0,1]; this random number is then used with the cumulative distribution function to determine a program characteristic. The synthetic trace is a linear sequence of synthetic instructions. Each instruction has an instruction type, a number of source operands, an inter-instruction dependency for each source operand (which describes the producer of the given source operand), I-cache miss info, D-cache miss info (in case of a load), and branch miss info (in case of a branch). The locality miss events are just labels in the synthetic trace describing whether the load is an L1 D-cache hit, an L2 hit or an L2 miss, and whether the load generates a TLB miss. Similar labels are assigned for the I-cache and branch miss events.
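Determining a characteristic from a random number in [0,1] and a cumulative distribution function is standard inverse-transform sampling; a minimal sketch (hypothetical helper names):

    # sample_cdf.py: inverse-transform sampling sketch
    import bisect
    import random

    def make_sampler(histogram):
        """histogram: dict value -> count. Returns a zero-argument
        function drawing a value with probability proportional to
        its count."""
        values = list(histogram)
        cumulative = []
        total = 0
        for v in values:
            total += histogram[v]
            cumulative.append(total)
        def sample():
            r = random.random() * total          # uniform in [0, total)
            return values[bisect.bisect_right(cumulative, r)]
        return sample

    # e.g. a measured dependency distance distribution for one operand
    draw = make_sampler({1: 50, 2: 30, 8: 20})
    print([draw() for _ in range(10)])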

1 We assume a unified L2 cache. However, we make a distinction between L2 cache misses due to instructions and due to data.


A new statistical profile is needed when varying:
- branch predictor (type and size)
- cache hierarchy (number of cache levels, size, associativity, line size, replacement policy, line updating policy)

No new statistical profile is needed when varying:
- processor width (fetch, decode, dispatch, issue and commit width)
- pipeline depth (front-end pipeline depth)
- ROB size
- LSQ size
- fetch buffer size
- number and types of the functional units
- instruction execution latencies
- memory hierarchy (L1, L2 and DRAM) access latencies

Table 1: Example microarchitectural parameters that do or do not require that a new statistical profile be computed.

3.3 Synthetic trace simulation

Simulating the synthetic trace is fairly straightforward. In fact, the synthetic trace simulator itself is very simple as it does not need to model branch predictors nor cache hierarchies; also, all the ISA's instruction types are collapsed into a limited number of instruction types. Synthetic trace simulation differs from conventional architectural simulation in the following cases:

• On a branch mispredict, synthetic instructions are fed into the pipeline as if they were from the correct path. When the branch is resolved, the pipeline is squashed and refilled with synthetic instructions from the correct path. This is to model resource contention in case of a branch mispredict. Note that the statistical profile mentioned above does not consider off-path instructions; those statistics only concern on-path instructions.

• For executing a load instruction, the load's latency is determined based on the label it has. An L1 hit gets assigned the L1 access latency, an L2 hit gets assigned the L2 access latency and an L2 miss gets assigned the memory access latency.

• Similar actions are undertaken for the I-cache. In case of an L1 miss, the fetch engine stops fetching for a number of cycles equal to the L2 access latency, etc.

The important benefit of statistical simulation is that the synthetic traces are fairly short. The performance metrics such as IPC quickly converge to a steady-state value when simulating a synthetic trace. As such, synthetic traces containing a million instructions are sufficient for obtaining stable and accurate performance estimations.

3.4 Discussion on applicability

Before presenting our improved memory data flow modeling approach, we first would like to discuss the use of (a number of) microarchitecture-dependent characteristics in the statistical profile. This requires that whenever a new branch predictor or a new cache hierarchy is to be considered during design space exploration, a new statistical profile needs to be measured. To address this issue, techniques can be used for measuring cache profiles for multiple caches simultaneously in a single profiling run [24]. Since a statistical profile also contains a number of microarchitecture-independent characteristics, a very large number of microarchitectural parameters can still be varied during design space exploration without requiring the computation of a new statistical profile, see Table 1. As such, statistical simulation can yield substantial speedups during design space exploration. Note also that the memory data flow enhancements proposed in this paper do not affect the general applicability of the statistical simulation approach; i.e., the same design space can still be explored from a single statistical profile, however, we now achieve more accurate performance predictions.

4. MEMORY DATA FLOW MODELING

As mentioned in the introduction, previously proposed statistical simulation approaches did not consider advanced memory data flow statistics. These approaches basically keep track of aggregate cache hit/miss information. In this section we detail the three additional features that we propose in this paper: (i) cache miss correlation, (ii) load forwarding and (iii) delayed hits.

4.1 Cache miss correlation

The first important memory data flow characteristic that we model is cache miss correlation. Cache miss correlation refers to the fact that the cache miss behavior of a particular memory operation (load or store) can be highly correlated with the cache miss behavior of (a) preceding memory operation(s). Consider for example a loop that walks over an array. Each element in the array is 8 bytes long and a cache line is 32 bytes long. As a result, in case the array is not residing in the cache, a cache miss will occur every four iterations of the loop. This cannot be modeled accurately in previously proposed statistical simulation frameworks. All iterations of this same loop will collapse into a single number, namely the cache miss rate (which is 25%: one miss per four accesses) for that particular static load in the loop, i.e., no distinction is made between different iterations of the loop. Even the statistical flow graph is incapable of modeling this behavior accurately, because the basic block history for that load is always the same sequence of basic blocks from the loop itself. This results in aggregate statistics that average over multiple executions of the same load, which, in its turn, results in inaccurate modeling of overlapping cache misses in the synthetic trace.

When modeling cache miss correlation we collect separate cache miss rate statistics per static memory operation, conditionally dependent on the memory operation's global cache miss history. The global cache miss history is a concatenation of the most recent cache hit/miss outcomes. In the above example where a loop walks over an array, cache miss correlation allows for making a distinction between the load operation that results in a cache miss and the other load operations that result in cache hits. The global cache miss history for the load miss looks like '0111', where a '0' denotes a cache miss and a '1' denotes a cache hit. The probability for a cache miss for that static load given its global cache miss history '0111' then equals 100%. The probability for a cache miss for that same static load is then 0% for the other global cache miss histories that occur for that load, namely '1011', '1101' and '1110'. By doing so, a more accurate statistical profile is collected, and a more representative synthetic trace can be generated.

An important choice that needs to be made for modeling cache miss correlation is how deep the global cache miss history should be. In our implementation we choose the global cache miss history as deep as the number of preceding loads/stores in the basic block history. This means that in a k-th order SFG, all the preceding load/store hit/miss outcomes in the k-deep basic block history (not just the hit/miss outcomes of the given static memory operation) serve as a history for the current memory operation's hit/miss probability. By doing so, we correlate the current load/store hit/miss outcome with the preceding hit/miss outcomes from the k-deep basic block history.

[Figure 2: a program trace with a load miss x to cache line A followed by a load y to the same line, and three execution scenarios on a time axis: x finishes before y issues, y issues while x is outstanding, and y issues before x.]
Figure 2: Modeling delayed hits in the synthetic trace simulator.

During synthetic trace generation we then use the cache miss correlation statistics for driving the generation of synthetic cache misses. When a hit/miss outcome needs to be determined, the global hit/miss history generated so far is used to search the cache miss correlation statistics for the given memory operation. The hit/miss probability corresponding to the best matching hit/miss history for the given memory operation is then used for determining whether the memory operation will cause a hit or a miss.
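The following sketch (our illustration; names are hypothetical) shows both halves of the mechanism: recording miss probabilities per static memory operation conditioned on the global hit/miss history, and sampling from them during synthetic trace generation:

    # miss_correlation.py: cache miss correlation sketch
    import random
    from collections import defaultdict

    class MissCorrelation:
        def __init__(self, depth):
            self.depth = depth
            # (static op, history string) -> [miss count, total count]
            self.table = defaultdict(lambda: [0, 0])

        def record(self, op, history, is_miss):
            entry = self.table[(op, history[-self.depth:])]
            entry[0] += int(is_miss)
            entry[1] += 1

        def sample(self, op, history):
            miss, total = self.table.get((op, history[-self.depth:]), (0, 0))
            if total == 0:
                return False                 # unseen history: assume hit
            return random.random() < miss / total

    mc = MissCorrelation(depth=4)
    # the paper's array walk: a miss ('0') every fourth access
    outcomes = '0111' * 8
    for i, o in enumerate(outcomes[4:], start=4):
        mc.record('ld_A', outcomes[:i], o == '0')
    print(mc.sample('ld_A', '0111'))  # miss probability 100% -> True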

4.2 Load forwarding

Out-of-order execution of memory operations is an important source of performance gain in out-of-order microprocessors. The goal is to execute load instructions as soon as possible (as soon as their source operands are ready), provided that read-after-write (RAW) dependencies through memory are respected. By doing so, load instructions possibly get executed before preceding store instructions.

Early out-of-order execution of loads is achieved in out-of-order microprocessors through two techniques, load bypassing and load forwarding [14]. Load bypassing refers to executing a load earlier than preceding stores; this is possible provided that the load address does not alias with those stores. In case the load aliases with a preceding store, i.e., there is a RAW dependency, load forwarding allows the load to retrieve its data directly from the store without accessing the memory hierarchy.

Modeling load bypassing and load forwarding can be done in statistical simulation by measuring the RAW memory dependency distribution. This distribution quantifies the probability that a load aliases with a preceding store j−1, j−2, j−3, etc. Again, this distribution is measured on a per-instruction basis in the context of the SFG. In the synthetic trace, RAW dependencies through memory are then marked between the memory operations. During synthetic trace simulation, this information is used to determine the scheduling of memory operations, i.e., to determine whether or not load bypassing or load forwarding is possible. Note that under load bypassing, alias checking must be done for completing stores against executed loads. If aliasing is detected, the load and its dependent instructions must be reissued. This may come at a performance penalty. All of this can be modeled in statistical simulation.
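A sketch of how the RAW memory dependency distribution could be measured (our illustration, hypothetical names): for every load, look up the most recent store to the same address and record how many stores back it occurred:

    # mem_raw.py: store-to-load dependency distance sketch
    from collections import Counter

    def mem_raw_distribution(mem_trace):
        """mem_trace: iterable of ('ld'|'st', address) in program order.
        Returns Counter mapping 'the j-th most recent store' -> count;
        key None means the load does not alias with any prior store."""
        last_store = {}   # address -> store sequence number
        store_seq = 0
        dist = Counter()
        for op, addr in mem_trace:
            if op == 'st':
                store_seq += 1
                last_store[addr] = store_seq
            else:
                j = last_store.get(addr)
                dist[store_seq - j + 1 if j else None] += 1
        return dist

    # st A; st B; ld A (aliases with 2nd most recent store); ld C (no alias)
    print(mem_raw_distribution([('st', 'A'), ('st', 'B'),
                                ('ld', 'A'), ('ld', 'C')]))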

4.3 Delayed hits

The caches in contemporary microprocessors typically are non-blocking caches [10, 17]. Non-blocking caches allow for overlapping cache misses by putting aside load misses while servicing other load instructions. These in-overlap serviced load instructions can also be cache misses. Non-blocking caches have an important impact on overall performance. As such, it is important to model their impact on performance adequately.

In current statistical simulation frameworks, however, only cache hits and cache misses are considered. In case of non-blocking caches, load instructions can see latencies that are different from the L1 access latency, L2 access latency and main memory access latency. Consider for example the case where a load accesses cache line A at time t100 and this is a cache miss to L2; the load thus finishes execution at time t120 in case the L2 access latency is 20 cycles. Assume now another load accessing the same cache line at time t107; this load will then see a load execution latency of 13 cycles. The latter load then is a delayed hit or a secondary miss. Current statistical simulation frameworks will consider the delayed hit as a hit and will assign the L1 access latency to this load, which is a serious underestimation of the load's execution latency.

In order to model delayed hits within statistical simulation, we compute the missed cache line reuse distance, or the number of memory references between two memory references accessing the same cache line, of which the first memory reference in the dynamic instruction stream is a cache miss. Note that this is measured per instruction depending on the basic block history (through the SFG). Since an instruction may have multiple missed cache line reuse distances depending on the instruction's basic block history, we in fact measure a distribution of the missed cache line reuse distance. Note that the missed cache line reuse distance is measured for both load and store operations; this allows for modeling delayed hits for various cache write policies (write-back and write-through) and cache allocation policies (write-allocate and write non-allocate). An additional optimization that we explored is to measure the missed cache line reuse distance distribution conditionally on the cache miss correlation info. As such, we are able to more accurately model delayed hits based on global cache miss history information. This was beneficial for the accurate modeling of several benchmarks, as will be shown in the evaluation section.
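The missed cache line reuse distance could be measured along the following lines (a sketch under our own assumptions; the line size and names are illustrative):

    # reuse_distance.py: missed cache line reuse distance sketch
    LINE_SIZE = 32

    def missed_line_reuse_distances(refs, is_miss):
        """refs: byte addresses of loads and stores in program order;
        is_miss: parallel booleans from cache simulation. Yields
        (reference index, distance in references) for the first reuse
        of every missed cache line."""
        pending = {}   # cache line -> reference index of the miss
        for i, (addr, miss) in enumerate(zip(refs, is_miss)):
            line = addr // LINE_SIZE
            if line in pending:
                yield i, i - pending.pop(line)
            if miss:
                pending[line] = i

    # array walk, 8-byte elements: a miss on the first access to each line
    refs = [i * 8 for i in range(8)]
    misses = [addr % LINE_SIZE == 0 for addr in refs]
    print(list(missed_line_reuse_distances(refs, misses)))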

In order to model delayed hits in the statistical simulation framework, slight modifications need to be made to the synthetic trace simulator. This is illustrated in Figure 2. Consider the program trace shown on the left; we have a load miss x to cache line A followed by a load hit y to the same cache line. There are three possible scenarios that need to be modeled in the synthetic trace simulator:

(1) load x has finished its execution when load y is issued. Load y then gets assigned the L1 access latency. This scenario is accurately modeled in existing statistical simulation frameworks.

(2) load x is still executing when load y is issued. Load y then gets assigned the remaining execution latency of load x.

(3) load x is not yet executing when load y is issued; this is possible because of out-of-order execution. Load y is then turned into a cache miss and thus gets assigned the next cache level's access latency. Load x, which is issued later on, then gets assigned the remaining execution latency of the resolving load y.

The two latter scenarios need special support in the synthetic trace simulator for accurate memory data flow modeling.
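The three scenarios can be sketched as follows (our illustration, not the authors' simulator; the L1/L2 latencies are the baseline values from Table 3):

    # delayed_hit.py: delayed hit latency assignment sketch
    L1_LAT, L2_LAT = 2, 20

    class LinePort:
        """Tracks the outstanding miss for one cache line."""
        def __init__(self):
            self.miss_done = None   # cycle at which the miss completes

        def issue(self, now, label):
            """label: 'miss' for the synthetic miss, 'delayed' for the
            reuse of the same line. Returns the load's latency."""
            if self.miss_done is None:
                # no outstanding miss: whichever load issues first
                # becomes the miss (covers scenario 3 when the 'delayed'
                # load issues before the 'miss' load)
                self.miss_done = now + L2_LAT
                return L2_LAT
            if now >= self.miss_done:
                return L1_LAT                # scenario 1: miss finished
            return self.miss_done - now      # scenarios 2 and 3: remainder

    port = LinePort()
    print(port.issue(100, 'miss'))     # 20 cycles (the miss itself)
    print(port.issue(107, 'delayed'))  # 13 cycles (scenario 2, cf. t107)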

5. EXPERIMENTAL SETUP

We use SimpleScalar/Alpha v3.0 in our experiments, and we use Wattch [3] for estimating energy consumption, which will be used for searching the most energy-efficient microarchitectural configuration in a large design space. The benchmarks along with their reference inputs used in this study are the SPEC CPU2000 benchmarks, see Table 2. The binaries of these benchmarks were taken from the SimpleScalar website.2 We considered single (and early) 100M-instruction simulation points as determined by SimPoint [22, 23] in all of our experiments, unless stated otherwise, i.e., we also evaluate statistical simulation on longer-running 10B instruction sequences.

benchmark   input       simpoint
bzip2       program     9
crafty      ref         0
eon         rushmeier   18
gap         ref         2,094
gcc         166         99
gzip        graphic     9
mcf         ref         316
parser      ref         16
perlbmk     makerand    1
twolf       ref         31
vortex      ref2        57
vpr         route       71
ammp        ref         2,130
applu       ref         18
apsi        ref         46
art         ref-110     67
equake      ref         194
facerec     ref         136
fma3d       ref         298
galgel      ref         3,150
lucas       ref         35
mesa        ref         89
mgrid       ref         6
sixtrack    ref         82
swim        ref         5
wupwise     ref         584

Table 2: The SPEC CPU2000 benchmarks, their reference inputs and the single 100M simulation points used in this paper.

The processor models we use in this paper are given in Table 3; the baseline configuration is shown along with eight other configurations. These configurations vary in their processor core, branch predictor and memory hierarchy. The reason for considering multiple configurations is to evaluate the performance prediction accuracy of statistical simulation over multiple points in the design space.

                          baseline  config 1  config 2  config 3  config 4  config 5  config 6  config 7  config 8
ROB/LSQ                   128/32    32/16     32/16     32/16     32/16     128/64    128/64    128/64    128/64
processor width           8         4         4         4         4         8         8         8         8
I-cache                   8KB       8KB       8KB       32KB      32KB      8KB       8KB       32KB      32KB
D-cache                   16KB      16KB      16KB      64KB      64KB      16KB      16KB      64KB      64KB
L2 cache                  1MB       1MB       1MB       4MB       4MB       1MB       1MB       4MB       4MB
latencies (L1/L2/MEM)     2/20/150  2/20/300  2/20/300  4/30/300  4/30/300  2/20/300  2/20/300  4/30/300  4/30/300
entries in I-/D-TLB       32/32     32/64     32/64     64/128    64/128    32/64     32/64     64/128    64/128
hybrid branch predictor   8K-entry  2K-entry  8K-entry  2K-entry  8K-entry  2K-entry  8K-entry  2K-entry  8K-entry

Table 3: Processor models used in this paper.

6. EVALUATION

We now evaluate our approach to memory data flow modeling in the context of statistical simulation. We first quantify the simulation speed of our improved statistical simulation framework. We then quantify the performance prediction accuracy and how it improves through accurate memory data flow modeling. We subsequently measure how well the improved statistical simulation approach can predict performance trends, i.e., we evaluate the relative accuracy and its ability to drive design space explorations. Finally, we also quantify the storage requirements of the statistical profiles.

2 http://www.simplescalar.com

6.1 Simulation speed

As stated before, an important feature of statistical simulation is its simulation speed. Performance characteristics quickly converge to a steady-state value due to the statistical nature of the approach. To demonstrate the fast simulation speed, we have done the following experiment. We generated 20 synthetic traces from a single statistical profile using 20 random seeds in the synthetic trace generator. We then compute the CoV (coefficient of variation), which is defined as the standard deviation divided by the mean IPC value over those 20 synthetic traces. The CoV we observe is less than 1% for all benchmarks. We thus conclude that statistical simulation indeed is a very fast simulation technique; 1M instruction traces are sufficient for obtaining converged performance predictions.
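The CoV computation itself is straightforward; a minimal sketch with illustrative per-seed IPC values (not measured data):

    # cov.py: coefficient of variation over per-seed IPC values
    import statistics

    def cov(ipcs):
        return statistics.stdev(ipcs) / statistics.mean(ipcs)

    ipc_per_seed = [1.12, 1.13, 1.11, 1.12, 1.14]  # illustrative values
    print('CoV = %.2f%%' % (100 * cov(ipc_per_seed)))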

6.2 Performance prediction accuracy

We now evaluate the performance prediction accuracy for statistical simulation enhanced with memory data flow modeling.

6.2.1 Baseline configuration

Figures 3 and 4 show the percentage performance prediction error for the baseline processor configuration for the integer and floating-point benchmarks, respectively. The IPC prediction error is computed as

IPC prediction error = (IPC_stat_sim − IPC_det_sim) / IPC_det_sim,

with IPC_stat_sim and IPC_det_sim the IPC for statistical simulation and detailed simulation, respectively. A positive error reflects an overestimation, whereas a negative error reflects an underestimation. Figures 3 and 4 show six bars per benchmark:

• The prior work bar corresponds to previously proposed state-of-the-art statistical simulation approaches; this is the statistical simulation framework including the SFG as described in [6];

• The second bar corresponds to the SFG enhanced with load forwarding;

[Figure 3: bar chart of the IPC prediction error (y-axis, −10% to 20%) for each SPECint2000 benchmark and the average, with six bars per benchmark: prior work, load forwarding, cache miss correlation, delayed hits, all without the conditional distribution, and memory data flow modeling; three off-scale bars are annotated 68%, 69% and 75%.]
Figure 3: IPC prediction error for SPECint2000 and the baseline processor configuration: evaluating the accuracy of the proposed memory data flow modeling.

[Figure 4: bar chart of the IPC prediction error (y-axis, −20% to 20%) for each SPECfp2000 benchmark and the average, with six bars per benchmark: prior work, load forwarding, delayed hits, cache miss correlation, all without the conditional distribution, and memory data flow modeling; off-scale bars are annotated 35%, 42%, 44%, −34%, −22% and −32%.]
Figure 4: IPC prediction error for SPECfp2000 and the baseline processor configuration: evaluating the accuracy of the proposed memory data flow modeling.

• The third bar shows the SFG enhanced with cache miss correlation;

• The fourth bar shows the SFG enhanced with delayed hit modeling;

• The fifth bar shows the SFG enhanced with all three enhancements: delayed hits, load forwarding and cache miss correlation; however, the missed cache line reuse distance distribution is not measured conditionally on the cache miss correlation info;

• The final bar shows the SFG enhanced with memory data flow modeling. This includes delayed hits, load forwarding and cache miss correlation; in addition, the missed cache line reuse distance distribution is measured conditionally on the cache miss correlation info.

Several interesting observations can be made from these graphs. First, the impact of modeling load forwarding is rather small. Previous work, as mentioned before, assumes that a load never aliases with a preceding store; a load can thus execute as soon as its source operands are available. When modeling load forwarding, the IPC prediction for statistical simulation can either increase or decrease. The IPC prediction increases when load instructions only see a one-cycle execution latency, getting the store's data from the store buffer or load/store queue; going to the L1 cache takes 2 cycles in our setup. The IPC prediction decreases when a load has to wait for the prior store to be executed. This explains the small changes in IPC prediction error due to modeling load forwarding. Note that in our simulations we do not account for a performance penalty in case the store-load dependency is violated and instructions need to be re-issued. In case a performance penalty needs to be accounted for, the importance of modeling load forwarding is likely to increase. A second observation we can make is that modeling cache miss correlation makes a big difference for a number of benchmarks, see for example gcc, applu, galgel and wupwise. For these benchmarks, performance prediction accuracy is greatly influenced by modeling cache miss correlation. Third, modeling delayed hits also decreases the prediction error. The benchmarks that benefit the most from delayed hit modeling are mcf, twolf, ammp, equake, facerec and swim. A fourth observation is that several benchmarks benefit from modeling the cache line reuse distribution conditionally on the cache miss correlation information; examples are twolf, applu, and lucas. When putting it all together, see the rightmost bars in Figures 3 and 4, the end result is a highly accurate statistical simulation framework. The average prediction error goes down from 10.7% for prior work3 to 2.3% in this paper; the average errors in this paper are computed from absolute errors. The maximum error is observed for ammp (12.7%), which is substantially lower than the high errors observed for prior statistical simulation approaches, see for example mcf (68%). (Note that even without the outlier mcf the average IPC prediction error goes down from 8.5% to 2.3%.)

3 Note that this average error is higher than the error reported in [6]; this is because [6] only considered a subset of the SPEC CPU2000 benchmarks.

[Figure 5: per-benchmark IPC prediction error (y-axis, −3% to 6%) for the 10B instruction sequences across the SPEC CPU2000 benchmarks.]
Figure 5: IPC prediction errors for 10B instruction sequences.

[Figure 6: average IPC prediction error (y-axis, 0% to 18%) for the baseline and the eight processor configurations, with four bars per configuration: SPECint prior work, SPECfp prior work, SPECint memory data flow modeling, SPECfp memory data flow modeling.]
Figure 6: Average IPC prediction errors for the eight processor configurations from Table 3, for prior work [6] and for statistical simulation with memory data flow modeling.


6.2.2 Long instruction sequences

All of the above results were obtained on relatively short 100M instruction sequences using SimPoint. Figure 5 shows the IPC prediction error for the baseline configuration on 10B instruction sequences (after skipping the first 1B instructions). These results show that statistical simulation with enhanced memory data flow modeling is also very accurate for long instruction sequences, with errors varying between −2% and 5.5%.

6.2.3 Other processor configurations

The IPC prediction errors discussed above are for the baseline processor configuration given in Table 3. We now show results for the other configurations mentioned in Table 3. Figure 6 shows average IPC prediction errors for the other eight processor configurations, and compares prior work against statistical simulation enhanced with memory data flow modeling as described in this paper. We observe that the errors reduce drastically through accurate memory data flow modeling. The average IPC prediction error for previous work varies between 7.3% and 16.2% depending on the processor configuration; without mcf, the average error varies between 5.1% and 10.6%. With accurate memory data flow modeling the average error is smaller than 4.1%.

6.2.4 Impact of the order of the SFG

Figure 7 quantifies the impact of the SFG's order k on the IPC prediction error. This graph shows the average IPC prediction error over the various benchmarks for different values of k. We observe that, as expected, the IPC prediction error decreases with increasing k. There are basically two reasons why the accuracy improves with increasing k. For one, as stated in [6], a higher order SFG incorporates path information into the statistical profile. Benchmarks for which program characteristics correlate well with this path information benefit from this. However, this effect is rather limited, as discussed in [6]. A second reason is that a higher order SFG also implies that a longer global cache miss history is taken into account for modeling the cache miss correlation, as explained in section 4. It is also interesting to note that previous work [6] concluded that a first-order SFG is sufficient for accurate performance modeling; higher-order SFGs do not yield better accuracy. The results presented here show that this is no longer true when cache miss correlation is considered in conjunction with higher-order SFGs. Figure 7 shows that the error stabilizes between k = 8 and k = 10; all the other results presented in this paper are for k = 10.

[Figure 7: average IPC prediction error (0% to 7%) as a function of the order k of the SFG, for k = 3 through 11.]
Figure 7: IPC prediction error as a function of the SFG's order k; this is for processor configuration 5.

6.3 Sensitivity to cache hierarchy parameters

We now evaluate the ability of the proposed statistical memory data flow model to track performance differences between cache hierarchy designs. This is to show that the memory data flow model that we propose is accurate enough over a wide variety of cache hierarchy designs.

6.3.1 Cache line size

In our first experiment we vary the cache line size, see Figure 8. The average IPC is shown as a function of cache line size for both detailed and statistical simulation. In spite of the (small) absolute IPC prediction error, statistical simulation accurately tracks the relative performance differences.

[Figure 8: average IPC (1.02 to 1.16) for cache line sizes of 32, 64 and 128 bytes, for detailed simulation versus statistical simulation.]
Figure 8: Estimating IPC as a function of cache line size. This is for processor configuration 7.

[Figure 9: IPC (1.00 to 1.07) for the four combinations of write-back/write-through and write-allocate/write non-allocate, for detailed simulation versus statistical simulation.]
Figure 9: Estimating IPC under four cache line updating policies. This is for processor configuration 7.

6.3.2 Cache line updating

In our second experiment we consider two cache write policies (write-back and write-through) and two cache allocation policies (write-allocate and write non-allocate). Figure 9 shows the average IPC over all the SPEC CPU benchmarks for the four cache line updating policies, as obtained through detailed simulation and through statistical simulation enhanced with memory data flow modeling. We conclude that statistical simulation is accurate enough for tracking the small performance differences.

6.3.3 Varying the number of MSHRs

Figure 10 shows IPC as a function of the Miss Status Holding Register (MSHR) configuration for the art benchmark; we observed similar results for the other benchmarks. Both the number of MSHR entries and the number of targets per entry are varied. Statistical simulation is able to accurately track the performance differences caused by a varying MSHR configuration.

6.3.4 Varying the size of the store buffer

Figure 11 shows IPC as a function of the number of entries in the store buffer for the applu benchmark. The store buffer holds completed stores that still need to be retired, i.e., the value still needs to be written to the memory hierarchy although the store is already architecturally completed. Again, statistical simulation is capable of accurately tracking performance differences with varying store buffer sizes.

[Figure 10: IPC (0.00 to 0.50) for MSHR configurations from 4-4 to 16-16 (number of entries - number of targets), for detailed simulation versus statistical simulation.]
Figure 10: Varying the MSHR configuration for art.

[Figure 11: IPC (0.50 to 0.70) for 8, 16, 32 and 64 store buffer entries, for detailed simulation versus statistical simulation.]
Figure 11: Varying the number of store buffer entries for applu.

6.4 Design space exploration

We now evaluate the ability of statistical simulation to identify the optimal design point in a given design space. The optimal design point is defined here as the design point with the minimum energy-delay-square product (ED2P), i.e., ED2P = CPI^2 × EPI, which is an appropriate metric for quantifying energy efficiency in high-end server processors [2]. The design space is built up by varying the ROB size from 8 to 256 entries; the LSQ is varied from 4 to 256 entries (with the additional constraint that the LSQ is never larger than the ROB); and the processor width (decode, dispatch, issue and commit) is varied from 2 to 8 wide. Note that all of these statistical simulations are run from a single statistical profile using 1M-instruction synthetic traces; this is more than 100X faster than the detailed simulation using 100M simulation points in our experimental setup. The optimal design points identified through statistical simulation with enhanced memory data flow modeling exactly matched the optimal design points identified through detailed simulation for 20 out of the 26 benchmarks; for the other 6 benchmarks, the optimal design point identified through statistical simulation was within 3% of the optimum identified through detailed simulation. This is far more accurate than what is obtained through statistical simulation without the enhanced memory data flow modeling, i.e., prior work. For 11 benchmarks, prior work misses the optimal design point, and for four of these benchmarks the deviation is fairly large: art (4.5%), mcf (5.3%), bzip2 (5.7%) and equake (14.6%). As such, we conclude that statistical simulation enhanced with accurate data flow modeling is highly accurate (and significantly more accurate than prior work) in identifying a region of (near-)optimal design points in a large design space.
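The ED2P-based selection can be sketched as follows (our illustration; the CPI and EPI values are made up, in practice they would come from the statistical or detailed simulations):

    # ed2p.py: pick the most energy-efficient design point
    def ed2p(cpi, epi):
        """Energy-delay-square product: ED2P = CPI^2 x EPI [2]."""
        return cpi ** 2 * epi

    def best_design(points):
        """points: iterable of (config, cpi, epi) triples. Returns the
        config with the minimum ED2P."""
        return min(points, key=lambda p: ed2p(p[1], p[2]))[0]

    # illustrative (config, CPI, EPI) triples for three design points
    space = [('rob64_w4', 0.9, 14.0), ('rob128_w8', 0.7, 21.0),
             ('rob256_w8', 0.68, 26.0)]
    print(best_design(space))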

[Figure 12: average statistical profile disk space (0 to 7 MB) as a function of the order k of the SFG, for k = 3 through 11.]
Figure 12: Average disk space requirements for storing the statistical profiles in MB as a function of the order k of the SFG.

6.5 Storage requirements

Figure 12 shows the average size of the (compressed) statistical profiles in MB as a function of the order k of the SFG; these are average numbers over all the benchmarks. For k = 8 and k = 10, the average statistical profile requires 4.5MB and 5.8MB of disk space, respectively. We thus conclude that the storage requirements are small.

7. CONCLUSION

Designing a new microprocessor is extremely time-consuming because of the large number of simulations that need to be run during design space exploration. On top of that, every single simulation run takes days or even weeks if complete benchmarks need to be simulated. Statistical simulation is a fairly recently introduced approach that could help to reduce the time-to-market of new microprocessors. Statistical simulation is a very fast simulation technique that only requires on the order of a million instructions per benchmark to make an accurate performance estimate. As such, statistical simulation is a useful tool to cull a huge design space in limited time; a small region of interest identified through statistical simulation can then be further analyzed through more detailed and slower simulation runs.

Previous work on statistical simulation, however, considered simple memory data flow models. In this paper we proposed to more accurately model memory data flow and we showed how to do that. We model delayed hits, load forwarding and cache miss correlation. Our experimental results using the SPEC CPU2000 benchmarks show that significant reductions in IPC prediction errors are obtained by more accurately modeling memory data flow characteristics. For our baseline configuration we reported a reduction in the average IPC prediction error from 10.7% down to 2.3%. We also showed that the variation in IPC prediction errors across different microarchitectures is significantly smaller when memory data flow is modeled. In addition, performance trends are predicted more accurately, which is extremely important for design space exploration purposes.

Acknowledgements

Lieven Eeckhout is supported by the Fund for Scientific Research—Flanders (Belgium) (FWO—Vlaanderen). This research is also supported by Ghent University, the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the HiPEAC Network of Excellence and the European SCALA project No. 27648.

8. REFERENCES

[1] R. Bell, Jr. and L. K. John. Improved automatic testcase synthesis for performance model validation. In ICS'05, pages 111-120, June 2005.
[2] D. Brooks, et al. Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors. IEEE Micro, 20(6):26-44, November/December 2000.
[3] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In ISCA-27, pages 83-94, June 2000.
[4] R. Carl and J. E. Smith. Modeling superscalar processors via statistical simulation. In Workshop on Performance Analysis and its Impact on Design (PAID-98), held in conjunction with ISCA-25, June 1998.
[5] T. M. Conte, M. A. Hirsch, and K. N. Menezes. Reducing state loss for effective trace sampling of superscalar processors. In ICCD-96, pages 468-477, Oct. 1996.
[6] L. Eeckhout, R. H. Bell Jr., B. Stougie, K. De Bosschere, and L. K. John. Control flow modeling in statistical simulation for accurate and efficient processor design studies. In ISCA-31, pages 350-361, June 2004.
[7] L. Eeckhout and K. De Bosschere. Hybrid analytical-statistical modeling for efficiently exploring architecture and workload design spaces. In PACT-2001, pages 25-34, Sept. 2001.
[8] L. Eeckhout, S. Nussbaum, J. E. Smith, and K. De Bosschere. Statistical simulation: Adding efficiency to the computer designer's toolbox. IEEE Micro, 23(5):26-38, Sept/Oct 2003.
[9] S. Eyerman, L. Eeckhout, and K. De Bosschere. Efficient design space exploration of high performance embedded out-of-order processors. In DATE'06, pages 351-356, Mar. 2006.
[10] K. I. Farkas and N. P. Jouppi. Complexity/performance tradeoffs with non-blocking loads. In ISCA-21, pages 211-222, Apr. 1994.
[11] C. Hsieh and M. Pedram. Micro-processor power estimation using profile-driven program synthesis. IEEE TCAD, 17(11):1080-1089, Nov. 1998.
[12] V. S. Iyengar and L. H. Trevillyan. Evaluation and generation of reduced traces for benchmarks. Technical Report RC 20610, IBM Research Division, T. J. Watson Research Center, Oct. 1996.
[13] V. S. Iyengar, L. H. Trevillyan, and P. Bose. Representative traces for processor models with infinite cache. In HPCA-2, pages 62-73, Feb. 1996.
[14] M. Johnson. Superscalar Microprocessor Design. Prentice Hall, 1991.
[15] T. S. Karkhanis and J. E. Smith. A first-order superscalar processor model. In ISCA-31, pages 338-349, June 2004.
[16] A. J. KleinOsowski and D. J. Lilja. MinneSPEC: A new SPEC benchmark workload for simulation-based computer architecture research. Computer Architecture Letters, 1(2):10-13, June 2002.
[17] D. Kroft. Lockup-free instruction fetch/prefetch cache organization. In ISCA-8, pages 81-87, May 1981.
[18] S. S. Mukherjee, S. V. Adve, T. Austin, J. Emer, and P. S. Magnusson. Performance simulation tools: Guest editors' introduction. IEEE Computer, 35(2):38-39, Feb. 2002.
[19] D. B. Noonburg and J. P. Shen. A framework for statistical modeling of superscalar processor performance. In HPCA-3, pages 298-309, Feb. 1997.
[20] S. Nussbaum and J. E. Smith. Modeling superscalar processors via statistical simulation. In PACT-2001, pages 15-24, Sept. 2001.
[21] M. Oskin, F. T. Chong, and M. Farrens. HLS: Combining statistical and symbolic simulation to guide microprocessor design. In ISCA-27, pages 71-82, June 2000.
[22] E. Perelman, G. Hamerly, and B. Calder. Picking statistically valid and early simulation points. In PACT-2003, pages 244-256, Sept. 2003.
[23] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In ASPLOS-X, pages 45-57, Oct. 2002.
[24] R. A. Sugumar and S. G. Abraham. Efficient simulation of caches under optimal replacement with applications to miss characterization. In SIGMETRICS'93, pages 24-35, 1993.
[25] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In ISCA-30, pages 84-95, June 2003.