25
] , COORDLNATED SC1ENCE LABORATORY College of En_ir, eering m !D AD-A222 808 ,| r | SINGLE-PASS I MEMORY SYSTEM ..l EVALUATION FOR MULTI PROGRAMMIN G -m WORKLOADS I Thomas M. Conte I Wen-reel W.__ _ DTIC I ILltELECTEiI_ g vD_ ;N UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN https://ntrs.nasa.gov/search.jsp?R=19900017242 2018-11-18T16:07:17+00:00Z

College of En ireering - NASA · College of En_ir,eering m!D AD-A222 808,| r | SINGLE-PASS I MEMORY SYSTEM..l EVALUATION FOR MULTI PROGRAMMIN G ... For example, a cache of dimep_ion

Embed Size (px)

Citation preview

]

, COORDLNATED SC1ENCE LABORATORYCollege of En_ir,eering

m

!D AD-A222 808

,|r

| SINGLE-PASSI MEMORY SYSTEM..l EVALUATION FOR

MULTI PROGRAMMIN G-m WORKLOADS

IThomas M. Conte

I Wen-reelW.__ _ DTICI ILltELECTEiI_

g vD_

;N UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

1990017242

https://ntrs.nasa.gov/search.jsp?R=19900017242 2018-11-18T16:07:17+00:00Z

_cumrv CLASSttCAI_ONOI_ T,_S oAGE- In I II _1 _ II I L i i-

I REPORTDOCUMENTATIONPAGE

i Unclasslf _-ed _None OF 'l

2a. SECURITYCLASSIF_ATK_NAUTHORITY 3. DISTRIBUTIONIAVAIUUIIUTY _EPORT

none Approved for public rele---ae;zb. OECU_S,FW.ATIO'NJOO_OIN¢ _04_O_LE distribution unlimited

I none ..........

4. PERFORMINGOI_auNIZATI_N REPORTNUMBER(S) S. _NtTOR|_ (_I_I_T_N RE_O_T NUM_ER(S_ I _,/

UILU-ENG-90-2214 CSG-122 none

i e,. N,_'_E i 6b OFFK:E 7_. _'U| OF MONnOm.._...OF SYMBOL

III

ORGANIZATION ORGANIZ.ATION--

Coordinated Science Lab 1 (NN_A_o_¢_} NSF, NCR, NASA, ONRUniversity of Illinois _ _

I -- " I

ADDRESS(Cs_, _m _ Z_#¢o_) 7b' ADDRESS(C_, $lael, _ ZIPC_)NSF:1800 G Street, Washington, DC 20552

Ii01 W. Springfield Avenue NCR:Personal Computer Div.-Clemson

Urbana, 1L 61801 1150 Anderson Dr., Liberty, SC 29657

I I • ii i i ........ i i_hl.NAME C._ FUNDING/SPONSORING lSb.OFFICESYMBOL

same as 7a. I N/A . NASA: NASA NAG 1-613 ONR:N00014-88-K-0656

I _ ADDRESS (O_, S t_, J_ Z_PC_ I_0. _OU_ OF FUNDING NUMBERS

same as 7b ELEM£NTNO. . CCESSIONNO.

I 11. TITLE _l_lude _I_ C,_rsification) , I

i Single-Pass Memory System Evaluation For Multiprogramming Workloadsii i i __ i III l i i

"z PERSONALAUT,OR(S)Conte, Thomas M. Hwu, Wen-mei W.

im n i i _____ r

I m'_.. TY,E OF REPORT 11 ,b° TIME COVERED ,I' 4" D']_ OF '_PORT1990 May _'_' _ _" I S" P_ COUNTTechnical | FROM I ' TO _ r _ i

1E.SUPPLEMENTARYNOTAT:ONnone

I- I ili Ill _ __ __11

17. CosAn CODES tB SUBJECTTERMS(Condne_ ,offn i_ejf _ #rid /denb_ _ _ e_ml_er)FIELD GI_OUP I SUB-GROUP ] memory system, cache per;o[-mance, staCK--Dased methoa

I / 1 I multipr°grammlng

I ;_ ABST_Cr(Co,,_ _ ,M_, _,_.,_;_ _,- _ _ ,_ _ . _ .' .Moderc memory systems are composes o. Levels o! cacne memorles, a vlrcUal memory sysEem, anu

I a backing store. Varying more than a few aeslgn parameters measuring performanceand the of

such systems has traditionally be constrained by the high cost of simulation. Models of cach_

performance recently introduced reduce the cost simulation but at the expense of accuracy of

I performance prediction. Stack--based methods predict performance accurately using one passover the trace for all cache sizes, but these techniques have been limited to fully-=

associative organizations. This paper presents a stack--based method of evaluating the per-

I formance of cache memories using a recurrence/conflict model for the miss ratio. Unlikeprevious work, the performance of realistic cache designs, such as dlrect-mapped caches, are

predicted by the method. The method also fnclades a new approach to the problem of the

effects of mult_programming. This new technique separates the characteristics of the Indl-

J vidual from that of the workload. The recurrence/conflict method is _hown to 5eprogram

practical, general, and powerful by comparing its performance to that of a Fopular traditlona

cache simulator. The authors expect that the availability of such a tool will have a large/

I I II I20. DISTRIBUTIONI AV/_LABIUTYOF ABSTRACT !21. ABSTRACTSECURII'YCLASSIFICATION

IXI UNCLASSIFIED/UNLIMITED [] SAMEASRPT. O o'r__E_SI Unclass:Lt :Led ,,,-_'

- i I22a. '/_ME OF R,SPONSIBLEINOIVIDUAI. llb. TELEPHONE(/ndud_ A_a Code) llc. OFFICESYMBOL

I .... i

IDO FORM 1473, M MAR ,i 8"3APR_l_on may Ixt useduntdexhaust_. SECURITYClASSiFICATIONOF THIS PAGEAll other editionsam obsolete.

i mlCLASSIFIED

1990017242-002

UNCLASSIFIEDM_'u_YY [email protected] O@?RI8 PAO|

i • II li I | I i • II

I7b. NASA Langley Research Center, Hampton, VA 23665

Office of Naval Research, 800 N. Quincy, Arlington, VA 22217 i@ I

ie

":-" | I I i ii_ i

_ UNCLASSIFIEDlll[_5111ilr_ ¢L,AM|P1CA11OII OF THIS ID&(iI[

I

1990017242-003

!!I

Single-Pass Memory System Evaluation For Multiprogramming Workloads

!Thomas M. Conte Wen-rneiW. Hwu

lI Center for Reliable and High-Performance Computing

i University of Illinoishwu@csg, uiuc. edu

!!

T

!!|

I Aoools'.o.to, _-_I DTIC TABUnanno_,lcedJust 1,*IcatIon

By....

/ % ...., D11st!Ibutlon/

"_G_ Availablllty Code°_lat [ Special

!

1990017242-004

!I Single-Pass Memory System Evaluation For Multiprogramming Workloads

!I Abstract

Modern memory systems are composed of levels of cache memories, a virtual mem-

ory system, and a backing store. Varying more than a few design parameters and

I measuring performance systems traditionally constrained by thethe of such has be

high cost of simulation. Models of cache performance recently introduced reduce thecost simulation but at the expense c" accuracy of performance prediction. Stack-based

I predict performance accurately using one pass over the trace for all cachemethods

sizes, but these techniques have been linfited to fully-associative organizations. This

paper presents a stack-based method of evaluating the performance of cache memories

t model for the miss ratio. Unlike theusing recurrence/conflicta previous work, perfor-mance of realistic cache designs, such as direct-mapped caches, are predicted by themethod. The method also includes a new approach to the problem of the effects of

I multiprogramming, new technique separates the individualThis the characteristics of

program from that of the workload. The recurrence/conflict method is shown to bepractical, general, and powerful by comparing its performance to that of a popular

t traditional cache simulator. The authors that the of such tool willexpect availability a

have a large impact on future architectural studies of memory systems.

!1 Introduction

!Because of the role they play in the design of cost-effective memory system_, cache memories

I have occupied a special place in research into computer architecture. In 1986, Smith compiled

I a bibliography of 380 papers on the topic covering fifteen years of research [1]. A majority of

these papers have focused on the performance evaluation for the design of c_.che memories.

!Some papers have evaluated cache performance as compared to other alternatives [2, .3]. In

I either case, cache design and evaluation is largely an empirical procedure. A benchmark

I set is selected and it is used to evaluate the cache performance for a design space. Theperformance evaluation technique of preference has been simulation. However, simulation is

I costly, limiting the .4esign space and the number of benchmarks the designer can consider.

I Two directions have been undertaken in the literature into alternate cache performance

1990017242-005

!evaluation methods. An analytical approach was introduced by Denning in [4]. A recent J

example of an analytical approach is presented by Agarwal, et. al. [5] and has been used for I

design space exploration by Przybylski et. al. [6].B

An alternative to an an_iy'Jcal model is a hybrid approach described collectively one- I

pass or stack algorithms. These were introduced by Mattson et. al. in [7]. They function I

by exploiting properties of stacking replacement policies to evaluate all fully-associative

!cache sizes in one pass over the trace. A recent example of the evolution of these ideas is

in Thompson and Smith [8], where one-pass algorithms are presented for fully-associative t

_ buffers for realistic policy decisions such as write back and sector mapping. Traiger and II

Slutz [9] present a method that addresses various levels of set associativities and blockam

sizes in one pass, but the amount of collected information required to reconstruct the cache I

performance is large. Due to this large storage requirement, their technique is impractical. I

Even when using the trusted simulation techniques for evaluation of cache memories, the

!issue of approximating operating system effects is troublesome. Multiprogramming has the

_m

effect of partially- or completely flushing a buffer at arbitrary instances during execution.

One approach to this problem used in [10] was to systematically flush the cache at fixed II

intervals. Using this technique, the designer can randomly insert context switches into a

simulation, but to get stable results requires increasing significantly the number of simu-

lations. Also, it has been shown that assuming fixed context switching intervals is overly

optimistic [11, 12]. Another approach is to estimate the cache performance using cold-start

miss ratios, but an assumption of no-saved context after a context switch has been shown to

not be true for large caches [12, 10]. Combining several reference streams into a stochastically

merged stream solves this problem, but at the cost introducing workload choice (e,g., sets

2

1990017242-006

!

I of benchmarks) into the evaluation problem [13]. Lastly, none of these approaches consider

I separately the voluntary context switching that occurs when a program makes a request of

the operating system.

!This paper presents the recurrence/conflict method of evaluating the performance for

I fully-associative, set-associative and direct mapped cache organizations for all cache and

I block sizes exactly using one pass over the trace. This method is based on the work ofMattson et. al., Traiger and Slutz, and Thompson and Smith, but alter3 the statistics col-

I lected to accommodate a new model for miss ratio calculation [7, 8, 9]. This model for the

I miss ratio reduces substantially the traditional storage requirements of the collected infor-

mation, making the method practical. A portion of the, information collected can be used to

!reconstruct multiprogramming effects due to both voluntary and involuntary (preemptive)

I context switching. Preemption frequency and partial flushes of the buffer are parameter-

i ized to separate workload considerations from bencbmark considerations. To evaluate themethod's practicality, the run time of its implementation is compared against a popular

I traditional cache simulator. Results from a set of benchmarks are presented to demonstrate

I the method's operation.

I 2 A Method for Memory System Evaluation

I A cache memory is a familiar concept. The dimension of a cache can be expressed as a three-

I tuple, (C, B, S), for a cache of size 2c byte_, with block size 2s blocks, and 2s blocks for eachassociativity set Note that C >_B + S. For example, a cache of dimep_ion (10, 6, 1) is a 1KB

I direct-mapped cache with a block size of 64 bytes. A cache of dimension (21, 10, tl) is of

I size 2MB with 1KB-length blocks and it is fully-associative (such a cache models a modern

1990017242-007

!virtual memory system). The notation (C, B, co) is an abbreviation for fully-associative I

caches (5' = C - B). I

Common metrics of cache performance are miss ratios and traffic ratios. One method ofm

calculating the miss ratio, p, is to count the number of instances that a miss occurred in N I

references. This number is the miss count, M, and the miss ratio is then, I

M

(1) I

The traffic ratio, a, a measure of the traffic on the memory bus generated by the cache, can I

be expressed as a = 2Bp. II

2.1 The recurrence/conflict model IIll

Because the traffic ratio is derived from the miss ratio and the block size, the miss ratio

suffices to characterize the performance of a cache memory. One method of calculating p is

to use Equation 1. Another method of calculation is based on the observation that all hits

occur due to recurring references. For example, consider the following string of references:

RefL ce number 1 2 3 4 5 6 7 8

Address 10G 101 102 103 102 191 104 101

References to addresses 100, 103, and 104 occur only once and result in a miss regardless

_., of the cache organization. References to 101 and 102 occur more than once and hence _

have potential for a hit, dependent upon the cache organization. Such references are called

recurring references, and there are three of them in the example (references 5, 6, and 8). The

references between recurring references are termed the intervening references. For example,

4

1990017242-008

!I references 3, 4 and 5 are the intervening references between the first recurring reference to

I address 101. There is a chance that for a given cache organization the intervening references

will remove the recurring reference from the cache, resulting in a miss instead of a hit. Such a

I situation is termed a conflict. If there are R instances of recurring references and K instances

I of conflicts, the miss ratio can be expressed as,

I p=l R-KY (2)

I This expression is termed the recurrence/conflict model for the miss ratio.

i The method of calculating the miss ratio for a large class of cache organizations accuratelyis based on the recurrence/conflict model. The method involves the calculation of two arrays,

I r[B] and a[C, B, S], using one pass over the reference string. An algorithm to perform these

I computations is presented in Figures 1 and 2. The procedure, recurrence_conflict (a), is

applied in-turn to each referenced address. The array stack[B] is an array of stacks managed

I bytheroutinespush(.),topofstack(.),depth(.),andrepush(.).The itemskepton astack

I areaddresses.Notethetwofunctions,zero_out.2sb(.)and count_trailing.zeros(.)The

functionzero_out_Isb(a,B)returnsa withB leastsignificantbitssetto zero(i.e.,the

Iblockaddressofaddressa).The functioncount_trailing.zeros(.)returnsthenumber of

I trailing zeros in the binary representation of a number. The algorithm explores a predefined

I maximum design space, delimited by the parameters C_, Bm,_, and ,$'m_- Also, a special

column is maintained in _:[C][B][ S']for conflicts in fully-associative caches, _[C, B, ¢o]. Note

I that any conflict that occurs in a cache of size (C, B, S), also occurs in a cache of size

I (C - I,B,S). The while loopofprocedureprocess_cycle(Figure2)implementsth,s

observation.The missratiofora cacheofdimension(c,b,s)canbecalculatedfromr[B]and

!5

!

1990017242-009

!recurrence_conflict(a, voluntary_cs): Ibegin

N_N + I ifor B_-0 to Bm_ do |begin

block_addr_zero_ovt..l sb(a, B) mifon_stack(stack[B],block_addr)then

d_depth(stack[B],block.addr)

process_cycle(B,d, block_addr) •repush(atack[B],block_addr)

else

push (stack[B],block_addr) Iend i

¢s_coung(block.addr)_--I

unmark_volunt ary_cs (block_ddr) Iif voluntary.cs then sark_voluntary.cs(block.addr)

end

end I

Figure 1: Driver routine for the recurrence/conflict algorithm.

[q[BIF1using Equation 3.

1 Cffi.ffi

p(c,b,s)-1- _ (r[b]-_ g[j,b,s]). (3) -j_arC

Sincetherecurrence/conflictalgorithmisbasedon theLRU stackalgorithmpresentedby

Mattson, et. al. in [7], it is of complexity O(N lg N) on average. The space storage required

for the resulting r[B] and sIC, B, S] is B,,,_ + (Cm_ + 1) x (Bm_ + 1) x (Sm_ + 2) + 1

words(typically,a C languagelong).Also,two arrays,MI[C][B][S]end Mv[C][B][S]are

calculatedtoaccountformultiprogramming(seeSection2.2below).Hence,fora design

spaceofsizeCm_ - 31 (2GB),Bm_ ---12(IKB),end Sin,- 3 (upto8 ways,end fully

associativity), the space storage requirements are approximately 6K words. This allows the

statistics to be readily stored as a disk file. Since the arrays tend to be sparse., the disk file

issmallerthanthisupperbound,inpractice.

6

1990017242-010

!process_cycle(B, d, block_addr):begin

| fiB] .-. fiB] + 1if top_of_stack(stack[B]) - block_addr then return

Let a E stack[B] and depth(stack[B],a) = d - 1

cs_count (a) *-- cs_count (a) +cs_cotmt (blockaddr)if marked_voluntary-cs (_lock_addr) the:, mark_voluntary_cs (a)

num unique _ 0

l _ 0cs points

voluntary cs _- false

for a E stack[B] and depth(stack[B],a)< d do

begindist,-cotmt_trailing.zeros([o - . .ockaddr[)

• p[d£st]_--p[dist] + I| max_dist *-- max(max_dis%dist)

hUm_unique _-- nua_unique + i

I if marked_voluntary_cs(block.addr) then voluntary_cs _- trueca points _- cs_points + cs count(a)end

: _ Ft_achesize_ Llgnum_uniqueJ1 if FA_cachesize > Cm_ then FA_cachesize ._- Cmax

_[FA_cachesize, B, c¢] _-- _[FA_cachesize, B, oo] + 1

1 Mt[Fa_achesize][E]{S]_--M,[ZA_achesize][L][S]+ cs.poiats1 dist+-m_x.dist

l sum+-Ofor S_0 to S._ do

i begin while dirt >_ 0 and sum < 2 do

begin

I sua.--s_+ p[dist]dist.--dist - 1

end

I Cmin, conf _-- 0if sua > i then

Cmin,conf_---dist + S + 1i

.1 + 1end

M,[Cmi.,con/l[O][Sl.-M,[Cmi.,co.ll[SllS ]+ c,,_,ints1 if voltmtaty e-s then Mvtc...,co..vltSltSl.-,,tc., .,co./ltBltS]+ I

i_-2 x i

I end end

I Figure 2: The process.cycle procedure to calculater[8] and _[C, 8, S_.

7

!

1990017242-011

!2.2 Multipr-)gramrning effects I

Estimating the effects of multiprogramming on cache performance is a well-known prob- 1

lem [10]. Several techniques have been employed to approximate these effects. The ,.old

Imiss ratio vs. warm miss ratio technique was examined by Easton in [12, 14]. Ex_.mples

of statistical approaches can be found in [11, 13]. This paper presents a method based on I

the recurrence/conflict model. Multiprograxnming is divided into two categories: voluntary Icontext switching and involuntary context switching. These categories _e explained below.

IVoluntary context switching

IA process performs a voluntary context switch when the continuation of its execution depends

on a system service which may take a long time to finish. The frequency and timing of a I

voluntarycontextswitchissolelyacharacteristicofthebenchmark.The numberofprocesses I

executedbeforea processretarnsfroma contextswitchis,however,a functionofthesystem

Jloadand theoperatingsystemschedulingpolicy.Forexample,theworkingsetofa process'

may have been purged from the cache before it re-enters the run state after a context switch. I

This results in a degraded cache performance as compared to an ideal execution of the same Ibenchmark without _,nv context switching.

There are two pieces of information that are associated with context switching. One is I

the number of potential victirr_, defined as the number of non-conflicting recurring references I

which may be converted from a hit to a miss. This iniormation is a function of the benchmark

and the cache dimension. The method presented in this paper provides this information I

exactly. However, the fraction of the potential victims which are actually convert _ to mist's I

is a function of the system's load and the operating system's scheduling policy. Hence, thi_8

1990017242-012

!

I fractionismodeledasa parameter,_.The designercanvarytheparametervaluebetween

I 0% and 100%toexaminethechangesindesigndecisionsbasedon informationcollectedin

on_yonepass.Thisfeaturedist:.ng_lishesthismethodfrommostofthepreviousoneswhere

Ivaryingthisparameterrequires_ re-simulation.

I The totalnumberofpotentia!_rdCtimsofallvoluntarycontextswitchesisme_.,uredusing

i the recurrence/conflict methc 1. E_.ch volcutary context switch point is marked in the trace,- and the potential victims of this coLtext switch point are identified as those non-conflicting

I recurring refm-enc_s which occur acr)ss the context switch poir_t. Th':s is implemented by

I marking references that occur !r:_mediately before voluntary context switch points.

The array, Mv[C][B][S], is used to record the number of potential vict'ms of voluntary

!context s_';tching. The method of updating My[-] is included in Figure 2. If Mv[c][b][8] is

I equal to n at the eng o_ the execution, it indicates that for all caches (d, b, s), c' >_c, n of

i all';hehitscanb_,potenti.dlyconvertedtomissesdue tovoluntarycontextswitches.Givena p_rcentageofpreservedcontextacrosscontextswitches,_,one canexpect_ofind_n of

theb.!tstobe convertedtomisses.The missratiofora cacheofdimension(c,b,s)inthe

I presence of voluntary context switching becomes,

I p(c,b s)= 1-"N _R[b]- J=¢_g[jj[b][S] -_ j_=oMv[j][b][s]) . (4)

I Involuntary context switch

I Involuntarycontextswitchingoccursduetoexternaleventssuchastimer-implementedpre-emptionand I/O deviceinterrupts.The frequencyand occurranceofinvoluntarycontext

I =switchingis a function of the system load and the operating system's scheduling policy, but

I nota characteristicoftheprogram.Therefore,itisassumedthatan involuntary_'ontext

1990017242-013

!switch has an equal probability of occurring after any reference. With this assumption, the I

recurrence/conflict method derives the average number of potential victims, VI, due to each Iminvoluntary context switch. A parameter, Q, is defined as the effective quantum (average

I

preemptioninterval).Hence,N/Q isthetotalnumberofinvoluntarycontextswitchese_- I

pectedfortheentirereferencestring.Therefore,thetotalnumberofhitsthatareconve-ted I

tomissesis(_NVI)/Q.Like_,onecanvaryQ overanarbitraryrangetoobservetheimpact

of involuntary context switching frequency on the design decisions.

To derive the average number of potential victims due to each involuntary context switch,

one can sum the number of potential victims for all possible switching points in the reference

string and divide this sum by the number of possible switch points (N). This is given in the

following formula,

Vl= _ _(allswitchingpoints)Number ofpotentialvictimsfora switchpoint . (5)

By exchangingtherolesofth_contextswitchesand thepotentialvictims,Equation5 can

berewritter_inthefollowingform,

i

(E(allpotentialvictims)Number ofswitchingpointsaffectingpotentialvictim).vt=

(6)

Equati¢,n6 fitsnaturallyintotherecurrence/conflictmethod.

Due tothelargenumber ofcontextswitchingpointsinvolved,a counter,ca_count(.),is

keptforeachelementon thestack.Eachtimea new stackelementiscreated,thiscounter

issetto I. When a recurringreferenceisprocessed,thecontextswitchingcountofits

stackelementisaccumulatedintothatoftheelementaboveitbeforeitispromotedtothe

topofthestack.Inthisway,allthereferencesoriginallybelowtheelementwillseethe

1O

1990017242-014

!

I same number of context switching points above them. The context switching count of the

I promoted element then is reset to one. (See Figures 1 and 2.)

The array, Mt[C][B][S], is used to record the total number of potential victims of all

!involuntary context switching points. If Ml[c] [b][s] is equal to n at the end of the execution,

I it indicates that for all caches (d, b, s), d >_c, n of all the kits will potentially be converted

I to misses due involuntary contex_ switching. The average number of potential victims perinvoluntary context switch for a cache of configuration (c, b, s) is,

I 1 c

M,[jl[b][s]. (7)V, = -_ .ffi!

The miss ratio for a cache of configuration (c, b, s) under multiprogramming is expressed in

I Equation 8.

p(c,b,s)---1- _ _,R[b]- _ g_][b][S]- _ Mv[i][b][s]- _ (8)jf¢ j=0

I 2.3 Trace collection

I A method of collecting the trace of a benchmark program is to annotate executable with

I special probe instructions. As these probes are executed, local-scope dynamic behavior is

recorded. Such a method is termed, keyhole ezperimentation, to emphasize that it is a dy-

!namic dual of retargetable _peephole" techniques used for local optimization [15]. Keyhole

I ezperimentation has been used to generate profiles of the programs' behavior, although the

I potential for more than just profile information gathering exists. The keyhole probes canbe placed by the compiler (e.g., GPROF [16]), the assembler (e.g., TRAPEDS [17]), or a

I separate object-code modifier (e.g., PIXIE [18]). Using keyhole experimentation at the com-

I pilerlevelisofgreatestusetoarchitects,sincethecompilerpossessesinformationaboutthe

| 11

1990017242-015

!program'sdataand instructionstructurebeforeoptimization.The AichitectsWorkbench I

(CARA), created by Flynn at Stanford [19], is one such tool. The System Parameter lnde- I

pendent Keyhole Experimenter (SPIKE) is a compiler-independent tool similar to CARA,

constructedbytheauthors.The currentversionofSPIKE hasbeenfittedintotheGNU CC

compiler [201,since GNU CC is capable of producing code for a variety of architectures. I

3 Experimental results I

The success of a cache performance evaluation method depends on its practicality. To

Table 1: The benchmark set.

Benchmark No. referencesDescription"' grep 4.1M The grepprogramfromUnix,used

fora searchthrough/usr/dict/worclstex 2.7M The TEX typesetter,usingthe'TripTeX'

diagnosticinput

yacc 722K The LALR(1) parser_generator from Unix,I ,_withthe grammar from make) as input

investigatethepracticalityoftherecurrence/conflictmethod,a setofbenchmarkprograms

was compiledfortheMC68020 and theirinstructionreferencebehaviorwas instrumented

usingSPIKE. The benchmarksaresummarizedinTable1.

The run timefortherecurrence/conflictmethod was comparedagainstDineroIll,a

reliablepublic-domaincachesimulatorconstructedby Mark Hill.Theseresultsarepresented

inTable2.The minutesof(user-mode)run timewerecollectedforeachbenchmarkusing

an unloadedSun 3/280.Note thatMthough texhad approximately1.2M lessreferences

thangrep,ittooklongertorun. Thisisdue tothenatureofstackalgorithms:theless

localitypresentina program,thelargertheaveragestackdepth.The worstslowdownwas

12

1990017242-016

!!

Table 2: Running time versus Dinero III.

I .... Time Average.....Recurrence/ Dinero III ratio

I Benchmark Conflict model (21,4,oo) (21,4,0j XCM/Dinerogrep h38 0:21 0:20 [ " 4.8tex 4:35 0:17 0:16 16

I yacc 0:53 0:03 0:03 18

I by a factor of 18. However, given tLat the design space explored contained approximately

I 31 x 10 x 5 = 1500 cache dimensions for each level in the memory system's hierarchy, therecurrence/confict model has a great advantage over conventional simulation.

I Since rex had the most interesting locality, results of the miss ratio for rex are presented

I in Figure 3. Set associativity is represented as a solid line for S = 0, a dotted line for S = 2,

and a dashed line for S = oo. (Because of its high performance, S = oo is only visible in the

I graph of B = 3.) After the execution of the recurrence/conflict method, the time required

I to generate the entire set of miss ratios for Figure 3 was under a second of user time. As

i many points as feasible were checked using Dinero, and all agreed with 100% accuracy.To see the effects of multiprogramming, tez was evaluated assuming B = 3, for _ =

I 100%, 90%, 80% and Q = 100,1000. The results are presented in Figure 4. Unfortunately,

I voluntary context switching information is not yet available in SPIKE at the time of this

writing. Therefore, only the results of involuntary context switching were evaluated. The

I results are presented as the difference between miss ratios of uniprogramming and of mul-

l tiprogramming (Ap). Note that the preemption interval dominated for Q = 1000, whereas

i the percentage of flushed context (_) had a large effect for Q = 100. This implies that, fortez, beyond a certain Q saved context has little bearing on instruction cache performance.

I

1990017242-017

i!!II

B=3 B=4 B=5 •

0.3 t 0.3 0.3 Imlb

0.2- 0.2 - 0.2 - D

P P P III

0.1- 0.i- O.l-

\ -0 l I 0 l I b_ l I _

0 i0 20 30 0 i0 20 30 0 10 20 30C C C -

B=6 B=7 B=8

0.3. 0.3 0.3 -

b

0.2- 0.2- 0.2-

P P P

0.1 0.1- 0.1-

0 i I 0 i I i 0 Li I0 10 20 30 0 10 20 30 0 10 20 30

C C CFigure3:Mirsratiosforfezforvario_tscachedimensions.

14

m

I

1990017242-018

IIII

= 100%,Q = 100 _ = 90%,Q = 100 _ = 80%,Q = 100

I

I Ap Ap Apo,o5- o.o5- 0,05-

I ' /I 0 'J I i 0 '/ i i 0 t i0 i0 20 30 0 I0 20 30 0 i0 20 30

C C C

I _ = 100%,Q = 1000 _ = 90%,Q = 1000 _ = 80%, ¢ = 1000

, tI 0.1 0. i - 0.1 -

i Ap Ap Ap0.05 - 0.05 - 0.05 -

I o _, , o-_--::- o _I I

I 0 I0 20 30 0 I0 20 30 0 I0 20 30C C C

I Figure 4: Multiprograrnming miss ratios for rex.

!!I 15

1990017242-019

4 Conclusions

This paper has presented a method to evaluate efficiently a very large design space for

cache memories. When used to evaluate cache hierarchies satisfying the inclusion property

(see [21]), an entire memory system can be evaluated in one pass. Although the algo-

rithm presented omitted the issues of write-back and sector-mapping for brevity, the stack

algorithm extensions of Thc:npson and Smith axe compatible with the recurrence/conflict

method [8]. Hence, the method is general. The method was shown to be efficient and hence

profitable to use.

The recurrence/conflict method is applicable to both design and architectural research.

Combined with design criteria such as described in [6], there is the potential of an automated

memory system design process. Since it evaluates a large memory system design space in

one pass, techniques for architectural studies into other interacting system tradeoffs can

be simplified and broadened in scope. Hence, there are a large number of future research

directions possible using the recurrence/conflict method.

The inclusion of context switching effects into the method is an advance of previous

work as it cleanly seperates the behavior of the benchmarks from the multiprogrammed

performance characteristics they exhibit. Previous approaches involved measuring snapshots

of actual multiprograrnming and using these traces for cache simulation. Such approaches

are restricted to phenomenological conclusions since the mix of executing processes and

the interprocess timings are not adjustable after measurement. This illustrates a powerful

feature of the recurrence/conflict model of the miss ratio: external effects that degrade cache

performance, such as context switching or coherence protocol invalidations, can be modeled

16

1990017242-020

!I as additional types of conficts, thereby isolating the performance of different design tradeoffs.

I [For interested parties, _ stable version of the tool written in portable C is freely available

from the authors.]

I:!!!I!I!I!!!IiI| 1T

1990017242-021

!Addendum I

This report was preseuted for review to the International Symposium on Computer Archi- I

tecture Program Committee in November of 1989. In December of 1989, Mark Hill and IAlan Smith published an article in IEEE Transactians on Computers, entitled, "Evaluating

associativity in CPU caches" (see [22]). Although Hill and Smith did not m_e '.h_ dis- I

tinction between recurrences and conflicts, the presented algorithm is similar to the RCM I

method. Since it is common in Science for two distinct research groups to discover an idea,

and common also for each group to have different insight, this report is being made available I

to present our insights into stack-based memory hierarchy analysis. The material in this I

report discussing evaluation of multiprogramming effects (context switching) is our own and Inot present in the Hill and Smith paper.

- T. M. Conte and W. W. Hwu, March, 1990 I

I!!!m

i8

1990017242-022

!

I Acknowledgements

I The authorswouldliketothankSadun Anik,DavidGriffithand allmembers oftheIM-

I PACT researchgroupfortheirsupport,commentsand suggestions.Thisresearchhasbeen

supportedbytheNationalScienceFoundation(NSF)underGrantMIP-8809478,a donation

I fromNCR, theNationalAeronauticsand SpaceAdministration(NASA) underContract

I NASA NAG 1-613incooperationwiththeIllinoisComputerlaboratoryforAerospaceSys-

I terns and Software (ICLASS), and the Office of Naval Research under Contract N00014-88-K-0656.

III!IIIIII!

1990017242-023

!References I

[1] A. J. Smith, "Bibliography of readings on CPU cache memories azad related topics," iComput. A_,'chitecturc News, vol. 14, pp. 22-42, Jan. 1986. |

[2] J. R. Goodman and W.-C Hsu, "On the use of registers vs. cache to minimize memory R

traffic," in Proc 13th Arms. Int'l Syrup. on Comput. Arch., pp. 375--383, Jan. 1986. i

[3] R. J. £ickenmeyer and J. H. Patel, "Performance evaluation of on-chip register andcache organizations," in Proc. 15th Annu. Int'l Syrup. on Comput. Arch., (Honolulu,Hawaii), pp. 64--72, May 1988.

ill

[4] P. J. Denning and S. C. Schwartz, "Properties of the working-set model," Communica- Itions ACM, vol. 15, pp. 191-198, Mar. 1972. i

[5] A. Agarwal, M. Horo',vitz, and J. Hennessy, "An analytical cache model," ACM Trans.Computcr Systems, vol. 7, pp. 184-215, May 1989.

[6] S. Przybylski. M. Homwitz, and J. Hennessy, "Characteristics of performance-optimalmulti-level cache hierarchies," in Proc. 16th Annu. Int'l Syrup. on Comput. Arch.,(Jerusalem, Israel), pp. 114-121, June 1989.

[7] R. L. Mattson, J. Gercsei, D. R. Slutz, and I. L. Traiger, "Evalutation techniques for Istorage hierarchies," IBM Systems J., vol. 9, no. 2, pp. 78--117, 1970.

[8] J. G. Thompson and A. J. Smith, "Efficient (stack) algorithms for analysis of write-back Iand sector memories," ACM Trans. Computer Systems, vol. 7, pp. 78--117, Feb. 1989.

[9] I. L. Traiger and D. R. Slutz, "One-pass techniques for the evaluation of memory hier- Iarchies," IBM Research Report RJ 892, IBM, San Jose, CA, July 1971.

[10] A. J. Smith, "Cache memories," ACM Computin9 Surveys, vol. 14, no. 3, PF- 473-530, I1982.

[II]I.J.Haikala,"Cachehitratioswithgeometrictaskswitchintervals,"inProc.llth IAnnu. lat'lSyrup.on Comput.Arch.,(Ann Arbor,MI),pp.364-371,June 1984.

J

[12] M. C. Easton, "Computation of cold.start miss ratios," IEEE Trans. Computers, vol. C- I27, pp. 404--408, May 1978. J

[13] G. S. Shedler and D. R. Slutz, "Derivation of miss ratic_ for merged access streams," •IBM J. Research and Development, vol. 20, pp. 505--517, Sept. 1976.

[14]M. C. Eastonand R. Fagin,_Cold-startvs.w_rm-startmissr_fios,"Communications -"ACM, vol. 21, pp, 866-872, Oct. 1978.

[15] J. A. Davidson and C. W. Fraser, "The design and application of a retargetable peepholeoptimizer," ACM Trans. Prog. Lang. and Systems, vol. 2, pp. 191-202, Apr. 1980. __

v

20

i

1990017242-024

!I [16] S. L Graham, P. B. Kessler, and M. K. McKusick, "gprof: A call graph execution

pr ."__.-r,"in Proe. 1982 SIGPLAN Syrup. on Compiler Con,traction, pp. 120--12e,,June

I 1982.[17] C. B. Stunkel and W. K. Fuchs, "TRAPEDS: producing traces for mu.deomputers via

I execution driven simulation," in Proc. ACM SIGMETRICS '89 and PERFORMANCE'89 Int 'l Conf. on Measurement and Modeling of Comput. Sits., (Berkeley, CA), pp. 70--78, May 1989.

I [18] MIPS Computer Systems, MIPS language programmer's guide, 1986.

I [19] C. L. Mitchell and M. J. F'ynn, "A workbench for computer architects," Design f_ Test,pp. 19-29, Feb. 88.

i [20] R. M. Stallmam Using and porting GNU CC. Free Software Foundation, Inc., 1989.[21] J.-L. Boer and W.-H. Wang, "Architectural choices for multi-level cache hierarchies,"

in Proc. 16th [nt'l Conf. on Parallel Processing, pp. 258-261, Aug. 1987.

I [22] M. D. Hill and A. J. Smith, "Evaluating associativity in CPI! caches," IEEE TransComputers, vol. C-38, pp. 1612-1630, Dec. 1989.

!I!!!!!

!I

1990017242-025