
[IEEE Comput. Soc International Parallel and Distributed Processing Symposium (IPDPS 2003) - Nice, France (22-26 April 2003)] Proceedings International Parallel and Distributed Processing


A Statistical Approach to Branch Modeling in Static Program Performance Prediction

Hasyim Gautama and Arjan J.C. van Gemund
Faculty of Information Technology and Systems

Delft University of Technology
P.O. Box 5031, NL-2600 GA Delft, The Netherlands

{h.gautama, [email protected]}

Abstract

Current static performance prediction methods have been less successful in statistically accounting for program workload distribution due to input data set variability, of which data-dependent branches are usually the most important contributors. While data-dependent basic block execution time is often characterized in terms of, e.g., mean and variance, branching conditions still are typically characterized by only one parameter, usually known as the truth probability. In this paper we propose and evaluate three statistical approaches to modeling branching behavior, to be used within a compositional method to predict program execution time distribution. The approaches are coined the Empirical, the Bernoulli, and the ARP (Alternating Renewal Processes) approach. While the Empirical approach is based on measuring branching behavior in terms of the surrounding loop construct, the other approaches aim at deriving a statistical model of the branch itself, which enables a higher level of compositionality. Our measurement results, based on synthetic as well as on real programs, show that the Empirical approach delivers the highest accuracy, whereas the alternative approaches trade accuracy for compositionality. For Markovian branches, the compositional approaches deliver high prediction accuracy. In contrast to intuition and our synthetic experiments, in real programs the two-parameter ARP approach does not always outperform the one-parameter Bernoulli approach.

1. Introduction

A challenging problem in the performance analysis of parallel and distributed systems is to predict the execution time of data-dependent parallel programs. For analytic simplicity, task workloads are often assumed to be constant (deterministic), thus allowing a relatively simple, one-value prediction. However, for highly data-dependent programs,

such as sorting programs and simulation programs, knowledge about the execution time distribution can be crucial, for instance in time-critical (real-time) applications where only some percentage of runs (or none) is allowed to exceed some execution time threshold. Hence program performance prediction using stochastic parameters is more effective and realistic than using deterministic parameters [16].

In [6, 7] we have shown that the execution time distribution of task and data parallel compositions can be accurately predicted at very low cost in terms of the mean, variance and higher statistical moments, provided that the execution time distribution of the constituent tasks can be predicted. In this research it has also been shown that parallel program performance is highly dependent on the distribution (e.g., variance) of the tasks' execution times. Consequently, a critical success factor in parallel program performance prediction is the accuracy of sequential program distribution prediction.

Sequential program distribution is determined by basic block execution time distribution and stochastic control flow, comprising data-dependent loop bounds and branches. Consider the following loop:

for (i = 1; i <= N; i++)
  if (C)
    S;

that contains a branch and a statement (block). In general, the execution time distribution of the loop depends on the distribution of the loop bound N, the distribution of the branch condition C, and the distribution of the execution time X of the statement S. When only mean values are concerned, performance prediction is relatively simple. Assuming that the mean truth ratio of the branch condition C is p, averaged over a representative set of input data vectors, the mean value of the program execution time T is predicted by

E[T] = E[N] p E[X]. (1)
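To make Eq. (1) concrete, the following sketch (ours, not part of the paper) checks the mean prediction E[T] = E[N] p E[X] by Monte Carlo simulation of the loop; the distributions chosen for N, C and X are illustrative assumptions.

```python
# Illustrative Monte Carlo check of Eq. (1): E[T] = E[N] * p * E[X].
# The distributions for N, C and X below are our own assumptions.
import random

def simulate_loop(n_runs=20000, p=0.3, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_runs):
        N = rng.randint(50, 150)            # data-dependent loop bound, E[N] = 100
        T = 0.0
        for _ in range(N):
            if rng.random() < p:            # branch condition C with truth ratio p
                T += rng.uniform(0.5, 1.5)  # execution time X of S, E[X] = 1
        total += T
    return total / n_runs

# Eq. (1) predicts E[T] = 100 * 0.3 * 1.0 = 30; the simulated mean
# should agree up to Monte Carlo noise.
```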

Computing the variance of T, let alone the distribution of

0-7695-1926-1/03/$17.00 (C) 2003 IEEEProceedings of the International Parallel and Distributed Processing Symposium (IPDPS’03)


T, however, is an extremely complex problem.

There are a number of probabilistic approaches to program execution time analysis, the specific trade-off between analysis cost and prediction accuracy depending on how workload distributions are represented. For instance, let the conditional composition of C and S in the above example have execution time distribution Y, and let N be deterministic (i.e., data-independent). Then the exact solution for T would still require an N-fold convolution of the cumulative density function of Y [8]. While the solution is exact, the complexity involved with program-wide convolution of anything but the simplest of distributions prohibits practical application.

Recently, a statistical approach to static program performance prediction has been presented where distributions of variables such as T, N, and X are represented in terms of the first four statistical moments, for example X = (E[X], Var[X], Skw[X], Kur[X]), which denote the mean, variance, skewness and kurtosis of X, respectively. The method has been successfully applied to the analysis of sequential composition [5] and parallel composition [6, 7]. The first four moments of the execution time T of an arbitrary composition of sequential and/or parallel loops are analytically expressed at O(1) solution complexity in terms of the first four moments of the basic block execution times and the loop bounds.

While sequential and parallel composition is relatively straightforward, conditional composition poses additional problems. Although the moments calculus underlying conditional composition is well understood [5], statistically characterizing branching behavior turns out to be far from trivial. This is also illustrated by the above example, where the branch has been modeled in terms of a derived parameter p, in contrast to parameters such as N and X which can be directly measured. Yet, in our compositional approach we aim at statistically modeling the behavior of C, rather than resorting to modeling, e.g., the entire sequential composite, in which case we would have to sacrifice analytical information on the effect of N and X.

In this paper we address the following question: how can branches be statistically modeled such that the model parameters can be accurately determined by measurement, and subsequently be successfully used within our compositional approach to execution time distribution prediction? Reflecting the large impact of data-dependent branching on program execution time distribution, as well as the difficulty of integrating branching into a statistical framework, we propose and evaluate three different approaches to statistically modeling data-dependent branching behavior. The approaches are coined the Empirical approach, the Bernoulli approach, and the ARP (Alternating Renewal Processes) approach, respectively. The Empirical approach is inspired by taking into account that p actually has a distribution

of its own when measured across the spectrum of input data sets (i.e., a branch is characterized by a stochastic parameter P, rather than the deterministic parameter p). In the Bernoulli approach a branch is simply modeled by a Bernoulli model with parameter p. In the ARP approach, a branch is modeled in terms of Alternating Renewal Processes, which extends the Bernoulli approach by modeling a branch in terms of two parameters, rather than one. To the best of our knowledge, such an in-depth study on statistical branch modeling in static performance prediction has not been conducted before.

The remainder of the paper is organized as follows. In Section 2, we review current approaches towards modeling branching behavior. In Section 3, we derive the execution time of conditional composition in terms of the first four moments. In Sections 4, 5, and 6 we describe the Empirical approach, the Bernoulli approach, and the ARP approach, respectively, followed by a discussion in Section 7. In Section 8, we test our method using synthetic branch distributions as well as distributions measured from real applications. Finally, in Section 9 we draw some conclusions.

2. Branch Modeling

Reflecting the large impact of branches on program behavior, there has been quite a lot of work on branch behavioral modeling, although from an entirely different perspective. Branch modeling is applied for various reasons. In dynamic approaches branching behavior is modeled in order to predict branch outcomes such that performance loss due to, e.g., pipeline stalls, can be minimized. This approach is often referred to as branch prediction. The branch outcome can be predicted at run time based on the execution history during the previous n outcomes [12, 17], or statically at compile time based on either branch profile information [20] or heuristics [3, 4] (e.g., prediction based on its location). Although the dynamic approach is different from ours, the behavioral branch models used are a valuable source of modeling information (see Section 8.2).

In static approaches, such as our statistical performance prediction approach, branching behavior is predicted at compile time with the intention to predict program execution time rather than branch outcomes at run time. Typically, branches are only modeled in terms of their mean truth probability. Approaches to this effect include the work of Adve and Vernon [2], Lester [10], Van Gemund [9], Sarkar [15], and Wagner, Maveric, Graham and Harrison [19]. In these approaches, branch probabilities are effectively assumed to have zero variance, while, in fact, the variance across the spectrum of data sets can be considerable. In the approach by Trivedi [18] variance is taken into account by modeling branch probabilities in terms of a Bernoulli distribution. In our work we essentially generalize over the above static approaches.

3. Methodology

As the branches we consider are typically executed in a loop (nest), we consider the following conditional composition sequence:

for (i = 1; i <= N; i++)
  if (C)
    S;

as the basic kernel for which we derive the execution time distribution, denoted T, in terms of the first four moments. Let the truth probability of branch condition C be modeled by random variable P, while the workloads of the loop bound N and the statement S are represented by random variables N and X, respectively. Also let the conditional composition of C and S have execution time distribution Y. In [5] it is proven that the execution time of the above loop is given by (sequential composition)

E[T^r] = E[ (d^r/dt^r) ( Σ_{j=0}^{r} t^j E[Y^j] / j! )^N |_{t=0} ], (2)

where (conditional composition)

E[Y^r] = E[ (d^r/dt^r) ( Σ_{j=0}^{r} t^j E[X^j] / j! )^P |_{t=0} ]. (3)

Note that E[X^r] denotes the r-th raw moment of random variable X, from which the derivation of the first four moments is straightforward. Since we focus on the branch, for simplicity we let the statement S take unit execution time, i.e., X = 1. From Eq. (2) the first four moments of T are given by [5]

E[T] = E[N] E[P], (4a)

Var[T] = E[N] Var[P] + E[P]^2 Var[N], (4b)

Skw[T] = (E[N] Skw[P] Std[P]^3 + E[P]^3 Skw[N] Std[N]^3 + 3 E[P] Var[P] Var[N]) / Std[T]^3, (4c)

Kur[T] = (E[N] Var[P]^2 (Kur[P] - 3) + E[P]^4 Kur[N] Var[N]^2 + 6 E[P]^2 Var[P] (Skw[N] Std[N]^3 + E[N] Var[N]) + 4 E[P] Var[N] Skw[P] Std[P]^3 + 3 Var[P]^2 (E[N]^2 + Var[N])) / Std[T]^4. (4d)

Thus, given the first four moments of N and P, we can evaluate the execution time distribution of the above loop using Eq. (4). However, as mentioned in Section 1, obtaining the moments of P is far from trivial. In the next three sections we present different approaches to modeling branching in terms of P.
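As an illustration of how Eq. (4) is used, the sketch below (our own transcription, not code from the paper) propagates the first four moments of N and P to those of T; moment tuples follow the (mean, variance, skewness, kurtosis) convention.

```python
# Illustrative transcription of Eqs. (4a)-(4d); not code from the paper.
import math

def loop_moments(N, P):
    """Moments of T for the loop kernel (X = 1); N and P are
    (mean, variance, skewness, kurtosis) tuples."""
    EN, VN, SN, KN = N
    EP, VP, SP, KP = P
    sdN, sdP = math.sqrt(VN), math.sqrt(VP)
    ET = EN * EP                                        # Eq. (4a)
    VT = EN * VP + EP**2 * VN                           # Eq. (4b)
    sdT = math.sqrt(VT)
    ST = (EN * SP * sdP**3 + EP**3 * SN * sdN**3
          + 3 * EP * VP * VN) / sdT**3                  # Eq. (4c)
    KT = (EN * VP**2 * (KP - 3)
          + EP**4 * KN * VN**2
          + 6 * EP**2 * VP * (SN * sdN**3 + EN * VN)
          + 4 * EP * VN * SP * sdP**3
          + 3 * VP**2 * (EN**2 + VN)) / sdT**4          # Eq. (4d)
    return ET, VT, ST, KT
```

For a deterministic loop bound N = (10, 0, 0, 0) and a Bernoulli branch with p = 0.5 (moments (0.5, 0.25, 0, 1)), this yields E[T] = 5 and Var[T] = 2.5.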

4. Empirical Approach

In this section we present an empirical approach to modeling branching which is inspired by how branches are measured in practice. Given input data sets, we can study branch patterns using the basic loop kernel from Section 3. For instance, consider the branch outcome patterns in Table 1, expressed in terms of true (t) and false (f). For simplicity the table shows two streams (S) with the same length, i.e., N = 30, although typically N is stochastic (data dependent). In the table, the first stream shows much more switching between t and f than the second stream, although T is not much different.

S  Branch outcome pattern          N   T
1  ttfftffttfffttttfttfffttttfftf  30  16
2  tttttttfffffffftttttttffffffff  30  14

Table 1. Branch outcome patterns in terms of true and false.

To obtain the branch probability we use a random variable Pm, as proposed in [11], defined by the ratio between the loop execution time T and the loop bound N according to

Pm = T / N. (5)

Similar to [11], for simplicity we will consider Eq. (5) for deterministic (i.e., data-independent) loop bounds N, since deriving the moments of Pm for stochastic N is mathematically very complex [13]. The mean and variance of Pm are given by

E[Pm] = E[T/N] = E[T]/E[N], (6a)

Var[Pm] = E[(T/N)^2] - E[T/N]^2 = Var[T]/E[N]^2. (6b)

Similarly we can also show that Skw[Pm] = Skw[T] and Kur[Pm] = Kur[T].

The above results express the moments of Pm in terms of the measured moments of N and T. In this empirical approach, in turn, we can of course simply reproduce the moments of T from the moments of N and Pm. However, since Pm is determined in terms of a loop composition, rather than in terms of an isolated branch, Pm is not directly applicable to our compositional method in Eq. (3). Yet, Pm

can serve as the basis for our compositional method. As an estimator for P, we introduce a truth probability Pe, the 'e' denoting our empirical approach, by relating Eq. (6)


with Eq. (4) for Var[N] = 0. Then the moments of Pe are defined according to

E[Pe] = E[T]/E[N], (7a)

Var[Pe] = Var[T]/E[N], (7b)

Skw[Pe] = Skw[T] (E[N])^{1/2}, (7c)

Kur[Pe] = (Kur[T] - 3) E[N] + 3. (7d)

Eq. (7) allows us to determine Pe in terms of N and T. Then by substituting Pe into Eq. (4), we obtain the first four moments of T, denoted Te.

Clearly, the Empirical approach delivers high accuracy, provided N is deterministic. However, although the moments of Pe have been derived empirically, the branching behavior itself has not been modeled. Consequently, we have no insight into the effect of N and X, which is the aim of our compositional method (once the isolated branch were modeled, we could still vary N and X). Hence in the next sections we study the Bernoulli and ARP approaches. The Empirical approach, however, serves as a reference model to which the alternative approaches will be compared.
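A sketch of how Eq. (7) could be applied to profile data (the function below is our own illustration, not from the paper): given the loop execution times T measured over the input data sets and a deterministic loop bound N, it returns the four moments of Pe.

```python
# Illustrative estimator for the moments of Pe via Eq. (7); assumes
# a deterministic loop bound N. Not code from the paper.
import statistics as st

def empirical_P(T_samples, N):
    ET = st.fmean(T_samples)
    VT = st.pvariance(T_samples)
    sdT = VT ** 0.5
    # population skewness/kurtosis of the measured T sample
    skwT = st.fmean([(t - ET) ** 3 for t in T_samples]) / sdT**3 if sdT else 0.0
    kurT = st.fmean([(t - ET) ** 4 for t in T_samples]) / sdT**4 if sdT else 3.0
    return (ET / N,                    # Eq. (7a)
            VT / N,                    # Eq. (7b)
            skwT * N ** 0.5,           # Eq. (7c)
            (kurT - 3.0) * N + 3.0)    # Eq. (7d)
```

For the two streams of Table 1 (T = 16 and T = 14, N = 30) this gives E[Pe] = 0.5.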

5. Bernoulli Approach

An intuitive approach to statistically modeling branches is to assume P to be a Bernoulli trial with a deterministic parameter p that equals the average value of the truth frequency ratio as profiled across an input data set training corpus. The Bernoulli random variable Pb is defined by

Pb = 1, if C = true; 0, if C = false. (8)

In other words, random variable Pb maps true and false branch outcomes to ones and zeros, respectively. For instance, consider Table 2, which shows the branch outcome patterns corresponding to those in Table 1, expressed in terms of ones and zeros. Table 2 also shows that p = 0.5 as determined from the entire measurement.

S  Branch outcome pattern          p
1  110010011000111101100011110010  0.5 (over both streams)
2  111111100000000111111100000000

Table 2. Branch outcome patterns in terms of ones and zeros.

For a Bernoulli variable with parameter p, the r-th raw moment of Pb is given by

E[Pb^r] = p. (9)

From Eq. (9) the first four moments are given by

E[Pb] = p, (10a)

Var[Pb] = p(1 - p), (10b)

Skw[Pb] = p(1 - p)(1 - 2p) / Std[Pb]^3, (10c)

Kur[Pb] = p(1 - p)(1 - 3p + 3p^2) / Std[Pb]^4. (10d)

Eq. (10) shows that the moments of Pb can be evaluated from the single value p. Taking Pb as estimator for P in Eq. (4) we obtain the first four moments of T, denoted Tb.
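Eq. (10) is trivial to implement; the helper below is a sketch of ours for completeness.

```python
# Moments of the Bernoulli estimator Pb from Eq. (10); illustrative sketch.
def bernoulli_P(p):
    var = p * (1.0 - p)                                          # Eq. (10b)
    sd = var ** 0.5
    skw = p * (1.0 - p) * (1.0 - 2.0 * p) / sd**3                # Eq. (10c)
    kur = p * (1.0 - p) * (1.0 - 3.0 * p + 3.0 * p**2) / sd**4   # Eq. (10d)
    return p, var, skw, kur

# For the streams of Table 2, p = 0.5 gives moments (0.5, 0.25, 0.0, 1.0).
```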

By definition, a Bernoulli model is only appropriate for branches that do not depend on previous outcomes, while many practical branches are far from memoryless. For instance, consider a cyclic branch which generates 1010...1010 across the training data sets, i.e., p = 0.5. While obviously Var[T] = 0 for N constant, in contrast, Eqs. (10b) and (4b) predict Var[Tb] = 0.25 E[N] (linear in E[N]). Consequently, the Bernoulli model is often incapable of providing a statistical explanation for program execution time variance (and higher moments) as caused by many branches in practice. Rather than using a single value p, in the next section we present an alternative branching model based on two parameters.

6. ARP Approach

The application of Alternating Renewal Process (ARP) theory to performance prediction of a conditional composition sequence is based on the fact that the branching process behaves as an ARP. That is, the branch outcomes can be considered as consecutive trues (up time U) or falses (down time D) from which the total execution time distribution T can be derived. For instance, consider Table 3, which shows the branch outcome patterns corresponding to those in Table 1, expressed in terms of U and D. Each cluster of true and false outcomes is labeled according to the cluster length and accounted for as a sample of U and D, respectively, each of which comprises nine samples as shown in the table. From the samples we can determine the mean and variance of U and D, from which the branch probability can be evaluated as described in what follows.

S  Branch outcome patterns
1  U → 2 1 2 4 2 4 1        E[U] = 3.3   Var[U] = 4.9
   D → 2 2 3 1 3 2 1
2  U → 7 7                  E[D] = 3.3   Var[D] = 6.7
   D → 8 8

Table 3. Branch outcome patterns in terms of U and D.
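The U and D samples of Table 3 are obtained by splitting each outcome stream into maximal runs; a minimal helper (ours, for illustration) could look as follows.

```python
# Split a branch outcome string into the alternating run lengths
# U (trues) and D (falses) used by the ARP approach. Illustrative sketch.
from itertools import groupby

def run_lengths(pattern):
    U, D = [], []
    for outcome, group in groupby(pattern):
        (U if outcome == 't' else D).append(sum(1 for _ in group))
    return U, D

# Stream 1 of Table 1 yields U = [2, 1, 2, 4, 2, 4, 1] and
# D = [2, 2, 3, 1, 3, 2, 1], matching Table 3.
```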

To obtain the branch probability P we use the estimator Pa based on ARP. Since in ARP it is customary to address the analysis in terms of total down time TD rather than


up time TU, we will in the following derive the moments of Qa = 1 - Pa. Note that TU is equal to the execution time distribution T.

Following [11], a random variable is introduced which is defined as the conditional probability of TD(t)/t, t denoting the time, given that TD(t) > 0, according to

Qa(t, x) = P[ TD(t)/t ≤ x | TD(t) > 0 ]. (11)

Note that t is comparable with N in the conditional composition sequence. In the sequel of this section we simply write Qa rather than Qa(t, x). The reason for representing Qa using Eq. (11) is twofold. First, Eq. (11) is a valid distribution to model branching since Eq. (11) lies in the interval [0, 1]. Second, Eq. (11) represents known properties of TD(t) from which the exact mean and variance of TD(t) can be easily derived [11]. In [11] it is shown that the density of Qa can be approximated using the beta distribution according to

fQa(x) ≈ (Γ(a+b) / (Γ(a) Γ(b))) x^(a-1) (1-x)^(b-1) (12)

for 0 < x < 1, a > 0, b > 0, where

a = [E[D]^3 (E[U] - Var[U]) + E[U]^2 E[D] (E[D] - Var[D])] / [(E[D] + E[U]) (E[D]^2 Var[U] + E[U]^2 Var[D])]

and b = (E[U] / E[D]) a. (13)

Note that the beta distribution exhibits a great diversity of shapes in a finite interval, so that the approximation does not limit our analysis since Qa is by definition finite in the interval [0, 1]. The r-th raw moment of Qa is given by

E[Qa^r] = Γ(a+b) Γ(a+r) / (Γ(a) Γ(a+b+r)). (14)

From Eqs. (13) and (14) the mean and variance of Qa can be expressed in terms of the input moments of D and U as follows:

E[Qa] = E[D] / (E[D] + E[U]), (15a)

Var[Qa] = (E[D]^2 Var[U] + E[U]^2 Var[D]) / (E[D] + E[U])^3. (15b)

Extending this result derived in [11], we derive the skewness and kurtosis of Qa according to

Skw[Qa] = -2 (E[D] - E[U]) (E[D] + E[U])^2 Std[Qa] / g1(D, U), (15c)

Kur[Qa] = 3 (E[D] + E[U])^2 [ E[D]^2 E[U]^2 + (2 (E[D] - E[U])^2 + E[D] E[U]) - Var[Qa] (E[D] + E[U])^2 ] (g1(D, U) g2(D, U))^{-1}, (15d)

where gn(D, U) = E[D]^2 (E[U] + n Var[U]) + E[U]^2 (E[D] + n Var[D]). Note that the random variables P and Q are linearly correlated, i.e., P = 1 - Q. Consequently,

E[P] = 1 - E[Q], Var[P] = Var[Q], Skw[P] = -Skw[Q] and Kur[P] = Kur[Q]. (16)

Eqs. (15) and (16) allow us to express the first four moments of Pa in terms of the input moments E[U], Var[U], E[D], and Var[D]. Then by substituting Pa into Eq. (4) we obtain the first four moments of T, denoted Ta.
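For the first two moments, the ARP estimator reduces to a few arithmetic operations on the measured up/down time moments; the sketch below (ours, not from the paper) combines Eqs. (15a), (15b) and (16).

```python
# Mean and variance of the ARP estimator Pa from the up/down time
# moments, via Eqs. (15a), (15b) and (16). Illustrative sketch.
def arp_P_mean_var(EU, VU, ED, VD):
    EQ = ED / (ED + EU)                                 # Eq. (15a)
    VQ = (ED**2 * VU + EU**2 * VD) / (ED + EU)**3       # Eq. (15b)
    return 1.0 - EQ, VQ                                 # Eq. (16)

# Geometric U and D with p = 0.5 (E[U] = E[D] = 2, Var[U] = Var[D] = 2)
# reproduce the Bernoulli moments E[P] = 0.5, Var[P] = 0.25.
```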

Since the ARP approach takes the clustering effect in branch outcome patterns into account, the approach applies to semi-Markovian branches [14]. Compared to the Bernoulli model, the semi-Markovian model covers a much wider class of distributions, since the up and down times (U and D) may take any valid distribution. If U and D are geometrically distributed, the semi-Markovian model reduces to the Bernoulli model.

7. Discussion

Since the Bernoulli and ARP approaches use a Bernoulli random variable and a beta distribution, respectively, as branching model, a correct application of both approaches is limited to independent branching trials and semi-Markovian workloads [14], respectively. In turn, however, in contrast to the Empirical approach, by modeling the isolated branches the two approaches preserve the compositionality of our moment method. Hence, N may be stochastic for both approaches. In Tables 1, 2 and 3 we have shown how Pe, Pb and Pa are evaluated from the branch outcome patterns by taking samples. In the measurement the sample values may be sensitive to permutations of true and false outcomes. For instance, permutations within a loop (in one data set), and between data sets, are not allowed in the ARP and the Empirical approaches, respectively. In contrast, in the Bernoulli approach the measurement of p is insensitive to any permutation.

8. Experimental Results

8.1 Synthetic Workloads

In this section we evaluate the three approaches by predicting the execution time distribution of the synthetic loop kernel mentioned in Section 3 for different standard distributions for P. To exclude other potential sources of error, we choose N deterministic while the statement S takes unit execution time, i.e., X = 1. Since the mean of T for all approaches is exact, we only consider the higher moments of T. In this paper we exclusively focus on the execution time variance, next to the mean being the most important


moment considered in the prediction of parallel composition [1]. The variance error is defined as follows:

ε = | Var[Tm] - Var[Tp] | / Var[Tm], (17)

where Tm and Tp are values obtained from measurement and prediction, respectively. In our experiments we study the variance accuracy of Eq. (4b) when using the three approaches for obtaining P, i.e., Pe (the Empirical approach, Eq. (7)), Pb (the Bernoulli approach, Eq. (10)), and Pa (the ARP approach, Eq. (15)).

In our first experiment we choose a genuine Bernoulli branch, i.e., P = Pb, to test whether all three approaches yield the same results. It is obvious that Pe = Pb, while Pa is obtained as follows. According to Bernoulli behavior, let D and U be geometrically distributed with parameter p such that E[U] = 1/(1-p), Var[U] = p/(1-p)^2, E[D] = 1/p and Var[D] = (1-p)/p^2. From Eq. (16) it directly follows that the moments of Pa in Eq. (15) conform to those of Pb in Eq. (10). Thus P = Pe = Pb = Pa. Although the use of the beta distribution makes the ARP approach essentially an approximation method, the moments exactly agree with those of the Bernoulli model.

In the second experiment, we apply the approaches to deterministic branches, i.e., some predetermined number of true and false evaluations (e.g., cyclic branches). Clearly, for deterministic branches the variance of execution time is equal to zero (Var[Tm] = 0). To generate the branches let D and U be deterministic according to E[D] = d, Var[D] = 0 and E[U] = u, Var[U] = 0. For Pe in the Empirical approach, Eq. (7) yields

E[Pe] = u/(d+u), Var[Pe] = 0, Skw[Pe] = 0 and Kur[Pe] = 3, (18)

since in Eq. (6) Pm = u/(d+u) is constant. The same, correct, moment values are obtained for the ARP approach from Eq. (15). By substituting Pe and Pa into Eq. (4) we obtain Var[Te] = Var[Ta] = 0. For the Bernoulli approach it follows that

p = u/(d+u). (19)

The prediction based on the Bernoulli estimator Pb results in a dramatic maximum variance error of ε → ∞ (Var[Tm] = 0 in Eq. (17)) for d = u, i.e., p = 1/2, and yields a minimum error of ε → 0 as p → 0 or p → 1.

As a final experiment, we let D and U take discrete uniform distributions with sample spaces [aD, bD] and [aU, bU], respectively, where 1 ≤ aX ≤ bX. The mean and variance of D and U are given by E[X] = (aX + bX)/2 and Var[X] = (bX - aX)^2/12. We choose aD = aU = 1 while we vary bD and bU. To obtain Tm we simulate the branching process for N = 50,000 using 6,000 data sets. For the Empirical approach Te is equal to Tm (ε = 0) since N is constant. For the Bernoulli approach we obtain p = (1 + bU)/(2 + bD + bU). Subsequently applying p in Eqs. (10) and (4), the variance error of the Bernoulli approach for various values of bD and bU is shown in Figure 1. The figure shows that ε is extremely large for small bD and bU, while slowly decreasing to zero for increasing bD and bU. For bD = bU = 1 it holds that ε → ∞ since P is deterministic, as discussed previously. Clearly, for these non-memoryless branches the Bernoulli model is useless.

Figure 1. ε [%] for the Bernoulli method (plotted against bU for bD = 1 to 5).

In contrast to the Bernoulli approach, the ARP approach has negligible error, as shown by Figure 2. For all values of bD and bU the error is consistently below the 1% range, which is within the noise margin. Note that the ARP approach is clearly applicable to all workloads.

Figure 2. ε [%] for the ARP method (plotted against bU for bD = 1 to 5).

We summarize our findings as follows. All approaches are correct for branches with the memoryless Bernoulli workload. When the workload is no longer memoryless, the Empirical and ARP approaches can still be applied. The results show that both approaches consistently perform better than or equal to the Bernoulli approach, effectively reducing the variance error by more than an order of magnitude.


8.2 Markovian Workloads

In this section we evaluate the three approaches on branching streams generated from state diagrams adopted from [21], as shown in Figure 3, where the branching behavior is not memoryless. Originally these state diagrams were Moore finite-state machines (FSMs) used to predict dynamic branch behavior. The current state determines whether the next branch is predicted to be taken, and the subsequent state transition depends on whether the branch was actually taken. Designed to achieve high prediction performance, the FSMs represent a model of the assumed branching behavior, which makes them useful as a basis for synthesizing branches. In fact, these models have been successfully proven to resemble real program branches [12, 17, 20, 21]. To enable the use of the state diagrams in our experiment, we modify the FSMs such that streams of ones and zeros can be generated, according to the following Markov interpretation shown in Figure 3. The output stream is determined by the state, while the transition between two states is determined by taking a Bernoulli sample with the parameter shown on the corresponding arc. Experiments have indeed shown that, when supplied with a branch stream generated by our generators, the original branch predictors achieve hit rates ranging from 90% to 96%, similar to the ones reported in [21]. This proves the validity of our interpretation. (An identical branch behavior could also have been achieved by a different interpretation where the truth probability is associated with the states rather than the arcs.) Note that the arc values are chosen such that all nodes will be visited with probability greater than 0.

Figure 3. Markov chains of the branch trace generator (two- and four-state chains (a)-(e), with transition probabilities 0.1, 0.5 and 0.9 on the arcs).
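As an illustration of the Markov interpretation, a two-state generator in the spirit of Figure 3(a) can be sketched as follows (our own code; the 0.9/0.1 arc values follow the figure, the remaining details are assumptions).

```python
# Branch stream generator for a two-state Markov chain in the spirit
# of Figure 3(a): each state emits its label (1 = taken, 0 = not taken)
# and stays in place with probability p_stay. Illustrative sketch.
import random

def markov_branch_stream(n, p_stay=0.9, seed=1):
    rng = random.Random(seed)
    state = 1
    stream = []
    for _ in range(n):
        stream.append(state)
        if rng.random() >= p_stay:     # switch state with probability 1 - p_stay
            state = 1 - state
    return stream
```

Streams generated this way show the long runs of identical outcomes that make the Bernoulli model fail while the ARP model still applies.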

As is to be expected, the variance error of the Empirical approach is exactly equal to zero, while that of the other two approaches is shown in Figure 4. Applying Eqs. (10b) and (4b), the variance error of the Bernoulli approach is approximately between 90% and 100%, showing that the Bernoulli method is indeed inadequate to reflect Markovian branches. In sharp contrast to the Bernoulli approach, the ARP approach yields

exact results using Eqs. (15b) and (4b). For small N the error is due to inaccuracies in measuring U and D. Again, the ARP approach is much better than the Bernoulli approach, and has the same accuracy as the Empirical approach in capturing the branch behavior.

Figure 4. ε [%] for the Markovian workloads (Bernoulli and ARP errors plotted against N for the chains of Figures 3a-3e).

8.3 Empirical Workloads

In this section we illustrate to what extent the branching models apply in practice. We have measured and predicted T for codes containing stochastic branches for 6,000 input data sets, using a counter-based measurement technique. Our experiments include simple kernels, namely Vector Scaling, Straight Selection Sort (SSS) and Cache Simulator, as well as applications found in practice, namely Gaussian Elimination, Single Source Shortest Path (SSSP), Quicksort (iterative fashion), and Compress from SPECint95. Other benchmarks in SPECint95 are not considered in our experiments due to the absence of stochastic branches and/or an insufficient number of input data sets. In all benchmarks we only consider the data-dependent branches. In particular, for Compress we only consider the branches with high invocation frequency.

We summarize our results in Table 4, including the error percentages for Te, Tb, and Ta. The highest measured variance Var[Tm] is 2.0 · 10^8, for Compress b2. Significant errors are produced by SSSP b2, Quicksort b2 and b6, and Compress b4, which are branches that execute a break statement, in which case the effective loop frequency is highly stochastic as well as fully correlated with P. Clearly, for this branch category our conditional composition models cannot be applied. Other highly correlated branches producing large errors, without executing break, are Cache Simulator (CS), SSSP b1 and b3, and Compress b1 and b2. For this branch category the large error is due to correlation between the loop and the branch. However, when the loop frequency is constant, the Empirical approach can be applied, as in SSS and Quicksort b1. Of the 19 branches shown in Table 4, 8 can be categorized as Markovian branches, i.e., Vector Scaling (VS), Gaussian Elimination (GE), Quicksort b3, b4, b5, and b7, and Compress b3 and b5. Of those, 5 can be categorized as Bernoulli branches, i.e., VS, Quicksort b3, b4, and b7, and Compress b3. Thus, our compositional approaches apply to almost half of the branches considered.

Experiment        ε [%]
                  Te       Tb       Ta
VS                0.0      0.0      0.0
SSS               0.0      76       32
CS                27       27       77
GE                0.56     32       0.56
SSSP b1           14       79       6,500
SSSP b2           1,300    860      96,000
SSSP b3           14       79       6,500
Quicksort b1      0.0      570      190
Quicksort b2      680      920      940
Quicksort b3      0.82     1.6      1.1
Quicksort b4      0.0      0.0      0.0
Quicksort b5      0.55     3.5      0.15
Quicksort b6      85       960      450
Quicksort b7      3.3      2.3      3.1
Compress b1       38       47       47
Compress b2       11       13       14
Compress b3       0.15     0.69     0.70
Compress b4       430      570      730
Compress b5       0.0      71       7.8

Table 4. Var[Tm] and ε [%] for empirical workloads.

0-7695-1926-1/03/$17.00 (C) 2003 IEEE. Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'03)

9 Conclusion

Data-dependent branches are an important source of program execution time variability across the spectrum of possible input data sets. In this paper we have evaluated the Empirical, the Bernoulli, and the ARP approach to modeling branching behavior, to be used within our compositional method to predict program execution time distribution in terms of statistical moments.

Our measurement results, based on synthetic as well as on real programs, show that the Empirical approach delivers the highest accuracy, whereas the analytic approaches trade accuracy for compositionality. For the synthetic workloads studied, the ARP approach is equal to or better than the Bernoulli approach, while for Markovian branches (including memoryless Bernoulli branches) the ARP approach delivers excellent results. Also for real programs, 8 out of the 19 stochastic branches studied exhibit Markovian behavior and are thus amenable to the ARP approach. This incidentally implies that the applicability of the Markov-based statistical branching models popular in branch prediction is less than expected.

References

[1] V. Adve and M. Vernon. The influence of random delays on parallel execution times. In SIGMETRICS '93, pages 61–73, May 1993.

[2] V. Adve and M. Vernon. A Deterministic Model for Parallel Program Performance Evaluation. Report TR98-333, Rice University, Mar. 1998.

[3] T. Ball and J. Larus. Branch prediction for free. ACM SIGPLAN Notices, 28(6):300–313, June 1993.

[4] B. Calder, D. Grunwald, D. Lindsay, J. Martin, M. Mozer, and B. Zorn. Corpus-based static branch prediction. In PLDI '95, pages 79–92, La Jolla, 18–21 June 1995.

[5] H. Gautama and A. v. Gemund. Static performance prediction of data-dependent programs. In WOSP '00, pages 216–226, Sept. 2000.

[6] H. Gautama and A. v. Gemund. Low-cost performance prediction of data-dependent data parallel programs. In MASCOTS '01, pages 173–182, Aug. 2001.

[7] H. Gautama and A. v. Gemund. Toward performance estimation of data-dependent task parallel composition. In UKPEW '02, pages 81–92, July 2002.

[8] E. Gelenbe, E. Montagne, R. Suros, and C. Woodside. Performance of block-structured parallel programs. In M. Cosnard et al., editors, Parallel Algorithms and Architectures, pages 127–138. North-Holland, Amsterdam, 1986.

[9] A. v. Gemund. Symbolic performance modeling of parallel systems. IEEE Trans. PDS, Feb. 2003 (to appear).

[10] B. Lester. A system for the speedup of parallel programs. In ICPP '86, pages 145–152. IEEE, Aug. 1986.

[11] E. Muth. A method for predicting system downtime. IEEE Trans. on Reliability, R-17(2):97–102, June 1968.

[12] S.-T. Pan, K. So, and J. Rahmeh. Improving the accuracy of dynamic branch prediction using branch correlation. ACM SIGPLAN Notices, 27(9):76–84, Sept. 1992.

[13] T. Pham-Gia and N. Turkkan. System availability for a gamma alternating renewal process. Naval Research Logistics, 46:822–844, 1999.

[14] S. Ross. Stochastic Processes. John Wiley & Sons, Inc., New York, USA, second edition, 1996.

[15] V. Sarkar. Determining average program execution times and their variance. In PLDI '89, pages 298–312, 1989.

[16] J. Schopf and F. Berman. Performance prediction in production environments. In IPPS/SPDP '98, pages 647–653, Mar. 30 – Apr. 3, 1998.

[17] J. Smith. A study of branch prediction strategies. In ISCA '81, pages 135–148, May 1981.

[18] K. Trivedi. Probability and Statistics with Reliability, Queuing and Computer Science Applications. Prentice-Hall, New Jersey, 1982.

[19] T. Wagner, V. Maverick, S. Graham, and M. Harrison. Accurate static estimators for program optimization. SIGPLAN Notices, 29(6):85–96, June 1994.

[20] Y. Wu and J. Larus. Static branch frequency and program profile analysis. In MICRO-27, pages 1–11, 1994.

[21] T. Yeh and Y. Patt. Alternative implementations of two-level adaptive branch prediction. In ISCA '92, pages 124–134, 1992.
