05-Evaluating Scalability Parameters



5 Evaluating Scalability Parameters

    With four parameters I can fit an elephant.

    With five, I can make his trunk wiggle!

    John von Neumann

    5.1 Introduction

In this chapter we are going to take the theoretical discussion of Chap. 4 and show how it can be applied to assessing the scalability of a particular hardware platform running a well-defined workload. Scalability, especially application scalability, is a perennial hot topic. See for example the following Web links discussing the scalability of:

Linux: lse.sourceforge.net
P2P: www.darkridge.com/jpr5/doc/gnutella.html
PHP: www.oreillynet.com/pub/wlg/5155
Windows: www.techweb.com/wire/story/TWB19991015S0013

Yet few authors are able to quantify the concept of scalability. The two-parameter universal scalability model

    C(p) = p / (1 + σ(p − 1) + κ p(p − 1)) ,   (5.1)

derived in Chap. 4, predicts the relative capacity:

    C(p) = X(p) / X(1) ,   (5.2)

for any workload running on p physical processors.

For the purposes of this chapter, the main points established in Chap. 4

can be summarized as follows. Scalability, as defined quantitatively by (5.1), is a concave function. The independent variable is the number of physical processors p belonging to each hardware configuration. It is assumed that there are N homogeneous processes executing per processor, such that the ratio N/p is held constant across all processor configurations. In other words, each additional processor is assumed to be capable of executing another N processes. The concavity of C(p) is controlled by the parameters σ and κ in (5.1). The parameter σ is a measure of the level of contention in the system, e.g., waiting on a database lock, while the parameter κ is a measure of the coherency delay in the system, e.g., fetching a cache line.

Remark 5.1. If κ = 0, C(p) reduces to Amdahl's law. This is why Amdahl scaling is necessary (to explain contention) but not sufficient (it cannot exhibit a maximum).
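Because (5.1) is used throughout the chapter, a minimal sketch of it as code may help readers who want to check spreadsheet results numerically. Python is used here purely as a validation aid (the chapter itself works in Excel), and the helper name usl_capacity is our own:

```python
def usl_capacity(p, sigma, kappa):
    """Relative capacity C(p) from Eq. (5.1) for p processors,
    contention parameter sigma, and coherency parameter kappa."""
    return p / (1 + sigma * (p - 1) + kappa * p * (p - 1))

# Remark 5.1: with kappa = 0 this reduces to Amdahl's law,
# C(p) = p / (1 + sigma * (p - 1)), which has no maximum.
print(usl_capacity(64, 0.05, 5e-6))   # ~ 15.3 for the parameters found later
```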

In Sect. 5.5 of this chapter, we calculate the universal scalability parameters based on benchmark measurements which span a set of processor configurations up to and including the full complement of processors allowed on the backplane. In Sect. 5.7, we consider how these scalability predictions differ when only a relatively small subset of processor configurations is measured, the latter being the most likely situation in practice. All of the examples are presented using Excel (Levine et al. 1999).

    5.2 Benchmark Measurements

We begin by describing the particulars of the benchmarked system. More details can be found in (Atkison et al. 2000).

    5.2.1 The Workload

The ray-tracing benchmark (Atkison et al. 2000) (available from sourceforge.net/projects/brlcad/) referred to throughout this chapter consists of computing six reference images from computer-aided design (CAD) models (Fig. 5.1). These optical renderings are compared with reference images and verified for correctness. The images consist of 24-bit RGB pixel values (8 bits for each color channel). Images are considered correct if pixel channel values differ by no more than 1 from the reference. This accommodates the variability due to differences in floating-point representation of the various CPU architectures.

The six reference models used represent differing levels of complexity. The first three (moss, world, and star) represent simple validation-of-correctness tests. The latter three models (Bldg391, M35, and Sphflake) strongly resemble those used in production analysis. The inclusion of these models assures that there is a strong correlation between performance on the benchmark and performance in actual use.

The benchmark reports the number of ray-geometry intersections performed per second during the ray-tracing phase of the rendering. Time spent reading the model into memory and performing setup operations is not included in the performance evaluation. This gives a more realistic prediction of actual performance, since typical applications spend relatively little time in I/O and setup compared to ray tracing. The amount of time the benchmarks spend doing I/O and setup is about equal to that spent ray tracing.


Fig. 5.1. Reference images used in the ray-tracing benchmark: Moss, World, and Star (top row); Bldg391, M35, and Sphflake (bottom row).

Table 5.1 summarizes the ray-tracing benchmark results. It is generally more illuminating to plot these data, as shown in Fig. 5.2, in order to reveal the shape of the throughput curve. The homogeneity of the ray-tracing benchmark workload makes it a very useful candidate for analysis using our universal capacity model (5.1). To that end, we now present in detail how to evaluate the σ and κ parameters from the data in Table 5.1.

    Table 5.1. Ray tracing benchmark results on a NUMA architecture

Processors p    Throughput X(p)
 1                20
 4                78
 8               130
12               170
16               190
20               200
24               210
28               230
32               260
48               280
64               310


    5.2.2 The Platform

The benchmark platform is an SGI Origin 2000 with 64 R12000 processors running at 300 MHz. The Origin 2000 runs the IRIX operating system, a UNIX-based operating system with SGI-specific extensions.¹

The Origin 2000 is a nonuniform memory architecture (NUMA) in which the memory is partitioned across each processor. These physically separate memory partitions can be addressed as one logically shared address space. This means that any processor can make a memory reference to any memory location. Access time depends on the location of a data word in memory (Hennessy and Patterson 1996). The performance did not increase uniformly with the number of processors. Three of the benchmark codes exhibited a definite ceiling in performance between p = 10 and p = 12 processors. All of the benchmark codes showed considerable variability in each experimental run.

Fig. 5.2. Scatter plot of the benchmark throughput data in Table 5.1 (x-axis: Processors (p), 0–64; y-axis: KRays per Second, 0–350).

Each processor board contained two CPUs, local memory, and two remote memory access ports. Up to six processors can be interconnected without a router module to the crossbar switch. Up to eight CPUs can be placed in a single backplane. To accommodate more than eight CPUs, multiple backplanes are interconnected through their respective memory buses. Accesses from one CPU module to another are satisfied in the order of the bus distance to the CPU module containing the desired data. Thus, when large numbers of CPUs contend for common data, CPUs in the neighborhood of a cluster can starve remote CPUs of access. The great variability of performance at higher numbers of CPUs can be explained by the distribution of the computation across the available nodes of the system. Work appears to be distributed randomly across the available CPUs, but during execution the operating system may reschedule work on a different CPU without any apparent reason.

¹ At the time of writing, SGI has announced that production of its IRIX/MIPS line will end on 29th December 2006.

    5.2.3 The Procedure

Below, we summarize the steps to be carried out in the remainder of this chapter for analyzing the scalability of the ray-tracing benchmark. As the following steps are carried out, they can be incorporated into an Excel spreadsheet like that shown in Fig. 5.3.

The procedural steps for the calculation of σ, κ, and C(p) are as follows (a code sketch of the data-preparation steps appears after this list):

1. Measure the throughput X(p) for a set of processor configurations p.
2. Preferably include an X(1) measurement and at least four or five other data points.
3. Calculate the capacity ratio C(p) defined in (5.2) (Sect. 5.4).
4. Calculate the efficiency C/p, and its inverse p/C (Fig. 5.3).
5. Calculate the deviation from linearity (Sect. 5.5.2).
6. Perform regression analysis on this deviation data to calculate the quadratic equation coefficients a, b, c (Sect. 5.5).
7. Use these coefficients a, b, c to calculate the scalability parameters σ, κ (Sect. 5.6.2).
8. Use the values of σ, κ to calculate the processor configuration p* where the maximum in the predicted scalability occurs, even though it may be a theoretical configuration (Sect. 5.6.3). p* is defined by (4.33) in Chap. 4.
9. Use σ and κ to predict the complete scalability function C(p) over the full range of p processor configurations (Fig. 5.7).

    5.3 Minimal Dataset

In this chapter we are going to use a form of polynomial modeling (Box et al. 1978, p. 482). An nth-degree polynomial in the variable x has n + 1 terms, such that:

    y = αₙxⁿ + · · · + α₁x + α₀ .   (5.3)

The basic idea is to estimate the value of y by determining the n + 1 coefficients on the right-hand side of (5.3).


Measured   KRays/Sec   RelCap        Efficiency   Inverse   Linearity   Deviation
CPU (p)    X(p)        C=X(p)/X(1)   C/p          p/C       p-1         (p/C)-1
 1          20           1.00         1.00         1.00      0            0.00
 4          78           3.90         0.98         1.03      3            0.03
 8         130           6.50         0.81         1.23      7            0.23
12         170           8.50         0.71         1.41     11            0.41
16         190           9.50         0.59         1.68     15            0.68
20         200          10.00         0.50         2.00     19            1.00
24         210          10.50         0.44         2.29     23            1.29
28         230          11.50         0.41         2.43     27            1.43
32         260          13.00         0.41         2.46     31            1.46
48         280          14.00         0.29         3.43     47            2.43
64         310          15.50         0.24         4.13     63            3.13

Fig. 5.3. Example spreadsheet including the normalized capacity, efficiency, and linear deviation calculations

Polynomial models are sometimes referred to ambiguously as linear or nonlinear models. This is because (5.3) is linear with respect to the model coefficients αᵢ, since the degree of each coefficient is equal to one, but it is nonlinear with respect to the x-variables, since there are terms with exponents greater than one. We want to estimate the value of the model coefficients αᵢ. How many data points do we actually need to estimate the parameters in our universal scalability function? There are two ways to look at this question.

    5.3.1 Interpolating Polynomial

One possibility for estimating the parameters in the universal scalability model is to require the polynomial curve to pass through all the data points. The simplest polynomial is a straight line:

    y = α₁x + α₀ ,   (5.4)

which has one variable (x) and two coefficients. We need at least two data points to unambiguously determine the slope and the y-intercept.

    The general rule is that we need at least n+1 data points to unambiguouslydetermine an nth-degree polynomial. In Table 5.1 there are eleven data points;therefore we could unambiguously determine a tenth-degree polynomial.

As we shall see in Sect. 5.5.1, our universal scalability model can be associated with a second-degree polynomial, or quadratic equation. From the standpoint of unambiguously determining the interpolating polynomial, we should only need three data points. However, because the measured throughputs involve systematic and statistical errors (see Chap. 3), we cannot expect those data to lie exactly on the curve corresponding to the universal scalability model. For this, we need regression analysis.

    5.3.2 Regression Polynomial

Regression analysis is a technique for estimating the values of the coefficients in (5.3) based on statistical techniques to be described in Sect. 5.6. In general, we need more than the three data points required to determine a simple interpolating polynomial.

On the other hand, you do not necessarily require a data set as complete as that in Table 5.1. In Sect. 5.7, we discuss how to perform scalability analysis with fewer measured configurations. In any event, it is advisable to have at least four data points to be statistically meaningful. This is to offset the fact that it is always possible to fit a parabola through three arbitrary points. Hence, four data points should be considered the minimal set.
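The three-point caveat is easy to demonstrate numerically. The sketch below, using NumPy and entirely hypothetical points, shows that a quadratic always passes exactly through any three points with distinct x-values, so a perfect three-point fit carries no statistical information:

```python
import numpy as np

# Three arbitrary (hypothetical) points with distinct x-values
x = np.array([1.0, 2.0, 5.0])
y = np.array([3.0, -1.0, 4.0])

coeffs = np.polyfit(x, y, deg=2)      # exact interpolating parabola
print(np.polyval(coeffs, x) - y)      # residuals ~ [0, 0, 0]
```

A fourth point is the first one that can actually disagree with the parabola, which is why it is the minimal statistically meaningful set.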

    5.4 Capacity Ratios

Having collected all the benchmark throughput measurements in one place, let us now turn to the second step in the procedure outlined in Sect. 5.2. Referring to the benchmark data in Table 5.1, we first calculate the relative capacity C(p) for each of the measured configurations. The single-processor throughput is X(1) = 20 kRays/s. Therefore, the capacity ratio (5.2) is:

    C(1) = X(1) / X(1) = 20 / 20 = 1.0 .

    Similarly, with p = 64 processors:

    C(64) = X(64) / X(1) = 310 / 20 = 15.50 .

    All intermediate values of C(p) can be calculated in the same way.

Remark 5.2. This result already informs us that the fully loaded 64-way platform produces less than one quarter of the throughput expected on the basis of purely linear scaling.

Additionally, we can compute the efficiency C/p and the inverse efficiency p/C for each of the measured configurations. The results are summarized in Table 5.2. We are now in a position to prepare the benchmark data in Excel for nonlinear regression analysis (Levine et al. 1999) of the type referred to in Sect. 1.4.

    5.5 Transforming the Scalability Equation

Unfortunately, since (5.1) is a rational function (see Definition 4.8), we cannot perform regression analysis directly using Excel. As we shall see in Sect. 5.6, Excel does not have an option for this kind of rational function in its Trendline dialog box (Fig. 5.5). We can, however, perform regression on a transformed version of (5.1). The appropriate transformation is described in the following sections.


    Table 5.2. Relative capacity, efficiency, and inverse efficiency

p     C      C/p    p/C
 1    1.00   1.00   1.00
 4    3.90   0.98   1.03
 8    6.50   0.81   1.23
12    8.50   0.71   1.41
16    9.50   0.59   1.68
20   10.00   0.50   2.00
24   10.50   0.44   2.29
28   11.50   0.41   2.43
32   13.00   0.41   2.46
48   14.00   0.29   3.43
64   15.50   0.24   4.13

    5.5.1 Efficiency

As a first step, we divide both sides of (5.1) by p to give:

    C(p)/p = 1 / (1 + σ(p − 1) + κ p(p − 1)) .   (5.5)

This is equivalent to an expression for the processor efficiency (cf. Definition 4.5 in Chap. 4).

    5.5.2 Deviation From Linearity

Second, we simply invert both sides of (5.5) to produce:

    p/C(p) = 1 + σ(p − 1) + κ p(p − 1) .   (5.6)

This form is more useful because the right-hand side of (5.6) is now a simple second-degree polynomial in the p variable. Overall, (5.6) is a quadratic equation having the general form:

    y = ax² + bx + c   (5.7)

defined in terms of the coefficients a, b, and c. The shape of this function is associated with a parabolic curve (see, e.g., www-groups.dcs.st-and.ac.uk/history/Curves/Parabola.html). The general shape of the parabola is controlled by the coefficients in the following way:

- The sign of a determines whether (5.7) has a minimum (a > 0) or a maximum (a < 0).
- The location of the minimum or maximum on the x-axis is determined by −b/2a.
- c determines where the parabola intersects the y-axis.

Equation (5.7) is the nonlinear part of the regression referred to in Sect. 5.4. Excel, as well as most other statistical packages, can easily fit data to such a parabolic equation.

    5.5.3 Transformation of Variables

Finally, we need to establish a connection between the polynomial coefficients a, b, c in (5.7) and the parameters σ, κ of the universal model (5.1). Note, however, that we have more coefficients than parameters. In other words, we have more degrees of freedom in the fitting equation than the universal model allows. Since we are not simply undertaking a curve-fitting exercise, we need to constrain the regression in such a way that:

- There are only two scaling parameters: σ, κ.
- Their values are always positive: σ, κ ≥ 0.

This can most easily be accomplished by reorganizing the inverted equation (5.6) by defining the new variables:

    Y = p/C(p) − 1 ,   (5.8)

and

    X = p − 1 ,   (5.9)

so that (5.6) becomes:

    Y = κX² + (σ + κ)X .   (5.10)

Notice that (5.10) now resembles (5.7) with c = 0, which means that the y-intercept occurs at the origin (see Fig. 5.4). Eliminating c is the necessary constraint that matches the number of regression coefficients to the two parameters in (5.1).
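The algebra behind (5.10) is easy to verify symbolically. The following sketch, assuming SymPy is available, confirms that substituting X = p − 1 into the right-hand side of (5.10) recovers the deviation p/C(p) − 1 given by (5.6):

```python
import sympy as sp

p, sigma, kappa = sp.symbols('p sigma kappa', positive=True)

# Y = p/C(p) - 1 from (5.6)
Y_from_5_6 = sigma * (p - 1) + kappa * p * (p - 1)

# Y = kappa*X**2 + (sigma + kappa)*X from (5.10), with X = p - 1
X = p - 1
Y_from_5_10 = kappa * X**2 + (sigma + kappa) * X

print(sp.simplify(Y_from_5_6 - Y_from_5_10))   # -> 0, the two forms agree
```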

This change of variables, (5.8) and (5.9), is only necessary if you do the regression analysis in Excel, since it does not offer rational functions as a modeling option. The transformed variables can be interpreted physically as providing a measure of the deviation from ideal linear scaling discussed in Sect. 5.5.2. Table 5.3 shows some example values for the ray-tracing benchmark described in Sect. 5.2.

The parabolic shape of the regression curve (5.10) can be seen clearly in Figs. 5.10 and 5.13. It is not so apparent in Fig. 5.7 because the overall deviation from linearity is milder and the regression curve is therefore very close to a straight line. This is explained further in Sect. 5.5.4.

Theorem 5.1. In order for the scalability function (5.1) to be a concave function, the deviation from linearity (5.7) must be a convex function, viz., a parabola (Fig. 5.4).


    Proof. Equation (5.7) is the inverse of (5.1). See the discussion in Sect. 5.5.2.

Theorem 5.2. The relationship between the universal model parameters σ, κ and the nonzero quadratic coefficients a, b is given by:

    κ = a ,   (5.11)

and

    σ = b − a .   (5.12)

Proof. Equate coefficients between (5.7) and (5.10):

    a = κ ,   (5.13)

which is identical to (5.11), and

    b = σ + κ .   (5.14)

Substituting (5.13) into (5.14) produces (5.12). The c coefficient plays no role in determining σ and κ because its value was set to zero.

In Chap. 4 we derived (4.33) to calculate the location p* of the maximum in the scalability function C(p). We can also use (5.13) and (5.14) to write it explicitly in terms of the regression coefficients, viz.

    p* = ⌊ √( (1 + a − b) / a ) ⌋ ,   (5.15)

which can be useful as an independent validation in spreadsheet calculations.
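As a sanity check on spreadsheet formulas, the mapping of Theorem 5.2 and Eq. (5.15) can be written in a few lines of Python (the function name usl_params is our own):

```python
import math

def usl_params(a, b):
    """Map the zero-intercept quadratic coefficients of (5.7)
    to sigma, kappa, and p* via (5.11), (5.12), and (5.15)."""
    kappa = a                                          # Eq. (5.11)
    sigma = b - a                                      # Eq. (5.12)
    p_star = math.floor(math.sqrt((1 + a - b) / a))    # Eq. (5.15)
    return sigma, kappa, p_star

# Coefficients reported later in Table 5.4 give the Table 5.5 parameters:
print(usl_params(5e-6, 0.05))    # -> (0.049995, 5e-06, 435)
```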

    5.5.4 Properties of the Regression Curve

It is important that both of the scalability parameters, σ and κ, are always positive in order that the physical interpretation presented in Sect. 4.4.1 be valid. The positivity of these parameters is tied to the values of the regression coefficients a and b, which in turn depend very sensitively on the values of the transformed variables X and Y associated with the measured data. In particular, the requirements:

(i) a, b ≥ 0
(ii) b > a

ensure that σ, κ cannot be negative. They are requirements in the sense that they cannot be guaranteed by the regression analysis. They must be checked in each case. If either of the regression coefficients is negative, in violation of requirement (i), the analysis has to be reviewed to find the source of the error. Similarly, requirement (ii) means that the magnitude of the linear regression coefficient b always exceeds that of the curvature coefficient a.


Fig. 5.4. Schematic representation of the allowed regression curves (5.10) for the deviation from linearity Y defined by (5.8) versus X defined by (5.9). The upper curve corresponds to the most general case of universal scalability (σ, κ > 0) with regression coefficients a, b > 0 in (5.7), viz., a parabola. The middle curve corresponds to Amdahl scaling (κ = 0) with regression coefficients a = 0, b > 0 (straight line). The bottom curve represents the approach to ideal linear scalability with regression coefficients a = 0 and b ≈ 0 (horizontal line).

Corollary 5.1. It follows from Theorem 5.2, together with requirement (ii) (b > a) and the Definition 4.3 of seriality (σ < 1), that

    a < b < a + 1 .   (5.16)

Theoretically, the regression coefficient a > 0 can have an arbitrary value on ℝ⁺, but in practice a ≪ 1 and is often close to zero.

Figure 5.4 depicts the allowed regression curves for (5.10). It shows why the above requirements must hold. The upper curve is a parabola which meets requirement (i) and corresponds to the most general case of universal scalability. The middle curve is a straight line corresponding to universal scalability with κ = 0 or, equivalently, regression coefficients a = 0, b > 0. This regression curve belongs to Amdahl's law, discussed previously in Sect. 4.3.2. The bottom curve is a nearly horizontal line with regression coefficients a = 0 and b ≈ 0, representing the approach to ideal linear scalability previously discussed in Sect. 4.3.1.

Another way to understand the above requirements is to read Fig. 5.4 as though there were a clock-hand moving in the anticlockwise direction, starting at the X-axis. As the clock-hand moves upward from the X-axis, it starts to sweep out an area underneath it. This movement corresponds to the regression coefficient b starting to increase from zero while the a coefficient remains


zero-valued. As the b clock-hand reaches the middle inclined straight line, imagine another clock-hand (belonging to a) starting in that position. At that position, the area swept out by the b clock-hand is greater than the area swept out by the a clock-hand, since the latter has only started moving. Hence, requirement (ii) is met. As we continue in an anticlockwise direction, the two clock-hands move together but start to bend in Daliesque fashion to produce the upper curve in Fig. 5.4.

Having established the connection between the regression coefficients and the universal scalability parameters, we now apply this nonlinear regression method to the ray-tracing benchmark data of Sect. 5.2.

    Table 5.3. Deviations from linearity based on the data in Table 5.2

X = p − 1    Y = (p/C) − 1
 0            0.00
 3            0.03
 7            0.23
11            0.41
15            0.68
19            1.00
23            1.29
27            1.43
31            1.46
47            2.43
63            3.13

    5.6 Regression Analysis

The simplest way to perform the regression fit in Excel is to make a scatter plot of the transformed data in Table 5.3. Once you have made the scatter plot, go to the Chart menu item in Excel and choose Add Trendline. This choice will present you with a dialog box (Fig. 5.5) containing two tabs labeled Type and Options.

    5.6.1 Quadratic Polynomial

The Type tab allows you to select the regression equation against which you would like to fit the data. Usually, there is a certain subjective choice in this step, but not in our case. We are not seeking just any equation to fit the data; rather, we have a very specific physical model in mind, viz., our universal scalability model. Therefore, according to Sect. 5.5.3, we select a Polynomial and ratchet the Degree setting until it equals 2. This choice corresponds to the quadratic equation (5.10) in the regressor variable X.


    Fig. 5.5. Type dialog box for the Excel Trendline

Next, select the Options tab, as shown in Fig. 5.5. The corresponding dialog box for this choice is shown in Fig. 5.6. Select each of the checkboxes as follows:

- Set intercept = 0 (this numerical value may have to be typed in)
- Display equation on chart
- Display R-squared value on chart

The first checkbox forces the c coefficient in (5.7) to be zero, as required by the discussion in Sect. 5.5.3. Checking the second box causes the numerical values of the regression coefficients calculated by Excel to be displayed as part of the quadratic equation in the resulting chart (see Fig. 5.7). Checking the third box causes the corresponding numerical value of R² to be displayed in the chart as well.

Remark 5.3. The quantity R² is known to statisticians as the coefficient of determination (see, e.g., Levine et al. 1999). It is the square of the correlation coefficient, and is interpreted as the percentage of variability in the data that is accounted for by (5.10) and, by inference, the universal seriality model. Values in the range 0.7 < R² ≤ 1.0 are generally considered to indicate a good fit.
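For readers who prefer to validate the Excel fit outside the spreadsheet (as Remark 5.4 below also advises), a least-squares fit of (5.10) can be reproduced with NumPy. This is a sketch under our own naming; note that the R² computed here uses the ordinary centered definition, which can differ slightly from what Excel reports when the intercept is forced to zero:

```python
import numpy as np

# Transformed data from Table 5.3
X = np.array([0, 3, 7, 11, 15, 19, 23, 27, 31, 47, 63], dtype=float)
Y = np.array([0.00, 0.03, 0.23, 0.41, 0.68, 1.00, 1.29, 1.43, 1.46, 2.43, 3.13])

# Design matrix [X^2, X]; omitting a constant column enforces c = 0,
# mirroring Excel's "Set intercept = 0" checkbox.
A = np.column_stack([X**2, X])
(a, b), *_ = np.linalg.lstsq(A, Y, rcond=None)

Y_hat = A @ np.array([a, b])
r2 = 1 - np.sum((Y - Y_hat)**2) / np.sum((Y - Y.mean())**2)
print(a, b, r2)    # a and b should land close to the Table 5.4 values
```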

    5.6.2 Parameter Mapping

The result of this Trendline fit in Excel is shown in Fig. 5.7. The Excel chart shows the transformed data from Table 5.3 along with the fitted quadratic curve (dashed curve), as well as the calculated quadratic equation and the R² value. The calculated a, b coefficients are summarized in Table 5.4. We see that R² = 0.9904, which tells us that better than 99% of the variability in the benchmark data is explained by our scalability model.

Table 5.4. Regression coefficients as reported by Excel in Fig. 5.7 for the deviation from linearity

Coefficient   Value
a             5 × 10⁻⁶
b             0.0500
c             0.0000
R²            0.9904

Fig. 5.6. Options dialog box for the Excel Trendline

The scalability parameters σ and κ can be calculated by simply substituting the values from Table 5.4 into (5.11) and (5.12). The results are summarized in Table 5.5. This ends the regression analysis, but we still need to generate the complete theoretical scaling curve using (5.1) and interpret the significance of the σ and κ values for these benchmark data.

Fig. 5.7. Regression curve (dashed line) for deviation from linearity based on all measured configurations: y = 5 × 10⁻⁶x² + 0.05x with R² = 0.9904 (x-axis: Effective Processors (p − 1); y-axis: Linear Deviation). The exceedingly small value of the x² coefficient a = 5 × 10⁻⁶ indicates that the deviation is very slight and the regression curve is only weakly parabolic.

Table 5.5. Universal scalability parameters calculated by substituting the regression coefficients of Table 5.4 into Eqns. (5.11), (5.12), and (5.15)

Parameter   Value
σ           0.0500
κ           5 × 10⁻⁶
p*          435

Remark 5.4 (Excel Precision Problems). The values computed by Excel in Table 5.5 suffer from some serious precision problems. You are advised to read Appendix B carefully for more details about this issue. This raises a dilemma. It is important from a GCaP standpoint that you learn to perform quantitative scalability analysis as presented in this chapter, and Excel provides a very accessible tool for doing that. However, because of its potential precision limitations, known to Microsoft (support.microsoft.com/kb/78113/), you should always consider validating its numerical results against those calculated by other high-precision tools, such as Mathematica, R, S-PLUS, or Minitab.

    5.6.3 Interpreting the Scalability Parameters

The resulting scalability curve is compared to the original measurements in Fig. 5.8. Several remarks can be made.

Table 5.6 shows that the largest discrepancy between the model and measurement is 10% at p = 4 processors. The contention parameter σ, at 5%, is relatively large, as we saw in our discussion of Amdahl's law in Chap. 4. It was later determined that this high contention factor was due to inefficiencies in the NUMA bus architecture and certain compiler issues. The coherency parameter κ, on the other hand, is relatively small. This suggests that there are likely to be few cache misses, very little memory paging, and so on.

The maximum in the scalability curve is predicted by (5.15) to occur at p* = 435 processors. Clearly, this is a purely theoretical value, since it is almost seven times greater than the maximum number of physical processors that can be accommodated on the system backplane.

Fig. 5.8. Theoretical scalability (dashed line) predicted by (5.1) with the parameter values in Table 5.5, plotted against the measured data (x-axis: Processors (p); y-axis: KRays per Second). It is a concave function. As expected, very few data points actually lie on the curve. Inferior performance of configurations in the range 16 < p < 32 suggests significant performance improvements may be possible.

    5.6.4 Error Reporting

Because the theoretical scalability function in Fig. 5.8 is monotonically increasing, it is preferable to use the absolute errors in Table 5.6 rather than the relative or percentage errors described in Chap. 3. The largest absolute error occurs at p = 12, whereas the last two columns in Table 5.6 show that the largest relative error occurs at p = 4, with the relative errors for larger configurations monotonically decreasing.


Example 5.1. Consider a small p-configuration that has a measured throughput of, say, 4 units and a deviation between the measured and modeled value of 1 unit. This will produce a percentage error of 25%. At larger p-configurations where the measured throughput is, say, 100 units, the same deviation produces a percentage error of only 1%.

The absolute errors or residuals in Table 5.6 provide a point estimate for the error. Error reporting can be improved even further by repeating multiple throughput measurements at each p-configuration. Then, a complete set of statistics can be generated using analysis of variance (ANOVA) in Excel (Levine et al. 1999). We did not have repeated measurements in this case. The F-test indicates the significance of the overall regression (Box et al. 1978). The larger the F-value, the more significant the regression. Statistical p-values from an ANOVA table in Excel provide a significance test of the parameter regression. We take up this ANOVA procedure in more detail in Chap. 8.

    Table 5.6. Error reporting for the scalability data in Fig. 5.8

Processor   Estimated   Measured   Absolute              Relative
p           X̂(p)        X(p)       Error      Residual   Error    Error%
 1            20.00       20.00      0.00        0.00     0.00      0.00
 4            69.56       78.00      8.44        8.44     0.12     12.13
 8           118.50      130.00     11.50       11.50     0.10      9.71
12           154.78      170.00     15.22       15.22     0.10      9.83
16           182.74      190.00      7.26        7.26     0.04      3.97
20           204.94      200.00      4.94       −4.94     0.02      2.41
24           222.98      210.00     12.98      −12.98     0.06      5.82
28           237.93      230.00      7.93       −7.93     0.03      3.33
32           250.51      260.00      9.49        9.49     0.04      3.79
48           285.63      280.00      5.63       −5.63     0.02      1.97
64           306.97      310.00      3.03        3.03     0.01      0.99
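The modeled column and both error columns of Table 5.6 can be regenerated directly from (5.1). In the sketch below (variable names ours; note that the Error% column in Table 5.6 is evidently computed relative to the estimated rather than the measured throughput), the parameters come from Table 5.5:

```python
sigma, kappa, x1 = 0.05, 5e-6, 20.0    # Table 5.5 parameters and X(1)

procs = [1, 4, 8, 12, 16, 20, 24, 28, 32, 48, 64]
xmeas = [20, 78, 130, 170, 190, 200, 210, 230, 260, 280, 310]

for p, xm in zip(procs, xmeas):
    xest = x1 * p / (1 + sigma*(p - 1) + kappa*p*(p - 1))  # modeled X(p)
    resid = xm - xest                                      # signed residual
    pct = 100 * abs(resid) / xest                          # Error% vs. estimate
    print(f"{p:3d} {xest:8.2f} {xm:7.2f} {abs(resid):6.2f} {pct:6.2f}")
```

For example, the p = 4 row yields an estimate of 69.56 and an error of 12.13%, matching the table.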

    5.7 Less Than a Full Deck

You may be thinking that the universal scalability model worked well in the previous example because the measured data set spans intermediate processor configurations up to a fully loaded system containing 64 processors. How well does this regression method work when there is less data? We consider the following typical cases:

(a) A sparse subset of data that is evenly spaced.
(b) A sparse subset of data that is unevenly spaced.
(c) The X(1) datum (used for normalization) is absent.


Keep in mind that one of the advantages of this regression procedure is that it can be applied to sparse data and, more importantly, sparse data from small-processor configurations.

A significant virtue of applying regression analysis to the universal scalability model is that it can make scalability projections based on a limited number of measurements. Modeling projections can be more cost-effective than actual measurements. Remember that there are always significant errors in measured performance data (see Chap. 3). In particular, setting up and performing measurements on small platform configurations is usually much less expensive than on large platform configurations.

    5.7.1 Sparse Even Data

Suppose we only had a subset of the benchmark measurements on the four processor configurations shown in Table 5.7. Note that the corresponding scatter plot appears in Fig. 5.9.

Table 5.7. Benchmark measurements on four evenly spaced small-processor configurations

Processors p    Throughput X(p)
 1               20
 4               78
 8              130
12              170

These data should correspond to a low-load or low-contention region in the scalability curve. Recall from Sect. 5.3 that four data points is the minimum requirement for meaningful regression (see Fig. 5.10). The projected scalability in Fig. 5.11 is strikingly different from Fig. 5.8. The estimated σ value is only one fifth of the original regression analysis in Table 5.5, whereas the estimated κ value is one thousand times higher!

Table 5.8. Regression coefficients and scalability parameters for the evenly spaced small-processor configurations in Table 5.7

Regression Coefficient   Value     Scalability Parameter   Value
a                        0.0023    σ                        0.0103
b                        0.0126    κ                        0.0023
R²                       0.9823    p*                       20
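The same NumPy fit used earlier reproduces Table 5.8 from the four-point subset; only the input arrays change (small differences in the last decimal place are rounding):

```python
import math
import numpy as np

p = np.array([1, 4, 8, 12], dtype=float)        # Table 5.7
xput = np.array([20, 78, 130, 170], dtype=float)

C = xput / xput[0]                              # capacity ratio (5.2)
X, Y = p - 1, p / C - 1                         # transforms (5.9), (5.8)

A = np.column_stack([X**2, X])                  # zero-intercept quadratic (5.10)
(a, b), *_ = np.linalg.lstsq(A, Y, rcond=None)

sigma, kappa = b - a, a                         # (5.12), (5.11)
p_star = math.floor(math.sqrt((1 - sigma) / kappa))
print(round(sigma, 4), round(kappa, 4), p_star)   # ~ 0.0102, 0.0023, 20
```

Swapping in the Table 5.9 data (1, 4, 8, and 28 processors) reproduces the Table 5.10 values in the same way.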


However, R² = 0.9823 suggests that the regression fit is very good. The σ value implies that serial contention is not an issue. Rather, it is the extremely high value of κ which suggests that some kind of thrashing effect is at play, presumably in the memory architecture, and is responsible for pulling the throughput maximum down by a factor of twenty-five.

Fig. 5.9. Scatter plot of benchmark throughput data for the uniformly spaced configurations in Table 5.7 (x-axis: Processors (p), 0–14; y-axis: KRays per Second, 0–180).

Another interpretation of Fig. 5.11 is that scalability is being prematurely throttled around p = 16 processors, but it then displays a tendency to begin recovering above p = 32 processors, which lies outside the measurement range in the scenario we are considering here. Referring to the platform architecture described in Sect. 5.2.2, p = 16 processors implies the existence of two interconnect buses. The temporary adverse impact on scalability observed in the data could be a result of coherency delays to memory references across these multiple buses.


Fig. 5.10. Linear deviation for the benchmark data in Fig. 5.9: y = 0.0023x² + 0.0126x with R² = 0.9823 (x-axis: Effective Processors (p − 1); y-axis: Linear Deviation).

This is precisely how it should be. Predictions based on the universal scalability model depend significantly on the amount of data provided as input to parameterize the model. As with all models, providing either additional data values or new values for previously measured configurations can heavily influence the modeling projections. This is another reason that error reporting (Sect. 5.6.4) is so important.

    5.7.2 Sparse Uneven Data

Next, suppose we have the subset of four nonuniformly spaced data points in Table 5.9 and Fig. 5.12. What effect does this spacing have on the scalability predictions?

The fitted deviation from linearity is shown in Fig. 5.13. Again, these data are expected to correspond to a low-contention region in the scalability curve. The projected scalability in Fig. 5.14 now appears to be more intermediate in shape between that of Fig. 5.11 and Fig. 5.8. However, R² = 0.9979 indicates that the regression fit is still very good.

The estimated σ value in Table 5.10 is now increased to approximately twice the value in Table 5.8. On the other hand, the estimated value of κ is reduced by more than one third of the value in Table 5.8. As a consequence, the maximum in the scalability curve has moved upward to p* = 28 processors.


Fig. 5.11. Predicted scalability for the limited benchmark data in Table 5.7 (black dots) compared with the complete dataset (crosses) in Table 5.1, which is assumed to be unknown in this example (x-axis: Processors (p); y-axis: KRays per Second; series labeled Modeled, Unknown, and Measured). Clearly, the predicted scalability depends quite sensitively on what is known and what is unknown. This is exactly how it should be.

Table 5.9. Benchmark measurements on four small-processor configurations spanning nonuniform intervals

Processors p    Throughput X(p)
 1               20
 4               78
 8              130
28              230

The reason for this impact on the regression analysis derives from the fact that we have more information about the system throughput at larger processor configurations. The fourth data point in Table 5.9 carries information about the slight recovery from memory thrashing noted in Sect. 5.7.1.

5.7.3 Missing X(1) Datum

The procedure described in this chapter is based on normalizing the benchmark throughput measurements to X(1), the single-processor throughput value. Unless the performance measurements are planned with the universal scalability model in mind, it is not uncommon for X(1) to be absent, thus rendering this form of regression analysis unusable. Is there any way around this problem?


Fig. 5.12. Scatter plot of benchmark throughput data for the nonuniformly spaced configurations in Table 5.9 (x-axis: Processors (p), 0–30; y-axis: KRays per Second, 0–250).

Fig. 5.13. Linear deviation for the benchmark data in Fig. 5.12: y = 0.0012x² + 0.0211x with R² = 0.9979 (x-axis: Effective Processors (p − 1); y-axis: Linear Deviation).


Fig. 5.14. Predicted scalability for the limited benchmark data in Fig. 5.12 (black dots) compared with the complete dataset (crosses) in Table 5.1, which is assumed to be unknown in this example (x-axis: Processors (p); y-axis: KRays per Second; series labeled Modeled, Unknown, and Measured). Clearly, the predicted scalability depends quite sensitively on what is known and what is unknown. This is exactly how it should be.

Table 5.10. Regression coefficients and scalability parameters for the unevenly spaced small-processor configurations in Table 5.9

Regression Coefficient   Value     Scalability Parameter   Value
a                        0.0012    σ                        0.0199
b                        0.0211    κ                        0.0012
R²                       0.9979    p*                       28

There are two answers to this question. First, having read this far, you should make every effort to make universal scalability analysis part of your future Guerrilla capacity planning. Second, you can do a form of regression analysis on the raw data without normalizing it, but this may involve more trial and error than described in this chapter.

A simpler rule of thumb is to estimate X(1) by taking the first measured throughput value, say X(p₀), and dividing it by that processor configuration p₀:

    X̂(1) = X(p₀) / p₀ .   (5.17)


This estimate for the p = 1 datum is likely to be too low because it is tantamount to assuming linear scalability in the range p = 1 to p = p₀. Worse yet, (5.17) may cause the regression parabola to develop a negative curvature, i.e., a < 0 in (5.7), in violation of requirement (i) in Sect. 5.5.4. If this situation arises, it can be corrected by manually adjusting (5.17) upward with some, usually small, amount δ, so that the corrected estimate becomes:

    X̂(1) = X(p₀) / p₀ + δ .

Example 5.2. Suppose the X(1) datum was missing from the benchmark data in Table 5.1. We can estimate it using the first available value, X(4) = 78. Using (5.17):

    X̂(1) = X(4) / 4 = 19.50 .   (5.18)

In this case δ = 0.5, so it makes no appreciable difference to the regression analysis in Excel; X(1) and X̂(1) are the same number to 2-sigdigs (see Chap. 3).
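The rule of thumb, together with the δ correction, amounts to the following sketch (the function name is ours):

```python
def estimate_x1(p0, xp0, delta=0.0):
    """Rule-of-thumb estimate (5.17) of a missing X(1) datum from the
    first measured configuration p0, with optional upward correction
    delta if the fitted parabola develops negative curvature."""
    return xp0 / p0 + delta

print(estimate_x1(4, 78))    # 19.5, as in Example 5.2
```

In practice one would fit the regression with the uncorrected estimate first, and only nudge δ upward if the a coefficient comes out negative.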

Finally, it is worth remarking that this kind of estimation scheme may be more applicable for the assessment of software-based scalability, where a single-user load measurement may have its own peculiar set of challenges. We take up this point in Chap. 6.

    5.8 Summary

In this chapter we have applied our universal scalability model (5.1) to a ray-tracing benchmark executing on a distributed-memory NUMA architecture, despite the lack of any terms in the equation which account for that architecture. How is that possible?

Conjecture 4.1 states that the two parameters in the universal model are necessary and sufficient: σ > 0 means that C(p) becomes sublinear, while κ > 0 means it becomes retrograde. A capacity maximum is not evident in the ray-tracing benchmark data, although one is predicted to occur out of range at p* = 435 processors.

Moreover, (5.1) does not contain any explicit terms representing the interconnect, type of processors, cache size, operating system, application software, etc. All that information is contained in the respective regression values of σ and κ. It is in this sense that (5.1) is universal, and therefore must contain NUMA as an architectural subset. As validation, in Sect. 5.6.3 we uncovered bus contention in our analysis that suggested potential performance improvements, which indeed were implemented.

The procedural steps for applying regression analysis in Excel to estimate the scalability parameters were described in Sect. 5.2.3. Rather than requiring three or more parameters, the physical nature of the universal model constrains the number of regression coefficients to just two, thus avoiding von Neumann's criticism.

We now move on to examine how these quantitative scalability concepts can be applied to software applications.