

CPR: Composable Performance Regression for Scalable Multiprocessor Models

Benjamin C. Lee †

Microsoft Research
[email protected]

Jamison Collins, Hong Wang
Intel Corporation

{hong.wang,jamison.d.collins}@intel.com

David Brooks
Harvard University

[email protected]

Abstract

Uniprocessor simulators track resource utilization cycle by cycle to estimate performance. Multiprocessor simulators, however, must account for synchronization events that increase the cost of every cycle simulated and shared resource contention that increases the total number of cycles simulated. These effects cause multiprocessor simulation times to scale superlinearly with the number of cores.

Composable performance regression (CPR) fundamentally addresses these intractable multiprocessor simulation times, estimating multiprocessor performance with a combination of uniprocessor, contention, and penalty models. The uniprocessor model predicts baseline performance of each core while the contention models predict interfering accesses from other cores. Uniprocessor and contention model outputs are composed by a penalty model to produce the final multiprocessor performance estimate. Trained with a production-quality simulator, CPR is accurate, with median errors of 6.63 and 4.83 percent for dual- and quad-core multiprocessors, respectively. Furthermore, composable regression is scalable, requiring 0.33x the simulations required by prior regression strategies.

1. Introduction

Modern simulator infrastructure is ill-suited to handle current technology trends and the transition to multiprocessors. Cycle-accurate multiprocessor simulation inherently lacks scalability; simulation times increase superlinearly with the number of simulated cores and simulators are not easily parallelized [1]. Most challenges in microarchitectural simulation arise from the need for a high degree of synchronization. Detailed uniprocessor simulation proceeds cycle by cycle, tracking resource utilization to produce a detailed estimate of performance at the cost of long simulation times. Just as multiprocessors are built from multiple instantiations of uniprocessor cores, multiprocessor simulators often run multiple instantiations of a uniprocessor simulator. However, multiprocessor simulators must also account for inter-processor synchronization that increases the cost of every cycle simulated (e.g., snoops for coherence) and shared resource contention that increases the total number of cycles simulated (e.g., stalls due to eviction from shared cache lines). Collectively, these effects cause multiprocessor simulation times to scale superlinearly with the number of cores.

† This work was done while B. Lee interned at Intel and studied at Harvard.

Recent advances in applying statistical inference to the microarchitectural domain have enabled qualitatively new capabilities in uniprocessor analysis [2], [3], [4], [5], [6]. Uniprocessor inferential models leverage best known practices in statistical inference for highly efficient simulation and analysis. Trained by simulating points sparsely sampled from the design space, these models are computationally efficient surrogates for cycle-accurate simulation, which capture mappings between design parameters and design metrics. However, these same techniques are unlikely to scale as we consider multiprocessor performance models. While a few hundred training simulations are tractable fixed costs for uniprocessor modeling, they are prohibitively expensive for multiprocessor modeling due to superlinear increases in simulation time. Thus, multiprocessor regression models derived from multiprocessor simulations alone are impractical.

Exacerbating these greater multiprocessor simulation times, multiprogrammed workloads produce combinatorial growth in the number of possible workload combinations. The number of such combinations will increase rapidly with the number of cores. Taking the 18 benchmarks in this paper for example, we are confronted with C(18, 2) = 153 dual-core and C(18, 4) = 3060 quad-core workload combinations. Furthermore, for industrial microprocessor design, 18 benchmarks are only a small representative set from a much more comprehensive study list with over 500 benchmarks. For 500 benchmarks, we could potentially analyze C(500, 2) = 1.3E+05 dual-core and C(500, 4) = 2.6E+09 quad-core workload combinations. Thus, in addition to ensuring scalability as the number of cores increases, an effective multiprocessor modeling framework must also ensure the marginal cost of model training for each combination of workloads is computationally tractable.
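The combinatorial growth above follows directly from the binomial coefficient C(n, k); a few lines of Python reproduce the counts quoted in this section:

```python
from math import comb

def workload_combinations(n_benchmarks: int, n_cores: int) -> int:
    """Number of distinct multiprogrammed workloads: choose one
    benchmark per core, order-insensitive, without repetition."""
    return comb(n_benchmarks, n_cores)

# The 18-benchmark suite used in this paper:
assert workload_combinations(18, 2) == 153    # dual-core
assert workload_combinations(18, 4) == 3060   # quad-core

# A 500-benchmark industrial study list:
print(workload_combinations(500, 2))  # 124750
print(workload_combinations(500, 4))  # 2573031125
```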

To address these fundamental challenges in multiprocessor simulation times, we propose composable performance regression (CPR). CPR extends prior successes in uniprocessor statistical inference to the multiprocessor domain while ensuring training costs remain tractable. The framework controls training costs by abstracting the uniprocessor, taking the core as the primary determinant of multiprocessor performance and considering shared resource contention as a secondary penalizing effect. Within this framework, uniprocessor models are the primary building blocks supported by secondary contention and penalty models. Such a framework is highly efficient, building on the information captured by relatively inexpensive uniprocessor models while requiring far fewer multiprocessor simulations to train contention and penalty models to predict multiprocessor performance.

We define scalability as the rate of increase in training costs when the number of modeled cores increases. Composable models are trained and evaluated separately but composed to estimate a final response. CPR uses composable models to deliver scalability for multiprocessor performance estimates. The following summarizes the key contributions of this work:

1) Industrial Infrastructure: We demonstrate regression models' effectiveness for the Intel Core microarchitectural family, using industry-strength simulators actively employed in product development. We construct these models for a broad array of workloads ranging from multimedia to server applications. (Section 2)

2) Uniprocessor Model: We first derive uniprocessor regression models, demonstrating the accuracy of these basic building blocks before applying them in the multiprocessor model. Furthermore, we demonstrate the application of these models to evolutionary design optimization, re-tuning design parameters after fundamentally new microarchitectural features are added between consecutive product generations. (Section 2.3)

3) Multiprocessor Model: We propose composable performance regression that includes uniprocessor, contention, and penalty models. The uniprocessor model predicts the baseline performance of each core while the contention models predict interfering accesses to shared resources from other cores. Uniprocessor and contention model outputs are composed in a penalty model that produces the final multiprocessor performance prediction. (Section 3)

4) Case for Scalable Multiprocessor Models: In addition to being individually accurate, the uniprocessor, contention, and penalty models combine to predict dual-core and quad-core performance with median errors of 6.63 and 4.83 percent, respectively. Composable regression is a scalable, efficient approach for constructing these multiprocessor models. Training costs for proposed composable regression are 0.33x those for naively constructed models as the number of multiprocessor cores increases. (Section 4)

Collectively, this work establishes a rigorous foundation for efficiently extending advances in statistical machine learning for uniprocessor design into the multiprocessor domain.

2. Methodology and Background

We develop a scalable and efficient simulation paradigm for multiprocessor design that defines a comprehensive design space, simulates sparse samples from the space, and constructs regression models for performance prediction. We extend prior work for uniprocessor models to those for multiprocessors [6].

2.1. Regression Modeling

Microarchitecture designers often use performance models to predict a performance response y as a function of design parameter values ~x. Such models may be expressed as y = F(~x) where F(~x) may be a detailed, cycle-accurate simulator or an empirical model fit using simulated data [7]. While evaluating F(~x) using detailed simulation is the most widely taken approach, the significant computational costs of simulation often hinder the design process. These difficulties have led to computationally efficient surrogates for detailed simulators that predict the response as y = F(~x) + e, where F(~x) is an approximation of the simulator response and e is the approximation error.

We construct the surrogate F(~x) by deriving spline-based regression models from a sparse sampling of the design space [8], [6]. These models express the predicted response as y = F(~x) = α + ~β^T S(~x) + e where α is the model intercept, ~β is a vector of regression coefficients, and S(~x) is a spline function that produces piecewise polynomials. This spline transformation is applied to each xi in ~x by dividing the domain of xi into intervals joined at intersections called knots. Separate polynomials are fit to data within each interval to produce a piecewise polynomial. For example, a cubic spline on xi with three knots at a, b, and c is given by Equation (1) where (u)_+ = u if u > 0 and (u)_+ = 0 otherwise.

S(xi) = β0 + β1 xi + β2 xi^2 + β3 xi^3 + β4 (xi − a)^3_+ + β5 (xi − b)^3_+ + β6 (xi − c)^3_+    (1)

Within this framework, interactions between predictors of performance are captured by product terms using domain-specific knowledge. For example, pipeline depth x1 interacts with cache size x2 since depth determines pipeline sensitivity to cache misses. This interaction would be captured by a product term x3 = x1 x2. These interactions are applied after spline transformations. We prune the number of terms from an interaction S(xi)S(xj) by removing doubly non-linear terms (i.e., keep only product terms with at least one linear factor).
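The truncated power basis of Equation (1) and the pruning of doubly non-linear interaction terms can be sketched in a few lines of Python; the function names are illustrative, and the basis is shown unweighted (the β coefficients are fit separately by least squares):

```python
def cubic_spline_basis(x: float, knots: list[float]) -> list[float]:
    """Truncated power basis of Equation (1): 1, x, x^2, x^3, then
    (x - k)^3 for x > k (else 0) at each knot k."""
    return [1.0, x, x**2, x**3] + [max(x - k, 0.0) ** 3 for k in knots]

def pruned_interaction(sx: list[float], sy: list[float]) -> list[float]:
    """Product terms between two spline expansions, keeping only terms
    with at least one linear factor (doubly non-linear terms dropped).
    Index 1 holds the linear term of each expansion."""
    return [sx[1] * t for t in sy[1:]] + [t * sy[1] for t in sx[2:]]

basis = cubic_spline_basis(0.5, knots=[0.2, 0.4, 0.8])
assert len(basis) == 7          # 4 polynomial + 3 truncated terms
assert basis[-1] == 0.0         # x below the last knot contributes zero
```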


Digital Home
  1  audio       audio conversion [9]
  2  video       video compression [9]
  3  photo       photoshop album
Games
  4  unreal      Unreal Tournament
  5  halflife    Half-Life, modified Quake engine
  6  homeworld   Homeworld, three-dimensional movement
Multimedia
  7  mentalray   rendering, ray tracing
  8  painter     raster graphics package
  9  tachyon     ray tracing
Office
  10 outlook     personal information manager
  11 access      relational database management system
  12 excel       spreadsheet application
Productivity
  13 md2         OpenSSL cryptographic hash function
  14 encrypt     file encryption [9]
  15 flash       multimedia player
Server
  16 specweb     web server [10]
  17 tpcc        on-line transaction processing [11]
  18 specjapp    J2EE 1.3 application servers [12]

Table 1. Benchmarks

2.2. Simulation Framework

We use an execution-driven, cycle-accurate, micro-operation (uop) based IA-32 simulator that models a superscalar, out-of-order microprocessor belonging to the Intel Core microarchitectural family. The simulated multiprocessor implements a ring-based interconnect. Each node of the ring comprises a superscalar, out-of-order execution IA-32 core and private L1, L2 caches. Each node also contributes capacity to an L3 cache shared among all nodes using a snoopy coherence protocol. The simulator implements a detailed memory subsystem that fully models interconnect topologies and any associated contention. Both uniprocessor and multiprocessor models are validated against product hardware.

The simulated microprocessor executes inputs called Long Instruction Traces (LITs). A LIT is not actually a trace, but a checkpoint of processor and memory state used to initialize an execution-based performance simulator. A LIT also includes a list of injections, which are system interrupts needed to simulate events like direct memory accesses (DMA). Since the LIT includes a complete snapshot of system memory, we are able to simulate both user and kernel instructions, as well as wrong-path instructions. We report experimental results based on LITs of the benchmarks in Table 1. These benchmarks represent a broad range of application areas ranging from the digital home to the server.

2.3. Design Space

Table 2 defines a microprocessor design space with fifteen groups of parameters that characterize key microarchitectural structures. Parameters within each group are varied together to avoid fundamental design imbalances (e.g., the L/S buffer and reorder buffer should vary together in set S6). The Cartesian product of these sets, S = ∏_{i=1}^{15} Si, defines the overall design space of approximately 4.3 billion points. Using an effective spatial sampling technique [6], which decouples design space size from the number of simulations needed to identify design trends, we find 300 samples obtained uniformly at random from the design space are sufficient for accurate regression.

Set  Structure       Parameters               Range                 |Si|
S1   Branch Pred.    target buffer entries    256 :: 2x :: 1024      3
S2                   target buffer assoc      2 :: 2x :: 16          4
S3   Stride Pred.    table entries            128 :: 2x :: 512       3
S4   Loop Detector   table entries            4 :: 2+ :: 12          5
S5   Mem Disamb.     table entries            32 :: 2x :: 512        5
S6   Out-of-Order    L/S buffer entries       24 :: 8+ :: 96        10
                     reorder buffer entries   72 :: 12+ :: 180
                     reserv. stations         28 :: 4+ :: 64
S7   Fill Buffer     buffer entries           4 :: 2+ :: 12          5
S8   Data TLB        table entries            128 :: 2x :: 512       3
S9                   table assoc              1 :: 2x :: 8           4
S10  Inst Cache      victim cache entries     4 :: 2+ :: 12          5
S11  Data Cache      size (KB)                8 :: 2x :: 128         5
S12                  assoc                    2 :: 2x :: 16          4
S13  L2 Cache        size (KB)                256 :: 256+ :: 1024    4
S14                  assoc                    2 :: 2x :: 16          4
S15  L3 Cache        size (MB)                1.0 :: 0.5+ :: 3.0     5

Table 2. Design space parameters where i::j::k denotes a set of values from i to k in steps of j.

Uniprocessor models are the basic building blocks for our multiprocessor modeling framework. Regression models enable rapid design space exploration via fast, accurate predictions. Once constructed, model usage depends very much on the design process. While academic research might explore a large design space and consider each design without bias, industrial product development usually introduces innovative features to the previous product generation to improve particular design metrics (e.g., performance, power, area, cost).
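The Table 2 design space and the uniform random sampling described above can be sketched as follows; only three of the fifteen parameter sets are shown, and the names are illustrative:

```python
import random

def values(start, stop, step, mult=False):
    """Expand Table 2's i :: j :: k notation: multiplicative (2x)
    or additive (j+) steps from i to k inclusive."""
    out, v = [], start
    while v <= stop:
        out.append(v)
        v = v * step if mult else v + step
    return out

# A few of the fifteen parameter sets from Table 2 (others analogous).
design_space = {
    "btb_entries": values(256, 1024, 2, mult=True),  # S1: 3 values
    "ls_buffer":   values(24, 96, 8),                # S6: 10 values
    "l2_size_kb":  values(256, 1024, 256),           # S13: 4 values
}

def sample_designs(space, n, seed=0):
    """Draw n designs uniformly at random from the Cartesian product."""
    rng = random.Random(seed)
    return [{p: rng.choice(vals) for p, vals in space.items()}
            for _ in range(n)]

samples = sample_designs(design_space, 300)
assert len(samples) == 300
assert all(s["btb_entries"] in (256, 512, 1024) for s in samples)
```

Taking the product of all fifteen set sizes in Table 2 (3·4·3·5·5·10·5·3·4·5·5·4·4·4·5) gives 4,320,000,000 points, matching the "approximately 4.3 billion" figure.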

Microprocessor development is driven by regular, alternating advances in silicon and microarchitecture [13]. To ensure silicon advances translate into better user experiences, the microprocessor industry must produce evolutionary improvements across each product generation. These improvements might be delivered via fundamentally new algorithmic or policy features and/or a better allocation of microarchitectural resources. However, new features usually introduce subtle, unknown, or unintended interactions with existing microarchitectural resources, thereby complicating product design. For example, a new performance-enhancing feature might allow designers to meet a particular performance target with fewer resources, thereby saving power and area. Furthermore, designers might add several new features into a single product generation, each with potentially conflicting influences on a structure's optimal size.

Figure 1. Uniprocessor Model: Error distributions for models of design spaces built around two consecutive microprocessor generations ProcX (L) and ProcY (R).

We evaluate these scenarios through a case study, demonstrating our regression models can be effectively applied to the industrial, evolutionary design process. In particular, we consider two consecutive generations within the same IA-32 Core Microarchitecture family, which we refer to as ProcX and ProcY. ProcY is an evolutionary progression from ProcX and includes several new microarchitectural features. Aside from these new features, ProcX and ProcY share the same design space of Table 2. We use per-benchmark regression models to re-optimize and re-balance the design after these new features are added. Such a case study illustrates the importance of holistic design analysis in evolutionary product development.

2.4. Uniprocessor Regression Model Validation

We construct regression models for the uniprocessor design space using 300 randomly collected training samples. Figure 1 illustrates error distributions when validated against simulation for 50 randomly collected validation points. Sensitivity studies indicate this validation set is sufficiently large and error distributions do not change significantly with additional samples. These boxplots display location (median) and dispersion (interquartile range), identify possible outliers, and indicate the symmetry or skewness of the distribution.¹ These figures demonstrate median errors of 1.43 and 0.77 percent for ProcX and ProcY performance prediction, respectively. Maximum errors and statistical outliers rarely exceed 10 percent.

1. Boxplots are constructed by (1) horizontal lines at the median and at the upper and lower quartiles, (2) vertical lines drawn up/down from the upper/lower quartile to the most extreme data point within 1.5 IQR of the upper/lower quartile, where IQR is the interquartile range between the first and third quartiles, and (3) circles to denote outliers.
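Footnote 1's boxplot construction amounts to a few summary statistics; a minimal sketch follows, assuming the midpoint interpolation convention for quartiles (conventions vary across tools):

```python
def boxplot_stats(data):
    """Boxplot elements per footnote 1: median, quartiles, whiskers at
    the most extreme points within 1.5*IQR of the quartiles, outliers.
    (Quartile interpolation convention is an assumption.)"""
    xs = sorted(data)
    def quantile(q):
        # Linear interpolation between order statistics.
        i = q * (len(xs) - 1)
        lo, hi = int(i), min(int(i) + 1, len(xs) - 1)
        return xs[lo] + (i - lo) * (xs[hi] - xs[lo])
    q1, med, q3 = quantile(0.25), quantile(0.5), quantile(0.75)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    outliers = [x for x in xs if x < lo_fence or x > hi_fence]
    return {"median": med, "q1": q1, "q3": q3,
            "whiskers": (inside[0], inside[-1]), "outliers": outliers}

stats = boxplot_stats([1.2, 1.4, 1.5, 1.6, 1.8, 9.5])
assert stats["outliers"] == [9.5]   # beyond 1.5*IQR above Q3
```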

2.5. Case Study: Evolutionary Design Optimization

While prior academic work has demonstrated regression models can be effective for designing microprocessors from a blank slate [14], industry is more likely to implement evolutionary enhancements by adding new microarchitectural enhancing features to existing designs, thereby extending a processor family. For example, Intel's Yonah, Dothan, Merom, and Penryn are four consecutive evolutionary generations of the Intel Core Microarchitecture family. In such a design scenario, regression models enable a comprehensive re-balancing of the microprocessor design. Re-balancing is the process of taking optimal design values from one generation, accounting for new features in the next generation by constructing new models, and determining how previously optimal values change given these new features. For this study, we evaluate three categories of microarchitectural enhancements:

• Improved Front-End: An improved front-end will increase fetch throughput and the number of in-flight instructions. Improved branch prediction might be delivered by larger history tables, thereby increasing resource requirements. Alternatively, better prediction schemes might make better use of fewer resources, enabling designers to reduce resource sizes and save power with negligible performance penalties.

• Improved Memory Subsystem: Improved cache line replacement or prefetch policies will impact cache hierarchy design. For example, improved loop detection and memory stride prediction may enable more effective use of L1 instruction and data caches, respectively.

• Improved Out-of-Order Execute: Improved execution units will impact queue and buffer occupancy throughout the pipeline. For example, improved memory disambiguation will reduce the average time a load must wait before issuing. Such a feature enhancement may reduce average occupancy of the memory order buffer while improving instruction throughput.

Figure 2. BTB (L) and L1 data cache (R) design values that meet various delay targets from ProcX (without new features) and ProcY (with new features). Note log-scaled x-axis.

Figure 3. L3 cache (L) and ROB (R) design values that meet various delay targets from ProcX (without new features) and ProcY (with new features). Note log-scaled x-axis for L3 cache.

Suppose designers are given a range of delay targets, defined as CPI numbers relative to ProcX baseline performance.² To re-balance the design after introducing new microarchitectural features, we explore the two design spaces of ProcX and ProcY, identifying designs that satisfy particular targets before and after the addition of new features. Limited sensitivity studies are insufficient since they do not identify specific combinations of design values that meet a particular target. For example, one-dimensional sensitivity studies of L1, L2 and L3 cache sizes cannot identify all combinations of cache sizes that meet a delay target and, furthermore, they cannot identify cases where one parameter value increases and another decreases with zero net performance impact.

2. Designers often identify targets for performance improvements from one generation to the next.

Instead of one-dimensional sensitivity, we use regression models for more comprehensive optimization, first identifying all parameter value combinations that meet each CPI delay target and then assessing salient trends within these combinations. Since regression models may identify multiple designs that satisfy a given CPI delay target, we use boxplots to show the range of parameter values represented in these designs. Figures 2–3 illustrate the impact of new microarchitectural features on design optimization. Parameter values are shown on the x-axis and CPI delay targets are shown on the y-axis.
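The optimization procedure above reduces to evaluating the surrogate over candidate designs, filtering by target, and summarizing the surviving parameter values. A hedged sketch, where `predicted_cpi` stands in for a fitted regression model and the toy surrogate is purely illustrative:

```python
def designs_meeting_target(designs, predicted_cpi, target):
    """Keep designs whose predicted relative CPI meets the delay target.
    `predicted_cpi` is a placeholder for a fitted regression surrogate."""
    return [d for d in designs if predicted_cpi(d) <= target]

def parameter_range(designs, param):
    """(min, median, max) of a parameter among qualifying designs;
    this is the quantity the boxplots in Figures 2-3 summarize."""
    vals = sorted(d[param] for d in designs)
    return vals[0], vals[len(vals) // 2], vals[-1]

# Toy surrogate (assumption): CPI improves with L1 size, for illustration.
designs = [{"l1_kb": s} for s in (8, 16, 32, 64, 128)]
cpi = lambda d: 1.0 - 0.002 * d["l1_kb"]
ok = designs_meeting_target(designs, cpi, target=0.90)
assert [d["l1_kb"] for d in ok] == [64, 128]
```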

Figure 2L illustrates reductions in branch target buffer (BTB) associativity after new microarchitectural features are added to the front-end (e.g., a better branch predictor). Of the designs meeting a delay target of 0.90 (shaded in green), median associativities for ProcX and ProcY designs are 16 and 4, respectively. For every delay target, ProcY designs meet the same target as ProcX designs but with fewer BTB ways. Fewer BTB ways translate into fewer buffer entries, reduced comparator logic, and potential power savings for a given delay target. These savings are realized only by re-optimizing the design after adding a front-end enhancing feature, such as a better branch predictor.

Similarly, Figure 2R illustrates potential reductions in the L1 data cache size due to an improved back-end (e.g., better prefetch). For various delay targets, the median cache size for ProcY designs is consistently half the median size for ProcX designs (e.g., 32 KB versus 64 KB for the 0.90 target; shaded in green). These results suggest memory-enhancing features, such as more intelligent prefetching, might allow designers to attain the same performance despite a smaller cache. Furthermore, more effective L1 cache usage will likely impact the whole cache hierarchy as shown by similar trends for the L3 cache in Figure 3L. Thus, designers might meet a given delay target using fewer resources, thereby saving power, by re-optimizing the cache hierarchy when more efficient cache policies are added.

Figure 3R considers a scenario for the reorder buffer (ROB) where several new microarchitectural features might exert competing influences. A more effective front-end (e.g., branch predictor) might increase the number of in-flight instructions, potentially increasing the optimum ROB size. However, a more effective memory subsystem (e.g., prefetcher) might reduce the number of long-latency memory stalls as loads hit in the L1 data cache more often. Fewer memory stalls increase instruction commit throughput, potentially decreasing the optimum ROB size. Figure 3R shows ProcY designs meet delay targets with much lower ROB occupancy than ProcX designs, suggesting improvements to the memory subsystem dominate and recommending a net reduction in ROB size. Relying solely on intuition, the net impact on ROB size is not obvious. Regression models, however, allow us to explore the design space and account for interactions between new microarchitectural features and various resources.

3. Composable Performance Regression

Unique challenges in multiprocessor simulation for multiprogrammed workloads prevent us from naively applying statistical modeling methodologies previously found effective in the uniprocessor domain [2], [3], [5], [6]. Simulating sampled training designs becomes significantly more expensive when these samples are drawn from the multiprocessor design space. Furthermore, deriving separate models for each combination of workloads, each with its own set of training data, is intractable due to rapid combinatorial growth in the number of possible multiprogrammed workloads. To address these fundamental challenges, we take the canonical approach of using the uniprocessor core as the primary determinant of multiprocessor performance and consider shared resource contention as a secondary penalizing effect.

Figure 4. CPR overview of components, interactions.

3.1. Overview

Composable regression addresses the unique challenges of multiprocessor performance modeling as summarized in Figure 4. Let ~x specify the vector of parameter values characterizing a design and ~xs ⊂ ~x specify the sub-vector of these design parameters that characterize resources shared between processors. Consider a set of benchmarks B = {B1, . . . , Bn} executing on an n-core multiprocessor. The framework iteratively predicts the performance of each benchmark in B. Consider iteration i where we take Bi as the benchmark executing on the core of interest while simultaneously contending with the other benchmarks B≠i = {B1, . . . , Bi−1, Bi+1, . . . , Bn}. As shown in Figure 4, this framework predicts the baseline uniprocessor performance of Bi executing on design ~x. In parallel, the framework predicts contention indicators for Bi when it contends with B≠i for shared resources ~xs. A penalty model computes a linear combination of baseline performance and contention indicators to estimate the performance of Bi when contending with B≠i benchmarks in a symmetric multiprocessor with n cores of design ~x.
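One iteration of the composition in Figure 4 can be sketched as follows; all model functions and coefficient values here are illustrative placeholders, not the fitted models from this paper:

```python
def cpr_predict(x, xs, i, uni_models, con_models, penalty_coeffs):
    """One CPR iteration (Figure 4): compose baseline uniprocessor delay
    and contention indicators through a linear penalty model.
    Models and coefficients are placeholders for the fitted versions."""
    d_uni = uni_models[i](x)    # baseline CPI: D_i(x) = f_i(x)
    c = con_models[i](xs)       # indicators: (d-L1 misses, L2 hits, L3 hits)
    b0, b_uni, *b_con = penalty_coeffs[i]
    # Linear combination of baseline performance and contention indicators.
    return b0 + b_uni * d_uni + sum(b * ci for b, ci in zip(b_con, c))

# Toy instantiation for a single core of interest (i = 0).
uni = {0: lambda x: 1.10}                  # predicted baseline CPI
con = {0: lambda xs: (0.02, 0.30, 0.10)}   # predicted contention indicators
coef = {0: (0.05, 1.0, 4.0, -0.1, -0.2)}   # hypothetical penalty weights
cpi = cpr_predict(x={}, xs={}, i=0, uni_models=uni,
                  con_models=con, penalty_coeffs=coef)
assert abs(cpi - 1.18) < 1e-9
```

Iterating this prediction over i = 1, . . . , n yields a per-core performance estimate for the whole multiprogrammed workload.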

3.2. Training

Figure 5 illustrates uniprocessor model construction. Thismodel requires Nuni samples from the full design space.A uniprocessor simulator takes these inputs to provide ob-served CPI delay values Di(~x) for a particular benchmarkBi. These delay values are fit to the design values ~x with

Page 7: CPR: Composable Performance Regression for Scalable Multiprocessor … · CPR: Composable Performance Regression for Scalable Multiprocessor Models Benjamin C. Lee y Microsoft Research

Figure 5. Uniprocessor training with spline-based re-gression on full parameter space.

Figure 6. Contention training with spline-based regres-sion on shared resource space.

restricted cubic splines (i.e., piecewise cubic polynomials denoted by transformation S), producing a surrogate model for the uniprocessor simulator. This surrogate is the same model discussed and applied in Section 2.3. The model uses three knots for each design parameter of Table 2. Interactions are specified between cache sizes in adjacent levels of the memory hierarchy. The resulting model estimates the uniprocessor CPI delay of benchmark Bi as a function of transformed design values: Di(~x) = fi(~x).
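As a concrete sketch, a restricted cubic spline regression of this kind can be fit with ordinary least squares over a truncated-power basis. The basis below follows Harrell's standard construction; the single design parameter, knot placement, and synthetic CPI response are illustrative stand-ins, not the paper's Table 2 parameters or simulator data, and the interaction terms between adjacent cache levels are omitted for brevity.

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis (truncated-power form): linear
    beyond the outer knots; k knots yield k-1 columns per predictor."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    cube = lambda z: np.maximum(z, 0.0) ** 3
    cols = [x]  # the linear term
    for j in range(k - 2):
        cols.append(cube(x - t[j])
                    - cube(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                    + cube(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
    return np.column_stack(cols)

# Synthetic stand-in for Nuni = 300 sampled designs of one parameter
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 300)
cpi = 1.0 + 0.3 * np.sqrt(x) + rng.normal(0.0, 0.01, x.size)  # fake "simulator" output

# Three knots per parameter, as in the uniprocessor model
X = np.column_stack([np.ones_like(x), rcs_basis(x, knots=[1.0, 5.0, 9.0])])
beta, *_ = np.linalg.lstsq(X, cpi, rcond=None)  # fitted surrogate f_i
```

With three knots, each predictor contributes two basis columns (one linear, one nonlinear), which is why spline transformations multiply the term count relative to plain linear regression.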

Figure 6 illustrates contention model construction. This model requires Ncon samples from the space of shared resources ~xs, a subset of the full design space. Ncon < Nuni since the contention model considers only a subset of the full parameter space and thus produces a smaller model requiring fewer training simulations. Although cores share only the L3 cache, we consider the L1 data, L2, and L3 caches in ~xs since the design of higher cache levels determines L3 cache traffic. An n-core multiprocessor simulator takes these shared design values to provide observed contention indicators Ci(~xs | B) for a particular combination of benchmarks B. Although we use cycle-accurate simulation to produce these contention indicators, trace simulation may be sufficient (e.g., cache simulation), further reducing contention model training costs

Figure 7. Penalty training with linear regression on fullparameter space.

relative to uniprocessor training costs.

We define Ci(~xs | B) as a vector of three contention

indicators: d-L1 misses, L2 hits, and L3 hits. We find accurate contention models are more easily constructed for the three metrics we choose. Intuitively, these metrics are suitable contention indicators since they correlate with average memory access latencies and the memory intensity of the workloads. These contention indicators are fit to design values of shared resources ~xs with restricted cubic splines, producing a surrogate for multiprocessor simulations. The regression model uses four knots for the d-L1 and L2 caches and five knots for the L3 cache. Additional spline flexibility is assigned to the L3 cache due to its significance as the primary point of contention in our simulated multiprocessor. Interactions are specified between cache sizes in adjacent levels of the memory hierarchy. Such interactions capture locality effects and are necessary for inclusive caches. The resulting model estimates multiprocessor contention seen by benchmark Bi executing with B≠i as a function of shared design values: Ci(~xs | B) = gi(~xs | B).

Figure 7 illustrates penalty model construction. This model requires Npen samples from the full design space ~x. A multiprocessor simulator takes these inputs to provide observed multiprocessor delay values Di(~x | B) for a particular benchmark Bi executing with B≠i. Unlike the uniprocessor and contention models, which are constructed using simulation and design parameter values, the penalty model is constructed using simulation and outputs from the other two models. Specifically, the multiprocessor delay values are fit to predictions from the uniprocessor and contention models (i.e., Di(~x) and Ci(~xs | B), respectively). The resulting regression model predicts the delay of benchmark Bi when executing with other benchmarks in B as a function of design parameters. This prediction is made via a uniprocessor delay model fi(~x) and multiprocessor contention model gi(~xs | B). The overall result is a composed regression model: Di(~x | B) = hi(fi(~x), gi(~xs | B) | B).
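Because the penalty model is a plain linear regression on four predictors, its fit can be sketched in a few lines. The delay values and contention indicators below are synthetic stand-ins for the model outputs and simulator observations, and the coefficients are arbitrary, not fitted values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N_pen = 50  # penalty-model sample count used in Section 4.1

# Hypothetical model outputs for benchmark Bi at 50 sampled designs:
# uniprocessor delay predictions D_i(x) and the three contention
# indicators C_i(xs | B) = (d-L1 misses, L2 hits, L3 hits)
D = rng.uniform(0.8, 2.0, N_pen)
C = rng.uniform(0.0, 1.0, (N_pen, 3))

# Synthetic multiprocessor delays standing in for simulator observations
D_mp = 0.05 + 1.1 * D + 0.4 * C[:, 0] - 0.1 * C[:, 1] - 0.2 * C[:, 2]

# Penalty model h_i: ordinary linear regression on the four predictors
X = np.column_stack([np.ones(N_pen), D, C])
beta, *_ = np.linalg.lstsq(X, D_mp, rcond=None)
```

Five coefficients (an intercept plus four predictors) is why 50 training samples suffice, in contrast to the spline-transformed fifteen-predictor uniprocessor model.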


In contrast to uniprocessor and contention models, the penalty model is less complex and, therefore, less expensive to construct. Quantifying model complexity by the number of terms in the model (i.e., the number of regression coefficients that must be fitted), the penalty model has complexity advantages from taking fewer predictors and not performing spline transformations. The penalty model takes only four predictors: baseline uniprocessor performance Di(~x) and three contention metrics Ci(~xs | B): d-L1 misses, L2 hits, and L3 hits. In contrast, the uniprocessor model takes fifteen predictors, one for each design parameter in Table 2. Furthermore, spline transformations are applied to each of these fifteen predictors, which significantly increases the number of terms in the uniprocessor model as shown in Equation (1). Thus, composable regression models package uniprocessor performance and multiprocessor contention information for a fifteen-element design vector ~x into a more compact four-element vector {Di(~x), Ci(~xs | B)} for more efficient, complexity-effective linear regression. The efficiency and complexity advantages translate into fewer multiprocessor simulations required for model construction. A smaller model that fits fewer coefficients will require less data for training.

3.3. Prediction

Consider an n-core symmetric multiprocessor executing benchmarks B = {B1, . . . , Bn}. Given the three components of the composable regression model, we iteratively evaluate the performance of each core i for design ~x.

1) Uniprocessor Prediction: Evaluate uniprocessor performance of Bi executing for design ~x. Compute Di(~x) = fi(~x).

2) Contention Prediction: Evaluate multiprocessor contention of Bi executing with B≠i on a symmetric multiprocessor for a particular configuration of shared resources ~xs. Compute Ci(~xs | B) = gi(~xs | B).

3) Multiprocessor Prediction: Evaluate multiprocessor performance of Bi executing with B≠i on a symmetric multiprocessor for core designs ~x. Compute Di(~x | B) = hi(fi(~x), gi(~xs | B) | B).

This process is repeated for every benchmark executing on the multiprocessor to get per-core CPI delays.
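The three steps above can be sketched as a loop that composes the fitted models. The callables `f`, `g`, and `h` below are trivial hypothetical stand-ins for fi, gi, and hi, not actual fitted surrogates.

```python
def predict_multiprocessor(benchmarks, x, xs, f, g, h):
    """Return per-core CPI delays for benchmarks B on design x.

    f[i](x)        -> uniprocessor delay D_i(x)
    g[i](xs)       -> contention indicators C_i(xs | B), a 3-vector
    h[i](D_i, C_i) -> composed multiprocessor delay D_i(x | B)
    """
    delays = []
    for i, _ in enumerate(benchmarks):
        d_uni = f[i](x)                # step 1: uniprocessor prediction
        c = g[i](xs)                   # step 2: contention prediction
        delays.append(h[i](d_uni, c))  # step 3: composed prediction
    return delays

# Toy dual-core example with trivial hypothetical models
f = [lambda x: 1.2, lambda x: 1.5]
g = [lambda xs: (0.1, 0.3, 0.2), lambda xs: (0.2, 0.1, 0.4)]
h = [lambda d, c: d + sum(c)] * 2
cpis = predict_multiprocessor(["B1", "B2"], x=None, xs=None, f=f, g=g, h=h)
```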

4. Multiprocessor Model Evaluation

We illustrate the effectiveness of composable regression models, showing these models accurately estimate multiprocessor performance. Furthermore, model construction costs are scalable in contrast to intractable exhaustive simulation and inefficient naive regression.

Dual-Core Benchmarks
Set  (1)        (2)
1    painter    homeworld
2    access     mentalray
3    specjapp   specweb
4    homeworld  tachyon
5    dense      flash

Quad-Core Benchmarks
Set  (1)      (2)        (3)       (4)
1    dense    excel      flash     md2
2    video    specjapp   specweb   tachyon
3    excel    homeworld  audio     unreal
4    outlook  encrypt    halflife  homeworld
5    painter  mentalray  outlook   encrypt

Table 3. Multicore Benchmarks

4.1. Accuracy Analysis

We validate each component in the composable regression framework against detailed simulation before evaluating its overall accuracy. We construct each model by sparsely simulating points sampled uniformly at random from the design space. We train these three model components with Nuni = 300 and Ncon = Npen = 50. We separately simulate another 50 randomly sampled points for validation. This is a sufficiently large validation set; further increasing the validation set size does not significantly change the error distribution. We evaluate these validation points for five sets of benchmark combinations sampled uniformly at random from the space of possible combinations, as shown in Table 3. We validate against detailed multicore simulators used by product teams. The product development infrastructure used in this study scales up to four cores.

Figure 1R in Section 2.3 illustrates uniprocessor model accuracy. We build the composable framework to model microprocessor cores from the ProcY design space. As described in Section 2.3, these models achieve median errors of 0.77 percent relative to detailed simulation. In the context of our composable framework, these errors refer to the relative difference between simulation and the uniprocessor regression model fi(~x).

Figures 8–9 illustrate the error distributions for contention indicators predicted by the model gi(~xs | B). Figure 8 and Figure 9L present errors for the dual-core case. Contention indicators are predicted across all benchmarks with median errors of 3.17, 2.46, and 1.30 percent for d-L1 misses, L2 hits, and L3 hits, respectively. Median errors for each benchmark are usually below 5 percent and always below 10 percent. Figure 9R presents L3 hit prediction errors for the quad-core case and is representative of error distributions for other quad-core contention indicators. The median error across all benchmarks is 3.06 percent, with most benchmarks reporting median errors less than 10 percent. Although the error distributions for quad-core L3 cache predictions exhibit greater variance (i.e., spread), these errors do not propagate into our final multiprocessor performance errors. Overall, these models are


Figure 8. Error distributions for contention models predicting dual-core d-L1 cache misses (L), L2 cache hits (R).

Figure 9. Error distributions for contention models predicting dual-core L3 cache hits (L), quad-core L3 cache hits (R).

Figure 10. Error distributions for dual-core (L) and quad-core (R). Models predict per core CPI.


sufficiently accurate for early-stage design optimization.

The third benchmark set (excel-homeworld-audio-unreal)

exhibits particularly high error rates for quad-core L3 hits with median errors of 15.29 percent across the four benchmarks. However, we did not find any systematic errors to account for the difference. We might hypothesize that the third benchmark set more significantly exercises the shared L3 such that the model becomes less accurate as contention increases. However, we did not find any evidence to support this hypothesis. The second set has more accurate models despite accessing the shared L3 more frequently and exhibiting greater contention. Similarly, the first and fourth sets exhibit L2 miss rates comparable to the third set. However, both the first and fourth sets have more accurate contention models despite greater L2 miss rates that lead to shared L3 accesses. Overall, the third set is more an exception than the general case, and we do not find any evidence to suggest its greater model error is due to greater L3 contention.

Figure 10 illustrates the error distributions for multiprocessor delays hi(fi(~x), gi(~xs | B) | B) for dual- and quad-core benchmark sets. The dual-core model accurately predicts the performance of each core with median errors of 6.63 percent across all benchmarks. Median errors for specific benchmark sets range from 1.6 percent (fourth set: homeworld-tachyon) to 11.1 percent (third set: specjapp-specweb). Similarly, the quad-core model predicts the performance of each core with median errors of 4.83 percent across all benchmarks. Median errors for specific benchmark sets range from 1.31 percent (fifth set: painter-mentalray-outlook-encrypt) to 18.62 percent (first set: dense-excel-flash-md2).

Note that larger quad-core contention errors for the third benchmark set in Figure 9R do not necessarily translate into larger quad-core CPI errors in Figure 10R. The penalty model that produces the final CPI prediction is likely robust to errors from the uniprocessor or contention models since the linear regression fits multiprocessor performance to the provided predictions regardless of their errors. If fit well, the penalty model might mask the intermediate uniprocessor or contention errors. However, a good fit becomes less likely as such errors increase.

Figure 10R reveals larger errors for the first quad-core benchmark set (dense-excel-flash-md2). This set is notable for its extremely low cache activity; other benchmark sets access all levels of the hierarchy more frequently. Specifically, on average, the other four benchmark sets have 2.1x, 1.5x, and 1.7x more d-L1 misses, L2 hits, and L3 hits, respectively. Due to such low activity, the penalty model is fit poorly for the first benchmark set. Significance testing on the three contention indicators suggests d-L1 misses, L2 hits, and L3 hits do not contribute much to model accuracy for the first quad-core benchmark set, probably because these indicator values are small in absolute terms due to low cache activity.

4.2. Scalability Analysis

CPR accurately estimates the performance of each core in a symmetric multiprocessor by leveraging inexpensively constructed uniprocessor models. We make the case for the efficiency of composable multiprocessor modeling with an analytical study of regression training costs expressed in terms of simulation time, showing CPR incurs low marginal training costs as core count increases.

As in Section 3, let Nuni, Ncon, and Npen be the number of design samples required to train the uniprocessor, contention, and penalty models, respectively. Let Tn be the time required for a single detailed n-core simulation. Similarly, let Mn be the time required to train an n-core model. Equations (2)–(3) express model construction time in terms of simulation costs. The best case scenario is captured by a lower bound Mn(lo), in which we assume all uniprocessor models have already been created and the marginal cost of multiprocessor modeling is the time to construct the contention and penalty models. The worst case scenario is captured by an upper bound Mn(up), in which we must also construct each of the n uniprocessor models. The best case scenario is more likely since architects are unlikely to study multiprocessors without first understanding uniprocessors with uniprocessor regression models for benchmarks of interest.

Mn(lo) = (Ncon + Npen) · Tn    (2)

Mn(up) = Nuni · n · T1 + (Ncon + Npen) · Tn    (3)

Mn(nv) = Nnaive · Tn    (4)

For comparison, we consider naive regression modeling that implements, without modification, prior uniprocessor approaches for multiprocessor analysis [14]. For an n-core multiprocessor, this naive approach samples Nnaive points from the multiprocessor design space, evaluates each sample with n-core simulation, and fits a spline-based regression model to this training data. The cost of this approach is expressed as Mn(nv) in Equation (4).

Tn = n^γ · T1 expresses multiprocessor simulation times in terms of uniprocessor simulation times, where γ is a growth factor. In the ideal case, γ = 1 and simulation times scale linearly with the number of cores. However, for our product simulators, γ > 1 due to the additional complexity of coherence and synchronization modeling, which adds overhead to each cycle simulated, and shared resource contention, which increases the number of cycles simulated for each benchmark. Such simulation time scaling for various growth factors is representative of that observed in practice from simulation wall clock times. Superlinearly increasing multiprocessor simulation times have also been observed for other industrial and academic simulators [1].
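The quoted scaling factors follow directly from Tn = n^γ · T1; a quick check (with T1 normalized to 1):

```python
def t_n(n, gamma, t1=1.0):
    """Multiprocessor simulation time T_n = n**gamma * T_1."""
    return n ** gamma * t1

# The examples quoted in the text, rounded to one decimal place
print(round(t_n(4, 1.2), 1), round(t_n(8, 1.2), 1))  # 5.3 12.1
print(round(t_n(4, 1.4), 1), round(t_n(8, 1.4), 1))  # 7.0 18.4
```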


Figure 11. Modeling training costs for γ = 1.2 (L) and γ = 1.4 (R).

These growth factors are dependent on particular simulator implementations, but we take growth factors of 1.2 and 1.4 to model varying workload behavior across multiprogrammed combinations or varying degrees of multiprocessor simulation efficiency. In practice, we observe 1.2 < γ < 1.4 for our product simulators. For example, T4 ≈ 5.3T1 and T8 ≈ 12.1T1 in a representative γ = 1.2 simulator. These costs increase significantly for the more inefficient γ = 1.4 simulator, where T4 ≈ 7.0T1 and T8 ≈ 18.4T1.

Figure 11 illustrates the number of required simulations to train composable regression models for two growth factors: γ = 1.2 and γ = 1.4. Costs are expressed in terms of the number of uniprocessor simulations Mn/T1. Trends are illustrated for Nuni = Nnaive = 300, Ncon = Npen = 50. Nuni = Nnaive conservatively assumes naive multiprocessor models and uniprocessor models can be fit with the same number of training samples. In practice, we expect Nnaive ≥ Nuni since the multiprocessor design space is at least as complex as the uniprocessor design space.

Mn(lo) / Mn(nv) = ((Ncon + Npen) · n^γ) / (Nnaive · n^γ) = (Ncon + Npen) / Nuni    (5)

Mn(up) / Mn(nv) = (Nuni · n + (Ncon + Npen) · n^γ) / (Nnaive · n^γ)    (6)
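Equations (2)–(6) can be evaluated directly; a small sketch with the sample counts used in this study, confirming the 0.33x best-case ratio:

```python
# Training costs from Equations (2)-(4), with the paper's sample counts
N_uni, N_con, N_pen, N_naive = 300, 50, 50, 300

def costs(n, gamma, t1=1.0):
    tn = n ** gamma * t1
    m_lo = (N_con + N_pen) * tn                    # Equation (2): lower bound
    m_up = N_uni * n * t1 + (N_con + N_pen) * tn   # Equation (3): upper bound
    m_nv = N_naive * tn                            # Equation (4): naive regression
    return m_lo, m_up, m_nv

m_lo, m_up, m_nv = costs(n=4, gamma=1.2)
print(round(m_lo / m_nv, 2))  # 0.33, independent of n and gamma (Equation (5))
```

Note that the best-case ratio in Equation (5) is constant because the n^γ factor cancels, while the worst-case ratio in Equation (6) improves as n or γ grows.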

Numbers tracking the composable regression lines in Figure 11 indicate the cost advantage relative to naive regression as computed by Equations (5)–(6). Figure 11 illustrates training costs with superlinear increases in simulation time (γ = 1.2, γ = 1.4). For both values of γ, composable regression incurs 0.33x the costs of naive regression in the best case (green circles versus blue up-triangles). This best case assumes a library of uniprocessor regression models has already been constructed for benchmarks of interest.

According to Equation (5), composable regression costs will always be 0.33x those for naive regression in the best case for our particular combination of Nuni = 300, Ncon = 50, and Npen = 50.

However, in the worst case, every composable regression model first requires the construction of uniprocessor models. For γ = 1.2, these worst case costs for composable models are comparable to those for naive regression (green circles versus red down-triangles). However, for γ = 1.4, composable regression is much less expensive than naive regression in all scenarios with at least four cores. Composable regression incurs 0.66x to 0.91x the costs of naive regression for any model with more than two cores. Thus, even worst case costs of composable regression are lower than those for naive regression. This cost advantage will further improve for less scalable multiprocessor simulators (increasing γ) and larger multiprocessors (increasing n).

Although we present both upper and lower bounds, simulation costs for regression modeling will asymptotically approach the lower bound over time. Even if designers start in the worst case scenario where no uniprocessor models exist, they will accumulate these models as they perform design space studies. Over time, these accumulated models will likely span the space of uniprocessor workloads, leaving only the marginal costs of constructing contention and penalty models, the scenario captured by our lower bound.

5. Related Work

Li, et al., decouple core and cache simulations in chip multiprocessors [15]. The core simulator generates single-core traces of L2 cache accesses that are annotated with timestamps. These traces feed a cache simulator to model interactions within shared caches. In contrast, we rely on


regression models to estimate core and cache effects. Furthermore, we pass between models a few key contention metrics instead of detailed traces.

Ipek, et al., and Joseph, et al., separately predict the performance of design spaces with automated artificial neural networks (ANN) trained by gradient descent and predicted by nested weighted sums [3], [4]. Joseph, et al., did not consider the multiprocessor design space. Ipek, et al., train neural networks for the multiprocessor design space by sampling and simulating multiprocessor designs. In contrast, we construct composable models to reduce training simulation costs by relying on a combination of training data from uniprocessor, shared resource, and multiprocessor simulations. Instead of neural networks, we use spline-based regression models from prior work by Lee and Brooks [6], [14]. Relative to neural networks, regression may require greater statistical analysis during construction, but may be more computationally efficient, numerically solving and evaluating linear systems for training and prediction, respectively.

Dubach, et al., reduce the marginal cost of building neural networks by expressing a new benchmark's performance as a linear combination of performance from already modeled benchmarks [2]. The linear model, relying on performance trends captured by previously constructed networks, is less expensive to construct than a full-fledged neural network trained from scratch. Dubach, et al., implement this approach for serial workloads and do not consider multiprocessors. Their proposed approach is orthogonal to ours, and combining the two approaches is an avenue for future work.

Khan, et al., implement models for thread-parallel SPLASH workloads and speculatively parallelized SPEC workloads [5] for a four-core symmetric multiprocessor. Although the authors attempt to control training costs with techniques similar to those by Dubach, et al., they rely solely on multiprocessor simulations to train neural networks. In contrast, we implement composable models, reducing the reliance on expensive multiprocessor simulations while leveraging inexpensive uniprocessor and shared resource simulations. Furthermore, we extend the state-of-the-art in predictive modeling to an industrial simulator infrastructure with a broad range of workloads. In contrast to prior work, we consider multiprogrammed workloads and propose a framework to handle combinatorial workload complexity, which is not present in thread-level parallel workloads.

6. Conclusions

Recent advances in applying statistical regression to microprocessor studies have enabled fundamentally new capabilities in uniprocessor design space exploration. To extend these achievements to the multiprocessor domain, we propose composable regression models that combine uniprocessor, contention, and penalty models to accurately predict multiprocessor performance running multiprogrammed workloads with median errors of 6.63 and 4.83 percent for dual- and quad-core predictions. Furthermore, composable regression is scalable, requiring 0.33x the simulations of naive regression for model construction. Collectively, this work establishes a rigorous foundation for large-scale multiprocessor analysis.

Acknowledgment

This work is supported by NSF grant CCF-0048313 (CAREER), Intel, and IBM. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, Intel, or IBM.

References

[1] J. Donald and M. Martonosi, “An efficient, practical parallelization methodology for multicore architecture simulation,” IEEE Computer Architecture Letters, August 2006.

[2] C. Dubach, T. Jones, and M. O’Boyle, “Microarchitectural design space exploration using an architecture-centric approach,” in MICRO: International Symposium on Microarchitecture, December 2007.

[3] E. Ipek, S. A. McKee, B. de Supinski, M. Schulz, and R. Caruana, “Efficiently exploring architectural design spaces via predictive modeling,” in ASPLOS: Architectural Support for Programming Languages and Operating Systems, October 2006.

[4] P. Joseph, K. Vaswani, and M. J. Thazhuthaveetil, “A predictive performance model for superscalar processors,” in MICRO: International Symposium on Microarchitecture, December 2006.

[5] S. Khan, P. Xekalakis, J. Cavazos, and M. Cintra, “Using predictive modeling for cross-program design space exploration in multicore systems,” in PACT: International Conference on Parallel Architectures and Compilation Techniques, September 2007.

[6] B. Lee and D. Brooks, “Accurate and efficient regression modeling for microarchitectural performance and power prediction,” in ASPLOS: International Conference on Architectural Support for Programming Languages and Operating Systems, October 2006.

[7] S. Duvall, “Statistical circuit modeling and optimization,” in 5th International Workshop on Statistical Metrology, June 2000.

[8] F. Harrell, Regression modeling strategies. Springer, 2001.

[9] PCMark04, Futuremark Corporation.

[10] SPECweb99, Standard Performance Evaluation Corporation.

[11] TPC-C v5, Transaction Processing Performance Council.

[12] JAppServer2004, Standard Performance Evaluation Corporation.

[13] S. Shenoy and A. Daniel, “Intel architecture and silicon cadence: The catalyst for industry innovation,” Technology at Intel Magazine, October 2006.

[14] B. Lee and D. Brooks, “Illustrative design space studies with microarchitectural regression models,” in HPCA: International Symposium on High-Performance Computer Architecture, February 2007.

[15] Y. Li, B. Lee, D. Brooks, Z. Hu, and K. Skadron, “CMP design space exploration subject to physical constraints,” in HPCA: International Symposium on High-Performance Computer Architecture, February 2006.