Proceedings of the Workshop on Binary Instrumentation and Applications 2009


    The Association for Computing Machinery

    2 Penn Plaza, Suite 701

New York, New York 10121-0701

    ACM COPYRIGHT NOTICE.

    Copyright 2009 by the Association for Computing Machinery, Inc. Permission to

    make digital or hard copies of part or all of this work for personal or classroom use

    is granted without fee provided that copies are not made or distributed for profit or

    commercial advantage and that copies bear this notice and the full citation on the

    first page. Copyrights for components of this work owned by others than ACM

    must be honored. Abstracting with credit is permitted. To copy otherwise, to

    republish, to post on servers, or to redistribute to lists, requires prior specific

    permission and/or a fee. Request permissions from Publications Dept., ACM, Inc.,

fax +1 (212) 869-0481, or permissions@acm.org.

    For other copying of articles that carry a code at the bottom of the first or last page,

    copying is permitted provided that the per-copy fee indicated in the code is paid through

the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, +1-978-750-8400, +1-978-750-4470 (fax).

    Notice to Past Authors of ACM-Published Articles

    ACM intends to create a complete electronic archive of all articles and/or other material

    previously published by ACM. If you have written a work that was previously published

    by ACM in any journal or conference proceedings prior to 1978, or any SIG Newsletter

at any time, and you do NOT want this work to appear in the ACM Digital Library, please inform permissions@acm.org, stating the title of the work, the author(s), and

    where and when published.

    ACM ISBN: 978-1-60558-793-6


    FOREWORD

    Welcome to the Workshop on Binary Instrumentation and Applications 2009, the third in

    the series. Instrumentation is an effective technique to observe and verify program

    properties. This technique has been used for diverse purposes, from profile guided

compiler optimizations, to microarchitectural research via simulations, to enforcement of software security policies. While instrumentation can be performed at the source level as

well as the binary level, the latter has the advantage of being able to instrument the

    whole program, including dynamically linked libraries. Binary instrumentation also

    obviates the need to have source code. As a result, instrumentation at the binary level has

become immensely useful and is growing in popularity. This workshop provides an

    opportunity for developers and users of binary instrumentation to exchange ideas for

    building better instrumentation systems and new use cases for binary instrumentation,

    static or dynamic.

    The first session contains papers that use binary instrumentation to study and improve

hardware features. In "Studying Microarchitectural Structures with Object Code Reordering," Shah Mohammad Faizur Rahman, Zhe Wang and Daniel A. Jiménez

    describe an approach to understand microarchitectural structures by running different

    versions of the same program. Using branch predictors as an example, they vary the

    object code layout. In "Synthesizing Contention," Jason Mars and Mary Lou Soffa

    present a profiling approach for studying performance aspects of applications running on

    multi-core systems due to interference from other cores. The approach creates contention

    by running a synthetic application at the same time as the real application. In "Assessing

Cache False Sharing Effects by Dynamic Binary Instrumentation," Stephan M. Günther

    and Josef Weidendorfer use binary instrumentation to estimate the effects of false sharing

    in caches for multi-threaded applications.

    The second session contains papers that improve software systems. The ideas presented

    improve existing binary instrumentation techniques or use binary instrumentation to

    improve other software systems. In "Metaman: System-Wide Metadata Management,"

    Daniel Williams and Jack W. Davidson describe an infrastructure to store and access

meta-information for a program during its entire build process and runtime, thereby

    improving the build and runtime systems. In "A Binary Instrumentation Tool for the

    Blackfin Processor," Enqiang Sun and David Kaeli provide a static binary

    instrumentation system for embedded systems and use it to perform dynamic voltage and

    frequency scaling as a case study. In "Improving Instrumentation Speed via Buffering,"

    Dan Upton, Kim Hazelwood, Robert Cohn and Greg Lueck present a technique for

    reducing instrumentation overhead by decoupling data collection (i.e. instrumentation)

    from data analysis. Finally, in "ThreadSanitizer -- Data Race Detection In Practice,"

    Konstantin Serebryany and Timur Iskhodzhanov present a new tool based on Valgrind

    for detecting data races.


    Thank you for participating in WBIA. We would like to especially thank the program

    committee for their careful reviews and quick turnaround, and the organizers of Micro

    2009 for providing a venue for this workshop.

Robert Cohn

Jeff Hollingsworth

    Naveen Kumar

    Workshop organizers

    December, 2009

    Program Committee:

Derek Bruening, VMware

Bruce Childers, University of Pittsburgh

Robert Cohn, Intel

Saumya Debray, University of Arizona

Koen De Bosschere, Ghent University

Sebastian Fischmeister, University of Waterloo

Jeff Hollingsworth, University of Maryland

Robert Hundt, Google

Naveen Kumar, VMware

Greg Lueck, Intel

Nicholas Nethercote, Mozilla

Stelios Sidiroglou, MIT

Alex Skaletsky, Intel

Mustafa Tikir, SDSC


    Table of Contents

    Session 1: Instrumentation for improving hardware

    Studying Microarchitectural Structures with Object Code Reordering 7

Shah Mohammad Faizur Rahman, Zhe Wang and Daniel A. Jiménez

    Synthesizing Contention 17

    Jason Mars and Mary Lou Soffa

    Assessing Cache False Sharing Effects by Dynamic Binary Instrumentation 26

Stephan M. Günther, Josef Weidendorfer

    Session 2: Instrumentation for improving software

    Metaman: System-Wide Metadata Management 34

    Daniel Williams and Jack W. Davidson

    A Binary Instrumentation Tool for the Blackfin Processor 43

    Enqiang Sun and David Kaeli

Improving Instrumentation Speed via Buffering 52

Dan Upton, Kim Hazelwood, Robert Cohn and Greg Lueck

    ThreadSanitizer - data race detection in practice 62

    Konstantin Serebryany and Timur Iskhodzhanov


Studying Microarchitectural Structures with Object Code Reordering

Shah Mohammad Faizur Rahman    Zhe Wang    Daniel A. Jiménez

Department of Computer Science
The University of Texas at San Antonio

    {srahman,zhew,dj}@cs.utsa.edu

    ABSTRACT

Modern microprocessors have many microarchitectural features. Quantifying the performance impact of one feature such as dynamic branch prediction can be difficult. On one hand, a timing simulator can predict the difference in performance given two different implementations of the technique, but simulators can be quite inaccurate. On the other hand, real systems are very accurate representations of themselves, but often cannot be modified to study the impact of a new technique.

We demonstrate how to develop a performance model for branch prediction using real systems based on object code reordering. By observing the behavior of the benchmarks over a range of branch prediction accuracies, we can estimate the impact of a new branch predictor by simulating only the predictor and not the rest of the microarchitecture.

We also use the reordered object code to validate a reverse-engineered model for the Intel Core 2 branch predictor. We simulate several branch predictors using Pin and measure which hypothetical branch predictor has the highest correlation with the real one.

This study in object code reordering points the way to future work on estimating the impact of other structures such as the instruction cache, the second-level cache, instruction decoders, indirect branch prediction, etc.

1. INTRODUCTION

Last year at ASPLOS, Mytkowicz et al. showed us that, in essence, each of us studies a given program from our own very limited point of view [15]. We measure the properties of the program behavior and report results and conclusions. But if we step outside of that limited point of view, just a little, the conclusions could be very different. For each combination of program and compiler options, there is a whole world of points of view, each represented by a different perturbation of instruction and data addresses.

Their paper was provocatively named "Producing wrong data without doing anything obviously wrong!" The conclusion is that, if I look at a program from one point of view and you look at it from another, then any comparison of our results is meaningless, and we might even come to different conclusions about whether a given optimization is a good idea. We might even engage in measurement bias, i.e., we might come to believe that an accidental improvement in program behavior is due to our own brilliant technique.

However, by sampling and observing many of these points, we can get a much better understanding of program behavior. We can compare results with one another as well as answer interesting questions about programs and the processors that run them.

In this paper, we present two techniques based on the object code reordering work of Mytkowicz et al. A benchmark program is compiled into object files, and then many binary executable versions of that program are produced by linking the object files in different orders. Each binary is semantically equivalent, but because the instruction addresses are different, different conflicts will arise among microarchitectural structures such as the branch predictor and instruction cache. The situation is isomorphic to one in which we keep the binary executable constant, but change the hash functions for these microarchitectural structures. Based on this object code reordering work, we describe two techniques:

1. We demonstrate how to develop a performance model for SPEC CPU 2006 benchmarks running on the Intel Core 2 processor. The technique perturbs benchmark executables to yield a wide variety of performance points without changing program semantics or other important execution characteristics such as the number of retired instructions. By observing the behavior of the benchmarks over a range of branch prediction accuracies, we can estimate the impact of a new branch predictor by simulating only the predictor and not the rest of the microarchitecture.

2. We use the reordered object code to validate a reverse-engineered model for the Intel Core 2 branch predictor. Modern microprocessors come with sophisticated branch predictors. If the organization of the branch predictor is known to an architecture-aware compiler, it can use the information to improve program performance. Unfortunately, the organization of Intel branch predictors is not widely disclosed. Uzelac et al. introduced reverse engineering techniques using microbenchmarks to expose the organization of the Pentium M branch predictor [19]. We try to use similar techniques to get an idea about the organization


of the Intel Core 2 branch predictor. Those microbenchmarks provide a rough outline of the organization of the branch predictor, but there is no certain way to tell whether our assumptions are correct. We use the reordered object code to validate our hypothetical branch predictor organization. We first reverse engineered the Intel Core 2 branch predictor to find its organization. Then we use the reordered object code to validate the predictor and reveal its different attributes and characteristics. We simulate several branch predictors using Pin [10] and try to find the hypothetical branch predictor which has the highest correlation with the original one.

2. RELATED WORK

In this section we discuss related work.

2.1 Eliciting Performance Variance

Mytkowicz et al. introduce the technique of object file reordering for showing that different link orders of object files, as well as other seemingly random and harmless details of an experimental setup, can yield significantly different performance [15]. That work indicts the ASPLOS community for falling victim to measurement bias, i.e., allowing oneself to believe that some observed improvement in program behavior is due to one's own technique rather than a happy coincidence of experimental factors. Our work was partly inspired by Mytkowicz et al.; we choose to see the phenomenon they exposed as an interesting opportunity to develop a tool to examine microarchitectural behavior.

Rubin et al. propose a framework to explore the space of data layouts using profile feedback to find layouts that yield good performance [17]. They point out that the general problem of optimal data layout is NP-hard and poorly approximable. The space of data layouts is similar to the space of object file reorderings, and the impact of data layouts on the data cache is similar to the impact of code placement on the branch predictor and instruction cache.

2.2 Impact of Code Placement on Performance

The impact of code placement on performance has not gone unnoticed in the academic literature. Many code-improving transformations have been proposed based on code placement. Hatfield and Gerald [3], Ferrari [5], McFarling [11], Pettis and Hansen [16], and Gloy and Smith [6] present techniques to rearrange procedures to improve locality based on profile data. Most of these techniques use a profiled weighted call graph with edges weighted by call frequency. Procedures with high weight are placed close to one another to avoid conflict misses. Calder and Grunwald present branch alignment, an algorithm that seeks to minimize the number of taken branches by reordering code such that the hot path through a procedure is laid out in a straight line [1], thus minimizing the performance penalty of a discontinuous fetch. Their technique improves performance by an average of 5% on an Alpha AXP 21064. Young et al. present a near-optimal version of branch alignment [21]. Jiménez proposes a technique to use code placement to explicitly avoid branch mispredictions due to conflicts in the predictor tables [8]. Knights et al. propose exploiting fortuitous object code orderings to improve performance [9].

Our performance model technique is not an optimization, but a tool for peering inside the microarchitecture using code placement. If thoughtful code placement optimizations like those mentioned above were widely adopted, our results would show less variance in execution behavior and less confidence in the regression lines. Nevertheless, most production code is not optimized with code placement in mind; thus, our results are widely applicable to real systems.

2.3 Estimating Performance of Real Systems

Contreras and Martonosi use performance monitoring counters to develop a linear power model of the Intel XScale processor [2]. This approach can enable a technique capable of quickly estimating future power behavior and adapting to it at run-time. Our performance model technique is similar in that it uses performance monitoring counters to develop a model of program behavior. However, we focus on modeling the behavior of one program at a time to get very precise information about the change in performance in response to a small change in the behavior of microarchitectural structures, i.e., our work concentrates on a much finer level of granularity, and we focus on performance instead of power.

2.4 Reverse Engineering Branch Predictors

Milenkovic et al. [13] proposed a reverse engineering flow focusing on the P6 and NetBurst architectures and suggested the size and organization of the BTB and the presence and lengths of global and local histories. This flow does not include any experiments for determining the organization of predictor structures indexed by program path information, nor their internal operation. Uzelac et al. [19] introduced a new set of experiment flows to determine the organization and different attributes of the Pentium M microprocessor branch predictor.

However, all these experiment flows provide an idea about the organization of the branch predictors based on logical reasoning about their experimental results. These results could have been interpreted differently to come up with slightly different branch predictor organizations. Our object code reordering technique validates different simulated branch predictors to find the organization which is closest to the original branch predictor.

3. APPROACH

In this section we describe our approach to studying microarchitectural structures with object code reordering. The basic idea is to execute code under many different reorderings, causing a wide variance in performance due to different accidental collisions in microarchitectural structures. By measuring the resulting adverse microarchitectural events, we can build a performance model for the program and microarchitecture; we can also validate a reverse-engineered model for the branch predictor.

3.1 Instruction Addresses in Microarchitectural Structures

Our approach exploits the fact that several microarchitectural structures use a hash of instruction addresses. For example (a sketch of such index computations follows this list):

1. A 128-set instruction cache with 64-byte blocks would likely use bits 6 through 12 of the instruction address as the set index.

2. A branch direction predictor might index a table of counters using a combination of branch history and branch address bits.


3. A branch target buffer (BTB) or indirect branch predictor would use lower-order bits of the branch address to index a table of branch targets.
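As a concrete sketch of these three index computations (illustrative bit layouts only; the real hardware hashes are undocumented, and all names here are ours):

    #include <cstdint>
    #include <cstdio>

    // 1. Set index for a 128-set, 64-byte-block instruction cache:
    //    bits [5:0] are the block offset, bits [12:6] select the set.
    static uint32_t icache_set(uint64_t pc) { return (uint32_t)(pc >> 6) & 0x7F; }

    // 2. gshare-style direction predictor index: branch address bits
    //    XORed with the global branch history register.
    static uint32_t dir_index(uint64_t pc, uint32_t ghist, uint32_t bits) {
        return ((uint32_t)pc ^ ghist) & ((1u << bits) - 1);
    }

    // 3. BTB index: low-order branch address bits select a target entry.
    static uint32_t btb_index(uint64_t pc, uint32_t entries) {
        return (uint32_t)pc & (entries - 1);   // entries must be a power of two
    }

    int main() {
        uint64_t pc = 0x400a3cULL;             // an arbitrary instruction address
        std::printf("set=%u dir=%u btb=%u\n",
                    icache_set(pc), dir_index(pc, 0x2b5u, 12), btb_index(pc, 4096u));
        return 0;
    }

Relinking in a different order changes pc for every instruction, so all three indices, and hence the accidental collisions described next, change with it.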

Sometimes instruction addresses will accidentally collide in some microarchitectural structure. For example, conflict misses in the instruction cache occur when the number of blocks mapping to a particular set exceeds the associativity of the cache. Although this phenomenon has been studied in academic research, most compilers do not optimize to protect against these kinds of conflicts.

Compiler writers are aware of uses of instruction addresses and write compilers to exploit these uses. For instance, a common heuristic is to align the target of a branch on a boundary divisible by the number of bytes in a fetch block, to allow the fetch beginning at that target to read the maximum number of instruction bytes in one cycle.

3.2 A Wide Range in Performance

These accidental conflicts result in adverse microarchitectural events such as branch mispredictions, instruction cache misses, BTB misses, etc. A particular layout of the code will result in a particular number of accidental collisions with a particular impact on performance. A different layout will result in a different impact on performance. By exploring a wide range of layouts, we can force a wide range of adverse performance events to take place and explore a wide range of performances. Figure 1 shows the percent difference from average performance, as measured by CPI, caused by 100 random but plausible code reorderings for the SPEC CPU 2006 benchmarks running on an Intel Core 2 processor. The graph is a violin plot, showing the probability density at each CPI value, i.e., the thickness at each CPI value is proportional to the number of CPIs observed in that neighborhood. Clearly, some benchmarks are greatly affected by differences in instruction addresses while some are less sensitive.

3.3 Causing Collisions

To generate many random but plausible code layouts, we use the technique of Mytkowicz et al., i.e., object-file reordering. We compile each benchmark once, then link it 100 times, each with a different pseudo-randomly-generated order of the object files. The linker lays code out in the order in which it is encountered on the command line, so each random ordering results in a different code layout.

We then execute each resulting binary executable five times, collecting performance monitoring counter information such as the number of instructions committed, number of branch mispredictions, number of clock cycles, etc. We take the performance monitoring counter statistics that gave the median performance. Details of our infrastructure are given in Section 4.

    3.4 Making Predictions

Once the performance monitoring counter information has been collected, we can begin using statistical tools to build a performance model. We use least-squares linear regression to estimate the relationship between various microarchitectural events and performance outcomes.

Linear regression only works if we may confidently assume that there is a linear relationship. Under normal circumstances CPI and MPKI do indeed have a linear relationship: for each benchmark, there is an average misprediction penalty, and each extra misprediction increases the number of cycles by this penalty.

3.5 Validating Reverse-Engineered Branch Predictor

We execute each reordered object code in the native environment and also in our simulated branch predictor environment and measure the number of mispredictions. Different layouts of the object code of the same program will have considerably different numbers of mispredictions because the program counter values differ. By executing all of these object code reorderings on the different simulated branch predictors, we use the correlation coefficient to draw a comparison between the original and our simulated branch predictors.

4. EXPERIMENTAL METHODOLOGY

This section describes the experimental methodology used for this paper.

4.1 Compiler

We use the Camino compiler infrastructure [7]. This system is a post-processor for the GNU Compiler Collection (GCC) version 4.2.4. C, C++, and FORTRAN programs are compiled into assembly language, the assembly language is instrumented by Camino, and the result is assembled and linked into a binary executable. Camino features a number of profiling passes and optimizations, but for this study we implement and use only the profiling and instrumentation pass described below. All of the binary executables produced for this study target the x86_64 instruction set.

4.2 Benchmarks

We use the SPEC CPU 2006 benchmarks for this study. Of the 29 benchmarks, 23 compile and run without errors with our compiler infrastructure. These benchmarks are listed on the x-axes of several graphs in later sections.

    4.3 Generating Random Object Orderings

Each benchmark is compiled once with Camino. The resulting object files can be linked to make a binary executable. We use a program that accepts a seed to a pseudo-random number generator to generate pseudo-random but reproducible orderings of the object files. This program takes as input a list of object files and produces as output a linker command that links the object files in the pseudo-random order, as sketched below.
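A minimal sketch of such a generator (a hypothetical reimplementation; the authors' program is not reproduced here) seeds the C library PRNG, Fisher-Yates shuffles the object file list, and prints the link command:

    // linkorder.cpp - emit a reproducible pseudo-random link order from a seed.
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char *argv[]) {
        if (argc < 3) {
            std::fprintf(stderr, "usage: %s seed obj1.o obj2.o ...\n", argv[0]);
            return 1;
        }
        std::srand((unsigned)std::atoi(argv[1]));   // same seed => same ordering
        int n = argc - 2;
        char **objs = argv + 2;
        for (int i = n - 1; i > 0; i--) {           // Fisher-Yates shuffle
            int j = std::rand() % (i + 1);
            char *t = objs[i]; objs[i] = objs[j]; objs[j] = t;
        }
        std::printf("gcc -o a.out");                // assumed link driver
        for (int i = 0; i < n; i++) std::printf(" %s", objs[i]);
        std::printf("\n");
        return 0;
    }

Because the linker emits code in command-line order, each seed deterministically selects one code layout.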

4.4 System

We perform our study using four seven-processor Dell systems with identical configurations running the 64-bit version of Ubuntu Linux 8.04 Server and a custom compiled kernel with performance monitoring counter support. Each system contains seven quad-core Intel Xeon E5440 processors. The Intel 5400 series processors are based on the Intel Harpertown core, which is the same core used in quad-core Intel Core 2 processors. Each processor has 16GB of SDRAM and a 12MB second-level cache. Each core in the Intel Xeon E5440 processor has 32KB instruction and 32KB data caches. The branch predictor of the Intel Xeon E5440 is not documented, but through reverse-engineering experiments we have determined that it is likely to contain a hybrid of a GAs-style branch predictor and a bimodal branch predictor [20, 18, 4].


[Figure: violin plot; y-axis "Percent Difference in CPI" (roughly -2 to 4), x-axis the 23 SPEC CPU 2006 benchmarks, 400.perlbench through 483.xalancbmk.]

Figure 1: Violin plots for SPEC CPU 2006 percentage performance variation with object reordering.

4.5 Running with Performance Monitoring Counters

We measure a number of performance monitoring counters using the perfex command found in the PAPI performance monitoring package [14]. The Intel Core processor allows up to two user-defined microarchitectural events to be counted simultaneously. We are interested in more than two events, so we make multiple runs of each benchmark to collect all of the desired counters. We group the counters into three sets of two. For each set we run each benchmark five times and take the measurements given by the run with the median number of cycles. Only the microarchitectural events that occur while user code is running are counted, thus the impact of system events is minimized. We collect the following statistics:

    1. Retired branches mispredicted.

2. Retired x86 instructions excluding exceptions and interrupts.

    3. L1 instruction cache misses.

    4. L2 cache misses.

    5. Elapsed clock cycles.

From these counters, we can derive other statistics such as cycles per instruction (CPI), branch mispredictions per 1000 instructions (MPKI), various cache miss rates, etc.

Although each system is configured identically and each core has the same microarchitecture, we use the Linux taskset command to make sure that each benchmark always runs on the same core, to eliminate the effect of possible slight differences among the cores. Each run is performed on an otherwise quiescent system with as many system services stopped as possible without compromising the ability to access remote files and log in remotely. Stack address randomization, a security feature that resists stack-smashing attacks, is disabled to minimize performance variance not due to code placement.

4.6 Simulation

We develop several branch predictor simulators. We implement these as a tool in Pin [10]. We then run Pin on the same binary executables that we run natively. Our Pin tool instruments each branch with a callback to code that simulates a set of branch predictors. The tool counts the number of branches executed and the number of branches mispredicted for each predictor simulated.

4.7 Timing Concerns

Many of the SPEC CPU 2006 benchmarks run for over 30 minutes on the first ref input. For this study, we have executed each of the 23 benchmarks at least 100 times on a set of 4 computers. To facilitate this study, we instrument the benchmarks such that under native execution they run for up to approximately two minutes each. To do this, we implement a two-pass profiling and instrumentation pass in the Camino compiler. The first pass inserts instrumentation that collects information about each procedure. The benchmark is allowed to run for two minutes. Then the collected information is analyzed to find a procedure with a low dynamic count that is also executed near the end of the two-minute run. The second pass of the compiler instruments only that procedure such that when it is executed the same number of times as before, the program is ended. The first instrumentation has low overhead, thus the resulting executable runs for approximately two minutes. The second instrumentation affects a low-frequency procedure and takes two x86 instructions, thus it has negligible overhead. All of the binary executables in this study are compiled from this second instrumentation, or are from benchmarks that naturally run for less than two minutes. Because we are counting procedures and not elapsed time, each run of a benchmark executes the same number of user instructions.

5. BUILDING A PERFORMANCE MODEL FOR MICROARCHITECTURAL STRUCTURES

In this section we show how to build a performance model for branch prediction using object code reordering. We develop and evaluate regression models for a number of benchmarks. We explore using several characteristics of program behavior, such as branch prediction and cache misses.


[Figure: bar chart; y-axis "Correlation Coefficient" (-0.5 to 1.0) of CPI with branch mispredictions, L1 instruction cache misses, L2 cache misses due to fetch, and the combined estimator, for each benchmark and the arithmetic mean.]

Figure 2: Correlation coefficients of various measurements.

5.1 Statistics

We make use of some statistical techniques in the performance model study. We briefly review these techniques:

1. Correlation coefficients. Also known as Pearson's r, correlation coefficients range from -1.0 to 1.0 and measure the correlation between two random variables (see the formulas following this list). A higher magnitude for r means a higher degree of correlation. A negative value of r means that the two variables are negatively correlated.

2. Coefficient of determination. The sample coefficient of determination, computed as r^2 where r is the correlation coefficient, gives the fraction of dependence of a given observation on an underlying factor.

3. Linear regression. We develop several estimators of performance using linear least-squares regression, which finds a best-fit equation of a line between two variables; e.g., we find a linear equation in terms of MPKI that estimates CPI. The best fit minimizes the sum of squared errors between the regression line and the observed data. We also use multi-linear least-squares regression to produce an estimator for performance in terms of several observed variables.

4. Hypothesis testing. We use Student's t-test for hypothesis testing. We formulate a null hypothesis, e.g. "there is no correlation between CPI and MPKI," then use hypothesis testing to see if the null hypothesis can be rejected. We consider a result significant if the null hypothesis can be rejected with p = 0.05, i.e., at the 95% confidence level. Student's t-test gives a meaningful result in the presence of normally distributed data.

5. Confidence intervals and prediction intervals. For the linear regression lines, we plot 95% confidence intervals and 95% prediction intervals, which are closely related to the t-test mentioned above. A 95% confidence interval has a 95% chance of containing the true regression line, i.e., of all the data collected, the line that best illustrates the linear relationship between CPI and MPKI has a 95% chance of being in that confidence interval. The larger 95% prediction interval has a 95% chance of containing all of the observations (i.e., CPIs) that would be encountered in a given domain (i.e., set of MPKIs).

5.2 Establishing Correlation

Object file reordering can elicit a wide range of CPIs for our benchmarks, but before we can make use of this we must establish that there is correlation between the microarchitectural events measured and the performance observed. We focus on what we believe to be the microarchitectural events most likely to be affected by code placement:

1. Branch mispredictions. Conditional branch predictors use the address of an instruction to index one or more tables. If two or more branches conflict with one another in these tables, a phenomenon called aliasing [12], branch prediction accuracy can suffer.

2. L1 instruction cache misses. The Intel Xeon core has a 32KB 8-way set associative instruction cache. If nine or more frequently used blocks map to the same set, there will be frequent cache misses.

    3. L2 cache misses.

4. We also use multi-linear regression to develop a combined model that takes into account all three of these events, in the hope that a combined model will be more accurate than using one of the observations by itself.

Figure 2 shows the correlation coefficients between CPI and the counts of these events. Clearly, for most benchmarks, branch prediction is most significantly correlated with performance. Some benchmarks show negative correlations between CPI and events. The reason is that branch mispredictions and cache misses sometimes prefetch useful data, which benefits performance. It must be emphasized that the correlation we report between microarchitectural events and performance is with respect to object code ordering. Other changes to the execution environment would show other correlations. For instance, if we allowed the operating system to do stack address randomization, we would observe significant variance in L1 data cache misses, and a commensurate impact on performance.

5.3 Assigning Blame

Using r^2, the coefficient of determination, we can determine what portion of performance is due to a particular microarchitectural event.


[Figure: stacked bar chart; y-axis "Coefficient of Determination" (0.0 to 1.0) for L2 cache misses, L1 instruction cache misses, branch mispredictions, and the combined estimator, for each benchmark and the arithmetic mean.]

Figure 3: Coefficient of determination showing how much of each type of event accounts for overall performance.

Figure 3 shows the cumulative r^2 for each of the three events, as well as r^2 for the combined regression model. Some benchmarks are more sensitive; for instance, 84.2% of the CPI variance of 462.libquantum is due to branch mispredictions. For the average model, 82.3% of the variance can be blamed on branch prediction.

The average bar for the combined model does not reach exactly the same height as that of the sum of the three measurements. This is because the three measurements are not altogether independent of one another; for instance, in some cases, a branch misprediction might cause an L1 cache event, sometimes causing cache pollution and other times causing prefetching.

5.4 Establishing Statistical Significance

Clearly, the performance of many benchmarks shows correlation with microarchitectural events. However, we must ask whether the correlation is statistically significant. It could be the case that enough accidental correlation exists to make any models derived from these events meaningless.

We use Student's t-test to determine statistical significance. For each of the three measurements, as well as the combined model, we attempt to reject the null hypothesis that there is no correlation. The value p <= 0.05 for the t-test is traditionally accepted as evidence of statistical significance.

Table 1 shows "yes" for each combination of measurement and benchmark where the null hypothesis can be rejected with at most p = 0.05, i.e., where with 95% probability there is correlation between CPI and the measurement for that benchmark.

    5.4.1 Blame the Branch Predictor

Of the 23 benchmarks, 13 show significant correlation between CPI and branch prediction. In other words, for over half of the benchmarks, we determined that there was at least a 95% chance that our performance model technique found significant correlation between CPI and MPKI. For the other benchmarks, there was not enough range of MPKI to predict CPI.

No other measurement consistently shows statistically significant correlation with CPI. The combined estimator does not increase the number of benchmarks showing significant correlation. Thus, in this paper we focus our attention on branch prediction.

    5.4.2 Other Measurements

The object file reordering methodology clearly elicits a large impact on branch prediction.

Benchmark        Branch   L1 I-Cache   L2 Cache   Combined
Name             MPKI     Misses       Misses     Estimator
-----------------------------------------------------------
400.perlbench    yes      yes          -          yes
401.bzip2        -        -            -          -
403.gcc          yes      yes          yes        yes
410.bwaves       -        -            -          -
416.gamess       yes      -            yes        yes
429.mcf          yes      -            -          yes
433.milc         -        -            -          -
434.zeusmp       -        -            yes        yes
435.gromacs      yes      -            -          yes
436.cactusADM    -        -            -          -
444.namd         -        -            yes        -
445.gobmk        yes      yes          -          yes
450.soplex       -        -            yes        -
454.calculix     -        -            yes        yes
456.hmmer        yes      -            -          yes
459.GemsFDTD     -        -            -          -
462.libquantum   yes      -            yes        yes
464.h264ref      yes      -            -          yes
465.tonto        yes      yes          -          yes
471.omnetpp      yes      -            -          yes
473.astar        yes      -            yes        yes
482.sphinx3      -        -            -          -
483.xalancbmk    yes      -            -          yes

Table 1: "Yes" means that the null hypothesis of no correlation is rejected with p <= 0.05, i.e., with 95% probability, the given measurement is correlated with CPI.

In future work, we will explore other methods of manipulating binaries to emphasize other measurements, such as instruction cache and data cache misses, and we will study the impact of other events dependent on code placement. For instance, the number and type of x86 instructions in a 32-byte fetch block on the Opteron has a large impact on decoding efficiency, but at this point it is not clear how to measure that impact.

  • 8/3/2019 Proceedings of the Workshop on Binary Instrumentation and Applications 2009

    13/72

    13

5.5 A Linear Performance Model

We use least-squares linear regression to derive branch prediction performance models for each of the benchmarks that passed the hypothesis testing phase. For each benchmark, we find the best fit of the observed data to a regression line y = mx + b, where y is CPI and x is MPKI. The slope (m) gives the performance cost of one additional MPKI and the y-intercept (b) gives the predicted average CPI for perfect branch prediction, i.e., 0 MPKI.

We also derive 95% confidence intervals and 95% prediction intervals for the regression lines. Figure 4 shows the regression line and intervals for 400.perlbench. The confidence interval has a 95% chance of containing the true regression line for the data observed. The much wider prediction interval has a 95% chance of containing future observations. Linear regression allows us to make the following predictions for this benchmark with 95% probability:

1. A perfect branch predictor would yield a CPI of 0.517 ± 0.029, an improvement of 26.0% ± 4.2%.

2. Halving the average MPKI from 6.50 to 3.25 would improve CPI by 13.0% ± 2.2%, from 0.70 to 0.61 ± 0.022.

3. A 10% improvement in CPI due to branch prediction improvement would require a 38% reduction in mispredictions.

Table 2 shows the slopes and y-intercepts found by linear regression for each benchmark. It also shows the high and low prediction intervals for perfect prediction.
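As a sketch of the fitting step itself (the closed-form least-squares solution; the numbers in main are illustrative placeholders, not measured data):

    // Least-squares fit of CPI = m * MPKI + b over the reordering samples.
    #include <cstdio>

    static void fit(const double *x, const double *y, int n, double *m, double *b) {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        *m = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope: CPI cost per MPKI
        *b = (sy - *m * sx) / n;                         // intercept: CPI at 0 MPKI
    }

    int main() {
        double mpki[] = {6.1, 6.4, 6.5, 6.8, 7.0};       // illustrative values only
        double cpi[]  = {0.68, 0.69, 0.70, 0.71, 0.72};
        double m, b;
        fit(mpki, cpi, 5, &m, &b);
        std::printf("CPI = %.3f * MPKI + %.3f\n", m, b); // b estimates perfect-prediction CPI
        return 0;
    }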

6. REVERSE ENGINEERING BRANCH PREDICTOR AND VALIDATION

In this section we discuss the techniques and results of our branch predictor reverse engineering experiment. First, we try to develop an overview of the branch predictor and make some educated guesses about its organization. Then we use the object code reordering technique to verify which of these branch predictor organizations most closely resembles the original one.

[Figure: scatter plot of performance counter measurements; x-axis "Mispredictions per 1000 Instructions" (6.0 to 7.0), y-axis "Cycles per Instruction" (0.68 to 0.72), with the least-squares regression line and the 95% confidence and prediction intervals.]

Figure 4: Regression line and 95% confidence and prediction intervals for 400.perlbench.

Benchmark        Slope    y-intercept   Low      High
------------------------------------------------------
400.perlbench    0.028    0.517         0.488    0.546
403.gcc          0.028    1.839         1.796    1.882
416.gamess       0.041    0.548         0.519    0.577
429.mcf          0.019    4.675         4.531    4.819
435.gromacs      0.020    0.811         0.795    0.827
445.gobmk        0.019    0.643         0.515    0.771
456.hmmer        0.041    0.203         0.032    0.375
462.libquantum   0.022    1.432         1.433    1.431
464.h264ref      0.032    0.466         0.451    0.481
465.tonto        0.027    0.632         0.617    0.647
471.omnetpp      0.036    1.901         1.860    1.941
473.astar        0.022    2.373         2.289    2.456
483.xalancbmk    0.029    1.914         1.881    1.947

Table 2: Least-squares regression model relating branch prediction to performance. Shows high and low prediction intervals for perfect prediction, i.e., 0 MPKI.


6.1 Demystifying the Branch Predictor

Because the branch predictor has a substantial impact on the performance of a program, it is regarded as one of the most important components of a modern microprocessor. For this very reason, to achieve the best performance, branch predictors are becoming more and more complex: not only are multiple predictors combined to form a hybrid predictor, but complex hash functions are also used to index the different tables. Thus, it is difficult to determine the exact configuration. Using microbenchmarks as proposed in [19] and [13], we can make an educated guess. We use techniques similar to those proposed by Uzelac et al. to determine the attributes of the Intel Core 2 branch predictor.

To determine the global history length, we use the following experiment. We generate patterns of lengths ranging from 2 to 16. For each pattern length we generate 20 samples. For each sample we execute the microbenchmarks of Figures 5 and 6 five times and measure the number of mispredictions using the Intel VTune Performance Analyzer.

int main(void) {
    int pattern[MAX_LENGTH], i, j, a = 0;
    read_pattern(pattern, length);
    for (i = 0; i < ...        /* remainder of the listing is cut off in this copy */

Figure 5: Microbenchmark used to probe the global history length (listing truncated in the source).


int main(void) {
    int pattern[MAX_LENGTH], i, j, a = 0, b = 0;
    read_pattern(pattern, length);
    for (i = 0; i < ...        /* remainder of the listing is cut off in this copy */

Figure 6: Variant microbenchmark with a second counter (listing truncated in the source).


[Figure: scatter plot of mispredictions measured on the real predictor vs. the simulated predictor for the 14 benchmarks, clustered along the diagonal.]

Figure 8: Correlation between actual and simulated branch predictor

We calculate the correlation coefficient between the MPKI from the original branch predictor and the mispredictions of the simulated branch predictors. We want to find a configuration that provides a correlation coefficient close to 1.0.

Figure 8 shows a graph plotted with data recorded from the original branch predictor and our simulated one for all 14 benchmarks. From the graph we can see that most of the benchmarks are very close to the diagonal line that represents the ideal correlation. The only benchmark that strays a little away from the line is 445.gobmk. The best correlation coefficient we found so far is 0.993, achieved by configuration 1, which gives a high level of confidence.

[Figure: diagram of the hash used to access the global history table: program counter bits and global history bits combined to form the table index.]

Figure 10: Global history table access hash

The structure contains a 16K-entry table of bimodal two-bit saturating counters indexed by 12 bits from the program counter and 12 bits from the global history, with 10 bits XORed together (Figure 10). We also found that there is a 128-entry, 2-way set associative loop predictor. Using this object code reordering technique we successfully managed to validate our assumptions about the branch predictor.
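Read literally, that description admits the following index computation (a minimal sketch of our interpretation: 12 program counter bits and 12 history bits overlapped so that 10 bits are XORed, yielding the 14-bit index a 16K-entry table requires; the exact bit positions are an assumption):

    // Hypothetical index into the 16K-entry table of two-bit counters.
    unsigned gtable_index(unsigned pc, unsigned hist) {
        unsigned pc12   = pc   & 0xFFFu;          // low 12 program counter bits
        unsigned hist12 = hist & 0xFFFu;          // 12 global history bits
        return (pc12 ^ (hist12 << 2)) & 0x3FFFu;  // bits 2..11 overlap and XOR; 14-bit result
    }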

7. FUTURE WORK

We see great potential in this technique to enhance our ability to accurately model microarchitectural behavior. We plan five main thrusts for our future work:

1. We will apply this technique to other microarchitectural structures such as the instruction cache, the second-level cache, instruction decoders, indirect branch prediction, etc. Preliminary evidence we have gathered indicates that acquiring more data points will lead to statistically significant estimates of performance from L1 instruction cache misses for many benchmarks.

2. Code placement is easily manipulated by object code reordering, but this technique is sometimes insufficient to elicit a wide range of program behaviors. We will investigate other ways to affect code placement. We will also look at other factors, such as the placement of data in the program text, on the stack, and with the memory allocator, to allow using binary interferometry to estimate the performance impact of the data caches.

3. We will validate this approach with other microarchitectures, such as the Intel Core i7, and other processors as we are able to acquire access to them.

4. We will investigate how to determine the best number of samples to choose for each benchmark, to balance accurate estimation against computationally intensive collection of measurements and simulation.

5. We will use this technique to validate reverse-engineered branch predictor models and will try to incorporate more branch prediction features, like update policy and hash function accuracy. Using this technique, we can be sure that the reverse-engineered model behaves exactly the same as the real predictor.

8. CONCLUSION

We have presented a technique for developing a performance model for programs running on a given microarchitecture, based on the effect of code placement on certain microarchitectural structures. We have shown that a reverse-engineered branch predictor model can be successfully validated using this technique.

    9. REFERENCES

[1] Brad Calder and Dirk Grunwald. Reducing branch costs via branch alignment. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, October 1994.


[2] Gilberto Contreras and Margaret Martonosi. Power prediction for Intel XScale processors using performance monitoring unit events. In ISLPED '05: Proceedings of the 2005 International Symposium on Low Power Electronics and Design, pages 221-226, New York, NY, USA, 2005. ACM.

[3] D. J. Hatfield and J. Gerald. Program restructuring for virtual memory. IBM Systems Journal, 10(3):168-192, 1971.

[4] M. Evers, P.-Y. Chang, and Y. N. Patt. Using hybrid branch predictors to improve branch prediction accuracy in the presence of context switches. In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.

[5] Domenico Ferrari. Improving locality by critical working sets. Communications of the ACM, 17(11):614-620, November 1974.

[6] Nikolas Gloy and Michael D. Smith. Procedure placement using temporal-ordering information. ACM Transactions on Programming Languages and Systems, 21(5):977-1027, September 1999.

[7] Chunling Hu, John McCabe, Daniel A. Jiménez, and Ulrich Kremer. The Camino compiler infrastructure. SIGARCH Computer Architecture News, 33(5):3-8, 2005.

[8] Daniel A. Jiménez. Code placement for improving dynamic branch prediction accuracy. In Proceedings of the ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation (PLDI), pages 107-116, June 2005.

[9] Dan Knights, Todd Mytkowicz, Peter F. Sweeney, Michael C. Mozer, and Amer Diwan. Blind optimization for exploiting hardware features. In CC '09: Proceedings of the 18th International Conference on Compiler Construction, pages 251-265, Berlin, Heidelberg, 2009. Springer-Verlag.

[10] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 190-200, New York, NY, USA, 2005. ACM.

[11] Scott McFarling. Program optimization for instruction caches. In Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, pages 183-191. ACM, 1989.

[12] Pierre Michaud, André Seznec, and Richard Uhlig. Trading conflict and capacity aliasing in conditional branch predictors. In Proceedings of the 24th International Symposium on Computer Architecture, pages 292-303, June 1997.

[13] M. Milenkovic, A. Milenkovic, and J. Kulick. Microbenchmarks for determining branch predictor organization. Software: Practice and Experience, pages 465-488. John Wiley & Sons, 2004.

[14] Shirley Moore, David Cronk, Felix Wolf, Avi Purkayastha, Patricia Teller, Robert Araiza, Maria Gabriela Aguilera, and Jamie Nava. Performance profiling and analysis of DoD applications using PAPI and TAU. In DOD UGC '05: Proceedings of the 2005 Users Group Conference, page 394, Washington, DC, USA, 2005. IEEE Computer Society.

[15] Todd Mytkowicz, Amer Diwan, Matthias Hauswirth, and Peter F. Sweeney. Producing wrong data without doing anything obviously wrong! In ASPLOS '09: Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 265-276, New York, NY, USA, 2009. ACM.

[16] Karl Pettis and Robert C. Hansen. Profile guided code positioning. In Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, pages 16-27, June 1990.

[17] Shai Rubin, Rastislav Bodík, and Trishul Chilimbi. An efficient profile-analysis framework for data-layout optimizations. In POPL '02: Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 140-153, New York, NY, USA, 2002. ACM.

[18] James E. Smith. A study of branch prediction strategies. In Proceedings of the 8th Annual International Symposium on Computer Architecture, pages 135-148, May 1981.

[19] Vladimir Uzelac and Aleksandar Milenkovic. Experiment flows and microbenchmarks for reverse engineering of branch predictor structures. In ISPASS, pages 207-217. IEEE, 2009.

[20] T.-Y. Yeh and Yale N. Patt. Two-level adaptive branch prediction. In Proceedings of the 24th ACM/IEEE International Symposium on Microarchitecture, pages 51-61, November 1991.

[21] Cliff Young, David S. Johnson, David R. Karger, and Michael D. Smith. Near-optimal intraprocedural branch alignment. In Proceedings of the SIGPLAN '97 Conference on Programming Language Design and Implementation, June 1997.


    Synthesizing Contention

Jason Mars
University of Virginia

    [email protected]

Mary Lou Soffa
University of Virginia

    [email protected]

    ABSTRACT

Multicore microarchitecture designs have become ubiquitous in today's computing environment, enabling multiple processes to execute simultaneously on a single chip. With these new parallel processing capabilities comes a need to better understand how co-running applications impact and interfere with each other. The ability to characterize and better understand cross-core performance interference can prove critical for a number of application domains, such as performance debugging, compiler optimization, and application co-scheduling, to name a few. We propose a novel methodology for the characterization and profiling of cross-core interference on current multicore systems, which we call contention synthesis. Our profiling approach characterizes an application's cross-core interference sensitivity by manufacturing contention with the application and observing the impact of this synthesized contention on the application.

Understanding how to synthesize contention on current chip microarchitectures is unclear, as there are a number of potentially contentious data access behaviors. This is further complicated by the fact that current chip microprocessors are engineered and tuned to circumvent the contentious nature of certain data access behaviors. In this work we explore and evaluate five designs for a contention synthesis mechanism. We also investigate how these five contention synthesis engines impact the performance of 19 of the SPEC2006 benchmarks on two state-of-the-art chip multiprocessors, namely Intel's Core i7 and AMD's Phenom X4 architectures. Finally, we demonstrate how contention synthesis can be used to accurately characterize an application's cross-core interference sensitivity.

    Categories and Subject Descriptors

D.1.3 [Programming Techniques]: Concurrent Programming - parallel programming; D.3.4 [Programming Languages]: Processors - run-time environments, compilers, optimization, debuggers; D.4.8 [Operating Systems]: Performance - measurements, monitors


    General Terms

    Performance, Contention, Multicore

    Keywords

cross-core interference, profiling framework, program understanding

1. INTRODUCTION

Multicore architectures are quickly becoming ubiquitous in today's computing environment. With each new generation of general purpose processors, much of the performance improvement in microarchitectural design is typically achieved by increasing the number of individual processing cores on a single chip. However, on-chip resources and the memory subsystem are typically shared among many cores, resulting in a potential performance bottleneck when scaling up multiprocessing capabilities.

Current multicore architectures include early levels of small core-specific private caches, and larger caches shared among multiple cores [8]. When multiple processes or threads run in tandem, they can contend for the shared cache by evicting a neighboring process's or thread's data in order to cache their own data. This form of contention occurs when the working sets of the neighboring processes or threads exceed the size of the private caches, forcing reliance on the shared cache resources. This contention can result in a significant degradation in application performance.

When an application suffers a performance degradation due to contention for shared resources with an application on a separate processing core, we call this cross-core interference.

The ability to characterize an application's sensitivity to cross-core interference can prove indispensable for a number of applications. Compiler and optimization developers can use knowledge about the contention sensitivity of a particular code region to develop both static and dynamic contention-conscious compiler techniques and optimization heuristics. For example, knowledge of contention-sensitive code regions can be used to direct where software cache prefetching should be applied. Software developers can also use this cross-core interference sensitivity information for performance debugging and software tuning. The ability to characterize the most contention-sensitive application phases allows the programmer to pinpoint parts of the application on which to focus. In addition to performance debugging, software tuning, and compiler optimization, the characterization of application sensitivity to cross-core interference can enable smarter, contention-conscious dynamic and online scheduling techniques.


[Bar chart: slowdown (1x to 1.4x) for each SPEC2006 benchmark and the mean, with one bar each for the Intel Core i7 Quad and the AMD Phenom X4.]

    Figure 1: Performance impact due to contention from co-location with LBM.

In addition to performance debugging, software tuning, and compiler optimization, the characterization of application sensitivity to cross-core interference can enable smarter, contention-conscious dynamic and online scheduling techniques. For example, an understanding of an application's contention sensitivity characteristics enables contention-conscious application co-scheduling: applications that place higher demands on shared memory resources can be co-located with applications that have lower demands, improving performance and throughput.

While ad-hoc and indirect approaches, such as measuring cache hits and misses via performance counters, can give a coarse indication of cross-core interference sensitivity, they are not sufficient to provide accurate and detailed profiling information about on-chip contention. Monitoring shared cache misses directly is insufficient because not all cache misses reported by the hardware performance monitoring unit are misses that relate to the code's data accesses. In modern processors many misses reported by the hardware monitors are caused by hardware prefetches and hardware page table walks [8]. These effects do not relate to contention, yet they are indistinguishable from the cache misses that do.

We propose a cross-core interference profiling environment that uses contention synthesis. To accurately characterize and profile an application's sensitivity to cross-core interference, we synthesize contention, meaning we synthetically create contention with the host application. This contention synthesis is achieved by synthetically applying pressure on the shared cache using a contention synthesis engine (CSE). The profiling framework manipulates the execution of the CSE while observing the effect on the host application, measuring the impact over time, and assigning an impact score to the application. However, how to synthesize contention on current chip microarchitectures is unclear, as there are a number of differing contentious data access behaviors, and current chip microprocessors are engineered and tuned to circumvent the poor cache performance of certain data access behaviors.

In this work we explore the design space of our contention synthesis engine and investigate how contention synthesis can be used to characterize cross-core performance interference on modern multicore architectures. We have designed and evaluated five contention synthesis mechanisms that mimic five common data access behaviors: the random access of elements in a large array, the random traversal of large linked data structures, a real-world fluid dynamics application (the lbm SPEC2006 benchmark), data movement in 3D object space commonly found in simulations and scientific computing, and finally, a contention synthesis engine constructed by reverse engineering lbm, finding its most contentious code, and further tweaking it to produce a highly contentious synthesis engine. In addition to presenting the design and implementation of these contention synthesis methods, we investigate how these five contention synthesis engines impact the performance of 19 of the SPEC2006 benchmarks on two state-of-the-art chip multiprocessors, Intel's Core i7 and AMD's Phenom X4 architectures. We also answer a number of questions as to whether the cross-core interference properties of applications tend to remain consistent regardless of the particular contention synthesis method chosen. Finally, we demonstrate how contention synthesis can be used to accurately characterize an application's cross-core interference sensitivity.

The contributions of this work include:

• The discussion of a novel methodology for the characterization of cross-core performance interference on current multicore architectures.

• The design and implementation of five contention synthesis mechanisms.

• The evaluation and study of the impact of these five contention synthesis mechanisms on 19 SPEC2006 benchmarks on both the Intel Core i7 and AMD Phenom X4 architectures.

• The cross-core interference sensitivity scoring of the SPEC2006 benchmarks.

Next, in Section 2, we motivate our work. We then provide an overview of our profiling approach in Section 3. Section 4 presents our contention synthesis methodology. We evaluate our contention synthesis approach in Section 5. Section 6 presents related work, and finally, we conclude in Section 7.

2. MOTIVATION

With the recent growth in popularity of multicore architectures comes an increase in the parallel processing capabilities of commodity systems. These commodity systems are now commonplace in the general purpose desktop and laptop markets as well as in industry data-centers and computing clusters. Companies such as Google, Yahoo, and Microsoft use these off-the-shelf computing components to build their data-centers, as they are cheap, abundant, and easily replaceable [3]. The increase in the parallel processing capabilities of these chip architectures is in fact leading to server consolidation.


However, the memory wall is preventing these parallel processing capabilities from being fully realized.

The memory subsystem on current commodity multicore architectures is shared among the processing cores. Two representative examples of state-of-the-art multicore chip designs are the Intel Core i7 Quad Core chip and AMD's Phenom X4 Quad Core. Intel's Core i7 has four processing cores, each with a private 32KB L1 cache and a 256KB L2 cache. A large 8MB L3 cache is shared among the four cores [8]. AMD's Phenom X4 also has four cores with a similar cache layout. Each core has a private 64KB L1 and 512KB L2, with a shared 6MB L3 cache. These chips were designed to accommodate four simultaneous streams of execution. However, as we show through experimentation, their shared caches and memory subsystem often cannot efficiently accommodate even two co-running processes.

Figure 1 illustrates the potential cross-core interference that can occur when multiple co-running applications execute on current multicore architectures. We perform the following experiment on the Core i7 and Phenom X4 architectures. In this experiment we study the cross-core performance interference caused to each of the SPEC2006 benchmarks when co-running with lbm, one of the SPEC2006 benchmarks known to be especially heavy on the on-chip memory subsystem. Figure 1 shows the slowdown of each benchmark due to this cross-core interference. Each application was executed to completion on its ref inputs. On the y-axis we show the execution time of the application while co-running with lbm, normalized to the execution time of the application running alone on the system. The first bar in Figure 1 presents this data for the Core i7 architecture, and the second bar for the Phenom X4. As this graph shows, there are severe performance degradations due to cross-core interference on a large number of SPEC benchmarks. The large last-level on-chip caches of these two architectures do little to accommodate many of these co-running applications. On a number of benchmarks, including lbm, mcf, omnetpp, and sphinx, this degradation approaches 35%.
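In other words, the slowdown plotted for each benchmark is the ratio

    slowdown = T_co-run / T_alone

where T_co-run is the execution time while co-running with lbm and T_alone is the execution time running alone; a bar at 1.35x thus corresponds to a 35% increase in execution time.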

In addition to the general performance degradation, this sensitivity to cross-core interference is particularly undesirable for real-time and latency-sensitive application domains. In the latency-sensitive domain of web search, for instance, this kind of cross-core interference can cause unexpected slowdowns, negatively impacting the QoS of a search query. A commonly used solution is to simply disallow the co-location of latency-sensitive applications with others on a single machine, resulting in lowered utilization and higher energy costs [14].

Note that not all applications are affected by the contention properties of their co-runners. Applications such as hmmer, namd, and povray seem immune to lbm's cross-core interference. This observation shows that cross-core interference sensitivity varies substantially across applications.

3. PROFILING FRAMEWORK

Contention and cross-core interference can only occur dynamically, and depend on a combination of the application's memory behavior, the design of the particular underlying architecture, and the applications co-running on this microarchitecture at any particular time. Because of these properties, characterizing this sensitivity using static analyses is intractable.

[Diagram: two cores above a shared cache; the host application and the CiPE profiler run on one core, with the contention synthesis engine on the neighboring core.]

    Figure 2: Our Profiling Framework

Also, the traditional profiling of the application in isolation is not sufficient, as no contention is actually occurring. Although it is possible to observe the performance counters of current architecture designs to analyze cache misses and similar events, these are indirect means of inferring an application's memory behavior, again with no actual contention or cross-core interference occurring. Our methodology is to profile the application with real contention in a controlled environment, where we manufacture contention using a contention synthesis mechanism and dynamically monitor and profile the impact on the host application.

Figure 2 shows an overview of our cross-core interference profiling environment. This figure shows a multicore architecture with two separate cores sharing an on-chip cache and memory subsystem. The shaded boxes show our profiling framework, which is composed of the profiler runtime and a contention synthesis engine (CSE). As shown on the left side of Figure 2, the host application is controlled by the profiler runtime and is monitored throughout its execution. Before the execution of the host application, the profiler spawns the contention synthesis engine on a neighboring core, as shown on the right of the figure. This CSE shares the cache and memory subsystem of the host application. As the application executes, the CSE aggressively accesses memory, trying to cause as much cross-core interference as possible. The profiler manipulates the execution of the contention synthesis engine, allowing bursts of execution to occur. Slowdowns in the application's instruction retirement rate that result from this bursty execution are monitored using the hardware performance monitoring (HPM) information [8]. This intermittent control of the CSE and monitoring of the HPM is achieved using the periodic probing approach [15]. A timer interrupt is used to periodically execute the monitoring and profiling directives. This has been shown to be a very low overhead approach for the dynamic monitoring and analysis of applications.
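This control loop can be sketched as follows, assuming the CSE runs as a child process that the profiler pauses and resumes with SIGSTOP/SIGCONT; the ./cse binary name and the elided counter read are placeholders, not the actual CiPE implementation:

    #include <signal.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    static pid_t cse_pid;        /* contention synthesis engine process */
    static volatile int cse_on;  /* is the CSE currently executing?     */

    /* Timer interrupt: toggle the CSE between burst and quiet phases,
       sampling the host's instruction retirement rate in each phase.  */
    static void probe(int sig) {
        (void)sig;
        kill(cse_pid, cse_on ? SIGSTOP : SIGCONT);
        cse_on = !cse_on;
        /* ... read the instructions-retired counter and log a sample ... */
    }

    int main(void) {
        cse_pid = fork();
        if (cse_pid == 0) {                    /* child: run the CSE   */
            execl("./cse", "cse", (char *)0);  /* placeholder binary   */
            _exit(1);
        }
        cse_on = 1;
        signal(SIGALRM, probe);
        struct itimerval t = { {0, 1000}, {0, 1000} };  /* 1 ms period */
        setitimer(ITIMER_REAL, &t, 0);
        while (1)
            pause();  /* the host application runs under the profiler
                         runtime; this sketch only drives the CSE      */
        return 0;
    }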

4. SYNTHESIZING CONTENTION

Many types of applications cause cache contention. With the continuing advances in microarchitectural design, simply accessing a large amount of data does not necessarily imply high pressure on cache and memory performance. The type of data access pattern and the way that data is mapped into the cache are very important to consider when constructing the contention synthesis engine. Structures such as hardware cache prefetchers and victim caches can avert poor and contentious cache behavior even when the working set of the application is very large.


The features and functionality of these hardware techniques are difficult to anticipate, as vendors keep these details closely guarded. Access patterns that exhibit a large amount of spatial or temporal locality can easily be prefetched into the earlier and later levels of the cache.
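To make this distinction concrete, consider the two access kernels below (a minimal sketch of our own; the sizes and names are illustrative). The stride-1 scan exhibits exactly the spatial locality that a hardware prefetcher predicts, while the pointer-chase walk makes every address depend on the previous load, leaving the prefetcher nothing to run ahead of:

    #define N (64 * 1024 * 1024)

    /* Stride-1 scan: spatial locality lets the prefetcher hide most
       miss latency, so a huge footprint alone causes little pressure. */
    long scan(char *a) {
        long sum = 0;
        for (long i = 0; i < N; i++)
            sum += a[i];
        return sum;
    }

    /* Dependent random walk: next[] holds a random permutation of
       0..N-1, so each address is known only after the previous load
       completes, and most accesses miss in the cache. */
    long walk(long *next, long steps) {
        long i = 0;
        while (steps-- > 0)
            i = next[i];
        return i;
    }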

    4.1 Designing Contention Synthesis

To design our contention synthesis engine we explored and experimented with a number of common data access patterns. These designs consist of the random access of elements in a large array, the random traversal of large linked data structures, a real-world fluid dynamics application (the lbm SPEC2006 benchmark), data movement in 3D object space commonly found in simulations and scientific computing, and finally, a highly contentious synthesis engine we call The Sledgehammer, constructed by reverse engineering lbm, finding its most contentious code, and further tweaking it.

    4.1.1 Naive

Figure 3 shows the C implementation of our naive contention synthesis mechanism. It simply reads and writes randomly chosen elements of a large array in an infinite loop.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    char *data;

    int main(int argc, char *argv[]) {
        int footprint = 8192 * 1024;       /* working set in bytes */
        srand(time(0) + getpid());
        if (argc > 1)
            footprint = atoi(argv[1]);
        data = (char *)malloc(footprint);
        while (1) {
            /* touch a randomly chosen element of the large array */
            int i = rand() % footprint;
            data[i] = (char)(data[i] + 1);
        }
        return 0;
    }

4.1.2 Binary Search Tree

Figure 4 shows the C implementation of our binary search tree contention synthesis mechanism.

    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define payloadsize 64

    typedef struct treenode {
        struct treenode *left, *right;
        unsigned long data;
        char text[payloadsize];
    } treenode;

    /* Visit the subtrees in a random order, reading and modifying
       each node's payload as the traversal passes through it.     */
    unsigned long trample(treenode *p) {
        unsigned long ret = 0;
        if (rand() % 2) {
            if (p->left)  ret += trample(p->left);
            if (p->right) ret += trample(p->right);
        } else {
            if (p->right) ret += trample(p->right);
            if (p->left)  ret += trample(p->left);
        }
        ret += (unsigned long)p->text[p->data % payloadsize];
        p->data += ret;                    /* modify the data */
        p->text[p->data % payloadsize] = p->data % 256;
        return ret;
    }

    treenode *insert(treenode *t, unsigned long key) {
        if (!t) {
            t = (treenode *)calloc(1, sizeof(treenode));
            t->data = key;
            return t;
        }
        if (key < t->data)
            t->left = insert(t->left, key);
        else
            t->right = insert(t->right, key);
        return t;
    }

    int main(int argc, char *argv[]) {
        int footprint = 8192;              /* number of tree nodes */
        treenode *root = 0;
        srand(time(0) + getpid());
        if (argc > 1)
            footprint = atoi(argv[1]);
        for (int i = 0; i < footprint; i++)
            root = insert(root, rand());
        while (1)
            trample(root);
        return 0;
    }

A large tree of randomly keyed nodes is built, and the trample function then repeatedly performs a random traversal of the tree, touching and changing the data along the way.



    4.1.3 LBM from SPEC2006

The implementation of the LBM benchmark can be found in the official SPEC2006 benchmark suite [7]. LBM is an implementation of the Lattice Boltzmann Method, a technique used to simulate incompressible fluids. We selected this benchmark as one of our synthesis mechanisms as it proved to be one of the most contentious of the SPEC2006 benchmark suite. For a complete description of LBM please refer to [7].

    4.1.4 3D Data Movement

Figure 5 shows the C++ implementation of our 3D data movement contention synthesis mechanism.

    #include <iostream>
    #include <cstdlib>

    using namespace std;

    const int nugsize = 128;

    // A nugget is a small block of payload data that is moved
    // through the 3D object space.
    class nugget {
    public:
        char n[nugsize];
        nugget() {
            for (int i = 0; i < nugsize; i++)
                n[i] = rand() % 256;
        }
    };


[Bar chart: slowdown (1x to 1.6x) for each SPEC2006 benchmark and the mean, with one bar per contention synthesis engine: LBM Core, Naive, Sledge, Blockie, and BST.]

    Figure 7: Slowdown caused by contention synthesis on Intel Core i7.

[Bar chart: slowdown (1x to 1.6x) for each SPEC2006 benchmark and the mean, with one bar per contention synthesis engine: LBM Core, Naive, Sledge, Blockie, and BST.]

    Figure 8: Slowdown caused by contention synthesis on AMD Phenom X4.

These benchmarks were run on their ref inputs and were compiled with GCC 4.4 on the Linux 2.6.31 kernel. Figure 7 shows the results when performing this co-location on Intel's Core i7 Quad architecture, and Figure 8 shows these results on AMD's Phenom X4 Quad. The bars show the slowdown when co-located with naive random access (Naive), the binary search tree (BST), the lbm benchmark (LBM Core), the 3D block data movement (Blockie), and our sledgehammer technique (Sledge), in that order. The lbm benchmark can be viewed as a baseline against which to compare the other synthesis engines. We believe lbm to be a good point of reference as it is a naturally occurring example of contentious application behavior. It is clear from the graphs that the Naive and BST approaches produce the smallest amount of contention. Note, however, that they do an adequate job of indicating the applications that are most sensitive to cross-core interference. The contention produced by these two approaches is low because there is a good bit of computation between individual memory accesses. Blockie and Sledge touch large amounts of data in a single swoop and with less computation. Note that our Blockie and Sledge techniques are more effective than using the most contentious of the SPEC benchmarks.

Across the two architectures the general trend is similar, although we do see some differences. Applications that tend to be sensitive to contention tend to be uniformly so across these two representative architectures. We also see that the various contention synthesis designs rank similarly on both architectures. This general trend supports our hypothesis that contention is agnostic across this class of commodity multicore architectures.

Although the general trend is the same, there are some clear differences. For example, the benchmark most sensitive to cross-core interference differs between the two architectures. On Intel's architecture mcf shows the most significant degradation in performance, while on AMD's architecture lbm is the clear winner (or loser). These variations are due to the idiosyncrasies of each microarchitectural design. The key observation is that the effectiveness of the contention synthesis designs is mostly uniform across the different benchmark workloads. This trend supports our hypothesis that contention, in addition to being generally agnostic across this class of commodity multicore architectures, is also agnostic across the varying workloads and memory access patterns present in SPEC.

In the following section we select Sledge as the main CSE for our profiling framework, as it most vividly illustrates contention across the entire benchmark suite.

5.2 Characterizing Cross-Core Interference

To characterize an application's cross-core interference sensitivity, our profiling framework spawns the contention synthesis engine (CSE) on a neighboring core. As the application executes, the profiling runtime directs the CSE to produce short bursts of contentious execution. For every millisecond of execution, the profiler pauses the CSE for one millisecond. Slowdowns in the application's instruction retirement rate that result from this bursty execution are monitored using the instructions_retired hardware performance counter. A cross-core interference sensitivity score is then calculated as the average of these slowdowns.
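In symbols (our notation for the scheme just described): if R_quiet(i) and R_burst(i) denote the instruction retirement rates measured in the i-th quiet and burst interval pair, the cross-core interference sensitivity (CIS) score over N interval pairs is

    CIS = (1/N) * Σ_{i=1..N} (R_quiet(i) - R_burst(i)) / R_quiet(i)

A score of 0.2x, for example, indicates that the CSE's bursts reduce the application's retirement rate by 20% on average.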


[Bar chart: Performance Impact (slowdown) and characterization Score, from 0x to 0.35x, for each SPEC2006 benchmark and the mean.]

    Figure 9: Comparing Characterization Score Trend to Actual Cross-core Interference on Intel Core i7.

[Bar chart: Performance Impact (slowdown) and characterization Score, from 0x to 0.35x, for each SPEC2006 benchmark and the mean.]

    Figure 10: Comparing Characterization Score Trend to Actual Cross-core Interference on AMD Phenom X4.

Figures 9 and 10 show the cross-core interference sensitivity scores calculated using the described method for all C/C++ benchmarks in SPEC2006, compared against the performance degradation when each benchmark co-runs with lbm, on both the Intel Core i7 and the AMD Phenom X4. Our results show that, generally, an application's cross-core interference sensitivity score correlates strongly and proportionally with its performance degradation (e.g., a lower CIS score indicates a smaller degradation, and vice versa). Note that Figures 9 and 10 are intended to demonstrate the trend of how the cross-core interference sensitivity score relates to the application's actual degradation due to cross-core interference. These figures display a strong trend, indicating that our approach is indeed accurately characterizing cross-core interference sensitivity. However, there are three relative exceptions: sphinx, xalan, and astar. These three benchmarks have very clear phases that appear to reduce the accuracy of the averaged score. One possible way to address this challenge is to increase the periodic probing interval length. Studying their phase-level cross-core interference sensitivity scores would also give more insight into their dynamic sensitivity.

We have also experimented with using the change in last-level cache misses per cycle to detect contention and measure cross-core interference. Our results show that it is a worse indicator than directly measuring performance degradation using IPC. Last-level cache misses alone are not always a good indicator either. For example, although an application with a large number of cache misses per cycle due to its heavy cache reliance may in fact be sensitive to cross-core interference, it could also be insensitive if it already experiences heavy cache misses when running alone. In this latter case cross-core interference does not hurt its already poor cache performance. This often occurs when the working set of the application greatly exceeds the size of the last-level cache.

6. RELATED WORK

In this paper we present a profiling and characterization methodology for program sensitivity to cross-core interference on modern CMP architectures. Related to our work is a cache monitoring system for shared caches [25], which proposes novel hardware designs to facilitate a better understanding of how applications interact and contend when running together. Similar to our work, that system is then used for profiling and program behavior characterization. However, in contrast to our methodology, it requires hardware extensions and thus is evaluated using simulations. Our methodology and framework are applicable to current commodity multicore architectures. In addition, our framework is not limited to cache contention, but covers any contention in the memory system that impacts performance and can be produced by our CSE.

In recent years, cache contention has received much research attention. Most works focus on exploring the design space of caches and memory, proposing novel hardware solutions or management policies to alleviate the contention problem. Hardware techniques and related algorithms to enable cache management, such as cache partitioning and memory scheduling, have been proposed [22, 12, 19, 16, 4]. Other hardware solutions to guarantee fairness and QoS include [17, 10, 20].


Related to novel cache designs and architectural support, analytical models to predict the impact of cache sharing have also been proposed [2]. In addition to new hardware cache management, approaches to managing the shared cache through the OS include [21, 5]. Instead of a novel hardware or software solution to managing shared caches, our work focuses on the other side of the problem, namely an application's inherent sensitivity to interference on existing modern microarchitectures.

The idea of profiling applications and learning about their memory-related characteristics from the profile to improve performance and develop more effective compiler optimizations is rather common. Much work has been done on constructing general frameworks for the memory profiling of applications [18, 9], and on profiling techniques and methods that use such profiles to improve performance or to help develop better compilers and optimizations [9, 24]. Our work differs in that we focus on profiling program behavior in the presence of cross-core interference.

Contention-conscious scheduling schemes that guarantee fairness and increase QoS for co-running applications or multi-threaded applications have been proposed [13, 6, 1]. Fedorova et al. used cache model prediction to enhance the OS scheduler to provide performance isolation [6]. There are also theoretical studies that investigate approximation algorithms to optimally schedule co-running jobs on CMPs [11, 23].

7. CONCLUSION

In this paper, we present a methodology for profiling an application's sensitivity to cross-core performance interference on current multicore microarchitectures by synthesizing contention. Our profiling framework is composed of a lightweight runtime environment on which a host application runs, along with a carefully designed contention synthesis engine that executes on a neighboring core. We have explored and evaluated five contention synthesis mechanisms: the random access of elements in a large array, the random traversal of large linked data structures, a real-world fluid dynamics application, data movement in 3D object space commonly found in simulations and scientific computing, and finally a highly contentious synthesis engine constructed by reverse engineering lbm, finding its most contentious code, and further tweaking it. We have presented the design and implementation of these contention synthesis mechanisms and demonstrated their impact on the SPEC2006 benchmark suite on two real-world multicore architectures. Finally, we demonstrate how contention synthesis can be used dynamically, using a bursty execution method, to accurately characterize an application's cross-core interference sensitivity.

8. ACKNOWLEDGEMENTS

We would like to acknowledge the important input from Robert Hundt and Neil Vachharajani of Google. Their work and ideas have been instrumental in better understanding the problem of contention, especially as it relates to the data-center and cluster computing domains. We would also like to acknowledge the contribution of Lingjia Tang of the University of Virginia. Her comments and insights have been very helpful to the work and the paper. Finally, we would like to thank Google for the funding support for this project.

9. REFERENCES

[1] M. Banikazemi, D. Poff, and B. Abali. Pam: a novel performance/power aware meta-scheduler for multi-core systems. In SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–12, Piscataway, NJ, USA, 2008. IEEE Press.

[2] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 340–351, Washington, DC, USA, 2005. IEEE Computer Society.

[3] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):1–26, 2008.

[4] J. Chang and G. S. Sohi. Cooperative cache partitioning for chip multiprocessors. In ICS '07: Proceedings of the 21st Annual International Conference on Supercomputing, pages 242–252, New York, NY, USA, 2007. ACM.

[5] S. Cho and L. Jin. Managing distributed, shared L2 caches through OS-level page allocation. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 455–468, Washington, DC, USA, 2006. IEEE Computer Society.

[6] A. Fedorova, M. Seltzer, and M. D. Smith. Improving performance isolation on chip multiprocessors via an operating system scheduler. In PACT '07: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, pages 25–38, Washington, DC, USA, 2007. IEEE Computer Society.

[7] J. L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4):1–17, 2006.

[8] Intel Corporation. IA-32 Application Developer's Architecture Guide. Intel Corporation, Santa Clara, CA, USA, 2009.

[9] M. Itzkowitz, B. J. N. Wylie, C. Aoki, and N. Kosche. Memory profiling using hardware counters. In SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 17, Washington, DC, USA, 2003. IEEE Computer Society.

[10] R. Iyer, L. Zhao, F. Guo, R. Illikkal, S. Makineni, D. Newell, Y. Solihin, L. Hsu, and S. Reinhardt. QoS policies and architecture for cache/memory in CMP platforms. In SIGMETRICS '07: Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 25–36, New York, NY, USA, 2007. ACM.

[11] Y. Jiang, X. Shen, J. Chen, and R. Tripathi. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In PACT '08: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 220–229, New York, NY, USA, 2008. ACM.

[12] S. Kim, D. Chandra, and Y. Solihin. Fair cache sharing and partitioning in a chip multiprocessor


architecture. In PACT '04: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pages 111–122, Washington, DC, USA, 2004. IEEE Computer Society.

[13] R. Knauerhase, P. Brett, B. Hohlt, T. Li, and S. Hahn. Using OS observations to improve performance in multicore systems. IEEE Micro, 28(3):54–66, 2008.

[14] S. Lohr. Demand for data puts engineers in spotlight. The New York Times, 2008. Published June 17th.

[15] J. Mars and R. Hundt. Scenario based optimization: A framework for statically enabling online optimizations. In CGO '09: Proceedings of the 2009 International Symposium on Code Generation and Optimization, pages 169–179, Washington, DC, USA, 2009. IEEE Computer Society.

[16] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith. Fair queuing memory systems. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 208–222, Washington, DC, USA, 2006. IEEE Computer Society.

[17] K. J. Nesbit, J. Laudon, and J. E. Smith. Virtual private caches. In ISCA '07: Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 57–68, New York, NY, USA, 2007. ACM.

[18] N. Nethercote and J. Seward. Valgrind: a framework for heavyweight dynamic binary instrumentation. In PLDI '07: Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 89–100, New York, NY, USA, 2007. ACM.

[19] M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 423–432, Washington, DC, USA, 2006. IEEE Computer Society.