Comparing Doom 3, WarCraft III, PBRT, and MESA Using Micro-architecturally Independent Characteristics

Jiayuan Meng, Henry Cook, Kevin Skadron, Dee A.B. Weikle
Department of Computer Science

University of Virginia
{jm6dg, hmc3z, skadron, dweikle}@cs.virginia.edu

Technical Report Number: CS-2007-04

February 10, 2007

Abstract

Computer games have become a driving application in the personal computer industry. For computer architects designing general purpose microprocessors, understanding the characteristics of this application domain is important to meet the needs of this growing market demographic. In addition, games and 3D-graphics applications are some of the most demanding personal computer programs. Understanding the characteristics of these applications is complicated, though, by their competitive market. Source code is generally unavailable, and benchmarks are geared more toward performance in the gaming world than toward understanding program characteristics important in architecture. To facilitate our architecture research, we have performed a characterization of two popular games, Doom 3 and Warcraft III, and compared them to two publicly available programs, PBRT (Physically Based Ray Tracer) and MESA, one of the SPEC2000 benchmarks for 3D graphics, using microarchitecturally-independent metrics. The dynamic execution of the programs was analyzed with modified Macintosh development tools, and a comparison was made with principal component analysis techniques. We found that the characteristics of these games differ from each other and from the publicly available programs, particularly in memory reference patterns and in their use of specialized instructions. From this preliminary investigation, we determine that games have unique characteristics, but that PBRT is more similar to both games than MESA is.

1 Introduction

Understanding the variety and individual characteristics of application domains is critical to designing high performance architectures not only for the scientific community, but also for the personal computer market. In the personal computer market, this task is complicated by the business goals of the companies who write popular software. To remain competitive, these companies must keep private the source code for their applications, along with any techniques they use to make their programs faster, more useful, or (as is often the case for games) more exciting and realistic than their competition. This secrecy makes it very difficult for researchers in computer architecture, especially those in academia, to understand the low-level operation of these applications well enough to design general-purpose architectures to support them.

With current trends toward multi-core architectures, the research community has a rare opportunity to shape the essential foundation of the next era of general purpose architectures. This opportunity makes it essential to understand the general characteristics of applications, because with multi-core architectures the number of possible design solutions increases exponentially, making it impossible to run detailed simulations and experiments that span the design space. By characterizing applications, the community as a whole gains access to similarities that can be used to group similar applications and run fewer simulations to predict performance. Micro-architecturally independent metrics enable new visions of general-purpose designs by illuminating characteristics of the application, regardless of the platform on which it is executed. In addition, characterizations of applications can be used to identify new, unique benchmarks, and to combine them in the most efficient ways possible for architecture experiments and analysis.

Computer games are a huge market segment and design driver in the personal computer industry. They are computationally intensive, often including artificial intelligence, physical simulation, realistic graphics, and audio components operating in parallel. In addition, their interactive nature introduces real-time latency constraints. As a result, they stress personal computers more than almost any other consumer workload. Games today are CPU bound, not GPU bound. They are also one of the few workloads likely to scale to multiple cores. Yet, detailed evaluation of these games for the purpose of architecture design is largely unexplored in academia. In this paper we perform a preliminary characterization of two popular games, Doom 3 and Warcraft III, and compare them to two publicly available programs, PBRT (Physically Based Ray Tracer) and MESA, one of the SPEC2000 benchmarks. Our goal is to find publicly available programs that we could include in a benchmark suite to represent one or more games. We compare them using the micro-architecturally independent characteristics described in Eeckhout, Sampson, and Calder [3]. Each application is run on a PowerPC G4 processor for as long as possible to obtain basic block vector profiles as described in Sherwood, Perelman, and Calder [11]. Modified Macintosh development tools [4] and the Turandot [7] simulator are used to complete the analysis on the best simulation point as determined by SimPoint [8]. The results are then compared to determine what, if any, similarities exist between these applications.

The contributions of this paper are:

1) We perform a characterization of sections of Doom 3, Warcraft III, PBRT, and MESA to determine architecturally independent similarities.

2) We use a method that does not rely on source code to perform the characterization, and make the Macintosh code for this analysis generally available.

3) We show preliminary results of a principal component analysis comparing our data to the data obtained by Eeckhout, Sampson, and Calder [3] on the SPEC2000 benchmark suite using the Alpha instruction set architecture (ISA), where we use the Power ISA. The differences between the versions of MESA indicate that such a comparison may not be completely appropriate, motivating a need for better methods of comparison for applications across ISAs.

The rest of this paper is organized as follows: Section 2 explains related work, Section 3 gives background information on the different applications being analyzed, Section 4 describes the methodology used in depth, Section 5 shows results of the analysis, Section 6 enumerates specific conclusions, and Section 7 outlines plans for future work.

2 Related Work

Eeckhout, Sampson, and Calder [3] use the microarchitecturally-independent characteristics outlined in Phansalkar et al. [9] to analyze the SPEC2000 benchmark suite, and describe a method of reducing simulations with this suite to a subset of the most unique sections to obtain the most accurate results possible with the least simulation time. They emphasize the importance of performing microarchitecturally-independent characterizations to enable applications to drive the architecture and for unique characteristics of applications to drive their inclusion in benchmark suites. Phansalkar et al. [9] use these characteristics to analyze the similarities of the SPEC benchmark suites over time and determine subsets of the SPEC suites to run. They find that there exists a great deal of similarity across all of the SPEC suites, with the SPEC2000 suite slightly more varied than previous suites. The main differences over time in these benchmarks are the increase in dynamic instruction count and increasingly poor temporal data locality. Both Eeckhout and Phansalkar use principal component analysis, as we do here, to determine similarity with a wide variety of parameters. They use k-means clustering to determine similar clusters of benchmarks, while we use a distance measure on the principal components to compute distance from our target applications. The only other work using microarchitecturally-independent metrics we are aware of is the work by Driesen et al. [?], where they characterize several small familiar programs such as fibonacci, nqueens, tower, quicksort, etc. using two metrics: footprints and unified prediction profiles for the memory system. They do show instruction mixes, but the similarity ends there.

Due to the closed-source nature of games, the architecture research community has not performed many characterizations of these programs. Two papers, by Bangun, Dutkiewicz, and Anido [?] and by Feng, Chang, Feng, and Walpole [?], characterize the network traffic of popular on-line multi-player games. Feng et al. emphasize the importance of the class of games known as "first-person shooters" and include Doom in their list of such games. They demonstrate that the traffic consists of large, highly periodic bursts of small packets and that it targets the saturation of the narrowest, last-mile link to enable the slowest player to stay in the game. Bangun et al. focus on the distribution of the traffic for different games, noting that the distributions are dependent on the game but in some cases independent of the number of players. Mitra
and Chiueh [?] perform a characterization of 3D graphics workloads and discuss the architectural implications for graphics processors. They use three sets of benchmarks, including the Viewperf OpenGL performance evaluation benchmarks, a time-demo of Quake II, and three VRML 1.0 models. The characteristics that they evaluate have little relationship to those we use and consist of graphics-specific metrics including single-frame geometry bandwidth, total primitives, number of vertices per primitive, resolution requirements, and rasterization requirements. None of the above provide the kind of detailed general-purpose analysis we perform here.

3 Application Descriptions

When considering which applications to include in our study, we identified games that are representative of the popular trends in structure, design and graphics. To this end, we chose two bestselling games from the past three years available on the Macintosh architecture: Doom 3 and Warcraft III: Reign of Chaos. Doom 3 is a science-fiction/horror first person shooter (FPS) game. Warcraft III is a fantasy real-time strategy (RTS) game. Both applications are several years old, but they both feature graphics and game play which were cutting edge at the time of their release and which still create challenging workloads for some modern hardware. These two games present a representative slice of the demands placed on the workstations of gaming users.

PBRT is an open-source physically-based ray tracing program published by Pharr and Humphreys in 2004 [10]. It produces photo-realistic rendering results offline rather than in real time. Nevertheless, it provides an open reference in the context of physically-based rendering, which is one of the main interests in modern games. MESA is a free OpenGL work-alike library. Since it supports a generic frame buffer, it can be configured to have no operating system or windowing system dependencies. Any number of client programs can be written to stress floating point, scalar or memory performance (or a mix).

There are no good open-source game benchmarks. The goal in analyzing these open-source programs is to investigate whether they have similar characteristics to the selected game applications. Without access to source code, it is difficult to analyze how much computational effort the games put into rendering and related graphics tasks, as compared to tasks associated with animation, simulation or artificial intelligence. By comparing the most representative phase of these games' execution with those of the open-source rendering programs, we attempt to determine whether the rendering programs can serve as valid proxies for their commercial counterparts. Uncovered architecturally independent similarities will allow us to have greater confidence in open-source benchmarks as representative of modern industry trends, while any differences will provide insight into how we can better design experiments characterizing game programs. More details are provided on each application in the subsections below.

3.1 Doom 3

As mentioned above, Doom 3 is a science-fiction/horror first person shooter which was developed by id Software and published by Activision in August 2004; the Macintosh version was released in March 2005. (See http://www.doom3.com/) Like most modern games, Doom 3 uses multiple threads to accomplish the different tasks involved in managing and presenting the game. These tasks include artificial intelligence strategies both to assist and oppose the player, receiving and processing user input, score and record keeping calculations, and realistic image creation and rendering.

Doom 3 is famous not only for its custom-built graphics engine, but also for the way this engine is employed to create surprisingly immersive and life-like environments and characters [5]. Games from the first person shooter genre generally emphasize fast-paced action, requiring quick reflexes and high levels of hand-eye coordination on the part of the user, and Doom 3 is no exception. For this reason, popular benchmarks frequently use statistics such as "frames per second" to measure a game's performance on a particular hardware system [1]. For the user to remain satisfied with the virtual experience provided, a minimum apparent speed of execution and interactivity must be maintained. However, statistics like frames per second are more indicative of the capabilities of a specific hardware configuration than they are helpful in understanding the underlying characteristics of a given application.

Due to our machines' capabilities, and to place maximal load on the processor rather than the graphics card, the program is run with the graphical quality set to "Low" but with all image quality features enabled. These settings were also the ones recommended for the specific hardware, and represent the setup which would likely be chosen by a typical user. Generally the effects of a "Low" quality graphics setting consist of compressed mapping and textures [1].

3.2 Warcraft III

Warcraft III is a fantasy real-time strategy (RTS) game produced and developed by Blizzard Entertainment which
was released in July 2002. Real-time strategy games place the user in control of entire armies or cities over the course of extended tactical encounters and conflicts. Warcraft III is also a multi-threaded application, and must perform tasks similar to those outlined for Doom 3 above. Due to the scale and perspective of its gameplay, Warcraft III places less emphasis on immersive or realistic graphics, but the sheer number of units needing to be rendered and managed can still create heavy workloads. For this reason, frames per second is still the favored performance measurement among industry benchmarkers ([2], for example). We choose to run Warcraft III with the default graphical settings for our hardware configuration.

While Doom 3's core gameplay is based on forcing the user to make frequent high-speed decisions, in the Warcraft gameplay paradigm the user is primarily in charge of macromanagement, or the long-term planning and tactical aspects of the game. The individual actions of units controlled by both players and computer opponents are all directed on a second-to-second scale by the game's artificial intelligence functionality. In general, this creates an environment where user input affects the flow of application execution only over longer time scales, but consistent and rapid feedback rates are still required for user satisfaction.

3.3 PBRT

Ray tracing is a long-standing fundamental technique in computer graphics. This technique works to render a 3D scene by casting rays of light from the camera, tracing their paths (which might be reflected or bent by intervening objects and surfaces), and computing the final color of each ray. We chose PBRT (Physically Based Ray Tracer) [10] as a representative ray tracing program due to its open-source accessibility, broad coverage of ray tracing techniques, and well-designed infrastructure. PBRT is a photo-realistic rendering program written by Pharr and Humphreys for their book, Physically Based Rendering [10]. The techniques implemented in PBRT include camera simulation, ray-object intersections, light distributions, visibility, surface scattering, recursive ray tracing, and ray propagation. Sampling theory and the Monte Carlo integration method are two significant components of its operation.

PBRT's main rendering loop contains four main components: the sampler, the camera, the integrators and the film. It takes as input a scene file, which contains information for the 3D model (mainly triangle lists), light source information, camera position, etc. The sampler then creates a set of sample points on the film plane. Stratified sampling is often used to generate the sampling patterns. For each sample point, the camera generates a ray, which travels from the film through the lens to the object world. The ray then passes through several physically modeled processes. It is tested for intersection with 3D objects. On an intersection, one or multiple new rays are generated for reflection and refraction. When the ray passes through translucent volumes such as jade or smoke, scattering will also take place.

Radiance is calculated by the integrators according to the light source, object material, and other physical constraints during any intersections or along the ray's path. Occlusion tests are also made in order to generate shadows. Integration is the key operation in rendering tasks such as volume scattering and soft shadows from area lights. Monte Carlo integration is a method for using random sampling to estimate the values of the integrals, which generally do not have analytic solutions.
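To make the integration idea concrete, here is a minimal, generic Monte Carlo estimator in Python (not PBRT code; the helper name mc_integrate is ours): the integral of f over [a, b] is approximated by averaging f at uniformly random sample points and scaling by the interval length, which is the same principle PBRT applies, with far more sophisticated sampling, to the rendering integrals.

import math
import random

def mc_integrate(f, a, b, samples=10_000):
    # Average the integrand at uniform random points, then scale by (b - a).
    total = sum(f(random.uniform(a, b)) for _ in range(samples))
    return (b - a) * total / samples

# Example: the integral of sin(x) over [0, pi] is exactly 2.
print(mc_integrate(math.sin, 0.0, math.pi))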

Finally, the integrators send the sample ray and its associated radiance to the film, which stores the contribution for that ray in the image. When all the samples are processed, the final image is generated. The output image is in OpenEXR format, which preserves the image's high dynamic range information. The format developers' web site, www.openexr.org, has full details on the image file format.

3.4 MESA

MESA is a free OpenGL work-alike 3D graphics library authored by Brian E. Paul and written in ANSI C. Since it supports a generic frame buffer it can be configured to have no operating system or windowing system dependencies. Any number of client programs can be written to stress floating point, scalar or memory performance (or a mix thereof). Output can be written to image files for verification. The input data is a 2D scalar field. The scalar data is mapped to height, creating a 3D object with explicit vertex normals. Contour lines are mapped onto the object as a 1D texture. The output is a 2D image file in PNG format. More information can be found at http://www.mesa3d.org/.

4 Methodology

All applications are run on an Apple Macintosh Mini with a 1.5 GHz PowerPC G4 7447A processor and a Radeon 9200 graphics card. We use a modified version of the Macintosh tracing tool Amber, called AmberBBV, to trace as many instructions as possible in each application while collecting basic block vectors (BBVs) as described in Sherwood, Perelman, and Calder [11].
The initial block size used is 100M instructions. This detailed basic block information is then translated to a BBV file with 600M-instruction intervals. Since our comparison and characterization involve games, which can run forever, it is not feasible to analyze every interval and use clustering analysis as described in [3]. Therefore, we analyze only the most representative 600M interval in the trace of 30 billion instructions, the longest trace we are able to capture. SimPoint [8] is used for phase analysis to determine the starting simulation point for the highest weighted 600M interval. In the final tracing step, Amber is used again to trace this interval. We use the Macintosh tool acid and Turandot (a PowerPC architecture simulator [7]) to capture all of the microarchitecturally-independent characteristics described below. Instruction mix, working set and data stream strides can be retrieved from acid, which produces memory access information at both the instruction level and the program level. A modified version of Turandot is used to capture information about register traffic, branch predictability and instruction-level parallelism.

Several aspects of this process are explained in more detail in the subsections below. Section 4.1 outlines the specific characteristics identified and how our method may differ from that of Eeckhout, Sampson, and Calder [3]. Section 4.2 describes the issues and methods behind capturing traces from the interactive game applications. Section 4.3 details the changes made to the Macintosh development tools and the Turandot simulator to capture the microarchitecturally-independent characteristics used for our evaluation. Finally, Section 4.4 gives an overview of the statistical analysis methods used to analyze the gathered data.

4.1 Characteristics

The microarchitecturally-independent characteristics used in our evaluation consist of instruction mix, instruction-level parallelism, register traffic, working set size, data stream strides, and branch predictability. We attempt to match the measures used by Eeckhout, Sampson, and Calder [3] and Phansalkar, Joshi, Eeckhout, and John [9] as closely as possible. A brief description of these characteristics is included here, but for details please refer to the original papers.

Instruction mix. Our focus is on games and rendering programs, where there are often a significant number of vector operations. Therefore, we slightly modified Eeckhout et al.'s approach and include the percentage of loads, stores, branches, integer operations, floating-point operations, cache control instructions, and vector operations.

Instruction-level parallelism (ILP). To quantify the amount of ILP, we consider an idealized out-of-order processor model where only the window size is limited. We compute the number of independent instructions within the current window for window sizes of 32, 64, 128 and 256 in-flight instructions. No-ops are not counted as independent instructions. Moreover, when measuring dependence due to memory accesses, we assume that reads and writes are performed in blocks of 32 bytes.
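As an illustration only (hypothetical helper names, not the authors' Turandot module), the following Python sketch mirrors this idealized model: instructions issue as soon as their producers have issued, every latency is one cycle, only the window size limits parallelism, and memory dependences are tracked at 32-byte block granularity.

from collections import namedtuple

# srcs/dsts: register identifiers; loads/stores: byte addresses touched.
Inst = namedtuple("Inst", "srcs dsts loads stores")

def window_ilp(trace, window=128, block=32):
    total_insts = total_cycles = 0
    for start in range(0, len(trace) - window + 1, window):
        ready = {}                      # producer (register or memory block) -> issue cycle
        depth = 0                       # length of the dependence chain in this window
        for inst in trace[start:start + window]:
            deps = [ready.get(r, 0) for r in inst.srcs]
            deps += [ready.get(("m", a // block), 0) for a in inst.loads]
            cycle = 1 + max(deps, default=0)
            for r in inst.dsts:
                ready[r] = cycle
            for a in inst.stores:
                ready[("m", a // block)] = cycle
            depth = max(depth, cycle)
        total_insts += window
        total_cycles += depth
    return total_insts / total_cycles if total_cycles else 0.0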

Register traffic. The register characteristics collected include the average number of input operands per instruction, the average degree of use, and register dependency distance statistics. The average degree of use is defined as the average number of times a register instance is read after it is written. The register dependency distance is the number of dynamic instructions between writing a register and reading it.
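A sketch of this bookkeeping, again with hypothetical names and reusing the Inst records from the ILP sketch above: it tracks, for each register instance, how many times it is read before being overwritten (degree of use) and the dynamic-instruction distance from each write to its reads, binned into power-of-two buckets.

from collections import Counter

def register_traffic(trace):
    last_write = {}        # register -> (index of defining write, reads so far)
    uses = []              # reads per register instance (degree of use)
    dist = Counter()       # dependency-distance histogram, keyed by power-of-two bucket
    for i, inst in enumerate(trace):
        for r in inst.srcs:
            if r in last_write:
                w, n = last_write[r]
                last_write[r] = (w, n + 1)
                d = i - w
                bucket = 1
                while bucket < d:      # smallest power of two >= distance
                    bucket *= 2
                dist[bucket] += 1
        for r in inst.dsts:
            if r in last_write:
                uses.append(last_write[r][1])
            last_write[r] = (i, 0)
    avg_inputs = sum(len(inst.srcs) for inst in trace) / max(len(trace), 1)
    avg_degree = sum(uses) / max(len(uses), 1)
    return avg_inputs, avg_degree, dist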

Working set. The working set size is computed separately for the instruction and data streams. The number of unique 32-byte blocks touched and the number of unique 4KB pages touched for both instruction and data accesses are recorded for each execution interval (600M instructions).
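A sketch of the counting itself, assuming the address streams have already been extracted from the trace (working_set is a hypothetical helper): the working set is simply the number of distinct 32-byte blocks and distinct 4KB pages touched.

def working_set(addresses, block=32, page=4096):
    # Distinct 32-byte blocks and distinct 4KB pages touched by one stream;
    # call once for the instruction stream and once for the data stream.
    blocks = {a // block for a in addresses}
    pages = {a // page for a in addresses}
    return len(blocks), len(pages)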

Data stream strides. Data stream strides are considered either local or global. Local strides are determined by capturing the distribution of strides exhibited by each individual load or store in the program, i.e. the distances between the successive memory accesses of that instruction. Frequency counts are maintained for certain stride distances (i.e., the count of local load strides that are < 8, < 64, < 512, etc.). Global strides are similar to local strides, but instead of keeping track of the stride of each load or store in the program, we keep track of the stride between temporally adjacent memory accesses (loads and stores are tracked separately). These vectors can become very large, so memory vectors are not allowed to grow beyond 200,000 elements. All indices are then calculated modulo the maximum vector size [6].
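The following simplified sketch (hypothetical names; it omits the 200,000-element cap and modulo indexing described above) shows the two bookkeeping schemes: local strides are kept per static load or store PC, global strides between temporally adjacent accesses of the same kind, and both are dropped into the < 8, < 64, < 512, ... buckets.

from collections import Counter

BUCKETS = [8, 64, 512, 4096, 32768, 262144]

def bucketize(stride, hist):
    # Count the stride in the first bucket its absolute value fits under.
    for b in BUCKETS:
        if abs(stride) < b:
            hist[b] += 1
            return
    hist["larger"] += 1

def stride_profiles(accesses):
    # accesses: iterable of (pc, address) pairs for one stream (loads or stores).
    last_by_pc, last_global = {}, None
    local, global_ = Counter(), Counter()
    for pc, addr in accesses:
        if pc in last_by_pc:
            bucketize(addr - last_by_pc[pc], local)    # stride per static instruction
        last_by_pc[pc] = addr
        if last_global is not None:
            bucketize(addr - last_global, global_)     # stride between adjacent accesses
        last_global = addr
    return local, global_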

Branch predictability. The most important characteristic of branch predictability is the dynamic predictor performance for the program during execution. To capture branch predictability in a micro-architecturally independent fashion we use the same PPM predictors as Eeckhout et al. [3]. Eeckhout et al. consider four variations of the PPM predictor: GAg, PAg, GAs and PAs. G means global branch history, whereas P stands for per-address or local branch history; g means one global predictor table shared by all branches, and s means separate tables per branch. They emphasize the view that the PPM predictor is a theoretical basis for branch prediction because it attains the upper-limit performance of current branch predictors.
In our experiment, we set the order of the PPM predictor to 13, which is sufficiently large to satisfy this upper-bound requirement.
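As a rough sketch of the idea (not Eeckhout et al.'s exact predictor), a GAg-style PPM predictor of order N keeps one table per context length, predicts with the longest global-history context it has seen before, and updates a small saturating counter for every context length; the PAg/GAs/PAs variants would instead keep the history or the tables per branch address.

def ppm_gag(branches, order=13):
    # branches: iterable of booleans (True = taken) for the dynamic branch stream.
    tables = [dict() for _ in range(order + 1)]   # context length -> {history tuple: counter}
    history, total, mispredicts = (), 0, 0
    for taken in branches:
        total += 1
        pred = True                               # default prediction when nothing matches
        for n in range(min(order, len(history)), -1, -1):
            ctx = history[len(history) - n:]      # the most recent n outcomes
            if ctx in tables[n]:
                pred = tables[n][ctx] >= 0        # counter >= 0 predicts taken
                break
        if pred != taken:
            mispredicts += 1
        for n in range(min(order, len(history)) + 1):
            ctx = history[len(history) - n:]
            c = tables[n].get(ctx, 0)
            tables[n][ctx] = max(-2, min(1, c + (1 if taken else -1)))
        history = (history + (taken,))[-order:]
    return mispredicts / max(total, 1)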

4.2 Interactive Complications

One of the major challenges in performing the kind of phase analysis provided by SimPoint is capturing a representative fraction of the application's execution phases. When there is no set run time for a program because the period of execution is dependent on the whims of a user, the selection of starting and ending trace points could become arbitrary choices within an infinite stream of instructions and an unknown number of phases. To capture a representative area of execution, we use the 'Quick Load' feature of the game applications to script the loading of an interesting game state and further scripting to ensure that tracing occurs a measurable distance after the loading process begins. For Doom 3, 30 billion instructions proves to be about the longest practical trace length, and 500 million instructions are initially skipped in an attempt to bypass the initial loading phase.

Another challenge specific to the user-interactive application domain is the dichotomy between ensuring experimental repeatability and providing reasonable input to the application so as to generate realistic behavior. This contrasts with the deterministic nature of tasks such as rendering (i.e. PBRT and MESA). In general, scripting user input is a valid solution to this problem, but in the case of applications whose behavior depends heavily on real-time input (games) we encounter problems. For basic in-application control and input we use AppleScript, but due to the massive slowdown experienced by traced applications, it is impossible to provide timed inputs that match the input capabilities of the traced application and still produce 'realistic behavior' (i.e. spatial navigation through a Doom 3 level). In general, we compensate for this by simply choosing to trace a game state not dependent solely on user input for interesting behavior. If graphical demands, artificial intelligence, and automatic services are active enough at a certain point in game play, the effect of a brief lack of user input on execution is negligible. To accomplish this, the game is navigated manually to an interesting state (the conclusion of a cinematic with an enemy attack in the case of Doom 3), and then the 'quick save' and 'quick load' features are used to allow tracing to begin with the game in the same precise state every time. Thus we preserve repeatability while still capturing an interesting and realistic slice of program execution.

Our approach to extracting traces of Warcraft III's execution is nearly identical to the method used with Doom 3 as outlined above. The actual tracing parameters are precisely the same, and only minor modifications are needed to the script in charge of setting up the application state and providing user input. As discussed in Section 3.2, one benefit of Warcraft III's game play paradigm over that of Doom 3 in this situation is that the user is mainly in charge of macromanagement. Individual units' actions are mostly controlled by the game's artificial intelligence algorithms. For this reason, given a complex game state examined over a scale of several seconds, there is no reason to require any user input at all (this behavior may even be considered typical). By customizing a multi-player scenario it is possible to create a game state as complex as may be desired. However, for the specific trace analyzed in this paper, we choose to use a default scenario provided by the manufacturers for scripting simplicity; we deem it complicated enough to meet our phase coverage requirements. As with Doom 3, we automate the process of opening the application and loading the desired state, whereupon tracing automatically begins. 30 billion instructions are traced. Again, as with Doom 3, a fixed window of 500 million instructions corresponding to the loading phase is skipped. The resulting basic block vector is processed in the same way, resulting in a detailed trace of the block associated with the execution phase weighted highest by our SimPoint analysis.

4.3 Analyzing Software

AmberBBV is an extension built on the external library of Amber, a performance analysis tool for the PowerPC architecture designed to trace program instructions. The external library provides an interface for analyzing the trace. The trace data is parsed and fed to the BBV module, which produces the BBV profile. To identify a basic block, the program simply checks each parsed instruction to see whether it is a branch instruction. If so, we have reached the end of a basic block and must begin a new one. The address of the first instruction in a basic block is used as the identifier for that block. We keep all the basic block records in a hash table and increment the counter of a specific basic block when an instruction in that basic block is executed (a sketch of this bookkeeping appears after the feature list below). Our BBV module is based on BBTracker, which was released by Calder et al. for producing BBV data with SimpleScalar [11]. The BBV file can then be used for phase analysis and to produce weighted simulation points. The extended features of AmberBBV include:

1. Parsing the trace on the fly to generate BBV records. AmberBBV analyzes the trace data to produce BBV
records. Since traces are usually huge in size, and sometimes all we need is the BBV record file, users can decide whether or not to save the trace while producing the BBV file. A BBV file is produced for each thread.

2. Configuring the trace options for different threads. When examining the traces from games, the significant phases of execution may occur at different times for different threads. However, Amber doesn't distinguish configurations between threads, so it is hard to monitor the trace progress of a particular thread. To provide more sophisticated per-thread analysis, AmberBBV keeps a separate record for each thread, and users can specify the number of skipped instructions for each thread explicitly.
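The per-thread bookkeeping described in the paragraph above the feature list can be sketched as follows (hypothetical names; the real code lives inside Amber's callback interface, and unlike this sketch it also splits blocks that straddle interval boundaries, as described in Appendix C).

from collections import Counter

def collect_bbv(instructions, interval=600_000_000):
    # instructions: one thread's trace as (address, is_branch) pairs.
    intervals, counts = [], Counter()
    block_start, block_len, executed = None, 0, 0
    for addr, is_branch in instructions:
        if block_start is None:
            block_start = addr        # first instruction's address names the block
        block_len += 1
        executed += 1
        if is_branch:                 # a branch ends the current basic block
            counts[block_start] += block_len
            block_start, block_len = None, 0
        if executed >= interval:      # emit one BBV row per interval
            intervals.append(counts)
            counts, executed = Counter(), 0
    return intervals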

Acid is a Macintosh development tool that analyzes the trace files [4]. From an instruction trace it can record and calculate the instruction mix, assembly code, static branch misprediction rates, memory access profiles for both instructions and data, etc. Although Turandot can also be used for gathering information about working sets and data stream strides, we directly process acid's output for simplicity. The assembly code produced by acid is used for verification purposes.

Turandot is a PowerPC processor simulator [7]. It parses a PowerPC instruction trace and simulates the execution of the traced instructions on a user-configured architecture. Since we are measuring microarchitecturally independent characteristics, we treat Turandot mainly as a trace parser, which feeds our module with each instruction's type, address, inputs and outputs. Our module then works separately to analyze the ILP, register traffic, and branch predictability for the trace. To ensure the instructions are fetched in the correct order, we turn on perfect branch prediction in Turandot and simulate the PPM branch predictor in our own module.

4.4 Statistical Analysis Methods

To compare our selected applications' characteristics in a fair and unbiased way, we need an analysis method which removes variable correlation from our data set. To this end we find it appropriate to follow in the footsteps of Eeckhout et al. [3] and use Principal Components Analysis (PCA). PCA is ideal because it allows us to eliminate any variable correlations which would otherwise skew our similarity analysis, while additionally providing a technique for reducing dataset dimensionality with controlled information loss. The second aspect is less critical to our work than it was to Eeckhout et al.'s because we are dealing with only six representative intervals rather than tens of thousands of intervals representing a complete trace of all applications, but it is still a useful property for creating a simple comparison metric.

PCA can best be summarized as a measure of variance for a dataset containing data for many variables. Each member of the dataset is called a case, and PCA provides a way to measure the relative similarity of the cases, based on the variances of the variables across the cases. In our analysis, each case represents one of our traced, SimPoint-selected intervals. (Note that in our example we have traced only one interval for each program thread: two Doom 3 threads, two Warcraft III threads, one PBRT thread, and one MESA thread, for a total of six intervals.) Each variable represents one of the microarchitecturally-independent characteristics detailed in Section 4.1.

By creating Principal Components (PCs) from linear combinations of the original variables, we remove the effect of correlations between the variables. This makes it possible to analyze the data without fear that it has been skewed by any inter-variable relationships. Initially, we form as many PCs as there are original variables. By calculating the variance in the dataset accounted for by each PC, it is then possible to eliminate from further consideration those PCs which do not contribute significantly to the total system variance. This is how we can reduce dataset dimensionality while quantitatively controlling information loss. By examining the way in which each variable contributes to the retained PCs, in terms of the coefficient associated with that variable in a given PC's formative linear combination, it is possible to give a meaningful interpretation to the PCs in terms of the original microarchitectural characteristics. Further details of the PCA process are described below.

As in Eeckhout et al. [3], we put all the data into a matrix, with each row representing a traced interval and each column representing a variable (i.e., a characteristic). Each variable column is normalized to a mean of 0 and a variance of 1. The purpose of this is to prevent variables with large variance due to their innate scale from having undue influence on our results. The resulting matrix is then operated on by the PCA. The output of PCA is another matrix, with each row representing a case and each column representing a Principal Component. We refer to the PC in the first column as PC1, the PC in the second column as PC2, etc. The PCs are ordered by variance in descending order. Consequently, we know PC1 provides the most information about the variance within our data set, followed by PC2, and so on. By setting a desired retained-variance lower-bound threshold we can eliminate the majority of the PCs. For example, from the PCA of our six traces (PCAnalysis I), we select the first two PCs because they account for at least 70% of the variance of all the variables.
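A NumPy sketch of this step under the stated assumptions (our own helper, not the authors' scripts; it assumes no characteristic column is constant): standardize every column, take the singular value decomposition, and keep the leading components until the retained fraction of variance reaches the chosen threshold.

import numpy as np

def retained_pcs(data, threshold=0.70):
    X = np.asarray(data, dtype=float)          # rows: traced intervals, columns: characteristics
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # each column: mean 0, variance 1
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    var = S**2 / np.sum(S**2)                  # fraction of total variance per PC
    k = int(np.searchsorted(np.cumsum(var), threshold)) + 1
    scores = U[:, :k] * S[:k]                  # each row: one interval expressed in PC space
    return scores, var[:k]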

We normalize the selected PCs to have a variance of 1 to complete the removal of any variable correlation effects. We can then represent each case (or traced interval) as a vector, with each retained PC as a component of the vector. Euclidean distance can then be used to compare their similarity. Unlike Eeckhout et al. [3], we do not take the further step of applying clustering techniques to our PCA data; instead we use the selected Principal Components to compare the representative intervals directly.
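Continuing the sketch above, the retained PC scores are rescaled to unit variance and compared with plain Euclidean distance; dists[i, j] is then the (relative) dissimilarity between intervals i and j.

import numpy as np

def pc_distances(scores):
    Z = scores / scores.std(axis=0)            # each retained PC normalized to variance 1
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distance matrix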

To provide further insight into the domain space and analysis technique we perform a second PCA analysis (PCAnalysis II). This time we incorporate not only our six new traces, but also data from fifteen of the cases studied in Eeckhout et al. [3]. These additional cases were gathered from the SPEC CPU2000 benchmark suite. We choose to include all the floating-point benchmarks so as to provide a wide spread of program behavior domain space against which we can compare our own traces. To produce valid comparisons between the two data sets, we choose a subset of 41 of the original characteristics which we believe should not be affected by the differences between our methodology and that of Eeckhout et al. [3]. Specifically, we were not able to directly compare working set and some instruction mix statistics. In this case, our PCA indicates that we should retain the first 4 PCs, which account for 70% of the variance across all the variables.

It is important to realize that the similarity metric provided by PCA is not absolute, but instead a relative measure. That is, the distance between the same pair of data points in two different PCAs will be different if the other cases included in the PCA are changed. This is because each PCA provides a metric only in terms of those cases which are included in that PCA. The power of PCA is that it allows us to determine which cases are most or least similar, while simultaneously gaining insight into which variables are the root cause of said similarity.

5 Results

First we show selected results for each category in Section 4.1 and compare those results individually. Then we perform a PCA (PCAnalysis I) to show the composite similarities across our six threads. In the final PCA (PCAnalysis II), we include results from Eeckhout et al. [3].

5.1 Instruction Mix

Figure 1 shows a breakdown of the instruction mix used by each of our six threads. By far, integer operations are the most common for all threads. MESA and PBRT include the largest proportion of floating-point operations, much more than either of the game threads. This may be due to the game programs' use of the graphics processor. Probably for the same reason, the games make much greater use of vector and cache control operations as well.

5.2 Instruction-level Parallelism (ILP)

Figure 2 shows that all threads exhibit relatively large amounts of ILP. The differences seem relatively small for a window size of 32, but as the window size grows, the Doom threads seem to hit a limit, indicating a relatively lower level of ILP than any of the other four threads.

5.3 Register Traffic

Register traffic characteristics include the average number of input operands per instruction, the average degree of use, and the register dependency distance. The average number of input operands was similar for each thread, ranging from 1.30 for Doom 3 Thread 2 to 1.57 for MESA. The average degree of use varied slightly more, from 1.11 for Doom 3 Thread 1 to 1.73 for PBRT.

The register dependency distance results are shown in Figure 3. Once the distance exceeds 2, the shape of the bar graphs and their relationship is very similar, with a small switch between MESA and PBRT in the < 8 column. It should be noted that there are still a fair number of references that have a register dependency distance greater than 64 for all of the threads, especially Doom 3 Thread 1, which seems to reach a limit somewhere just above 40%.

5.4 Working Set

The working set comparison shown in Figure 4 illustrates some of the biggest differences between the applications. Doom 3 Thread 2 clearly performs most of that game's memory accesses, in both the instruction and data streams. Also, Doom 3 Thread 2 and MESA clearly have significantly larger working set sizes than any of the other threads.

5.5 Data Stream Strides

Figure 6 shows the relationship between global and local data strides for each thread. Doom 3 Thread 1 has a majority of its global load strides at 64 or less, while the other threads have only 60% of their global load strides at 4K or less. The majority of global store strides are 64 or less, with MESA standing out as having global store strides that keep rising. Almost 80% of local load strides
are 64 or less for all programs, but reaching the 80% mark for local store strides requires a stride of 512 or less.

5.6 Branch Predictability

All of the threads are very predictable, as demonstrated in Figure 5. The GAg predictor has the highest misprediction rates because there are more conflicts in this predictor. Doom 3 Thread 2 is the least predictable of the threads by a relatively large amount, but is still very predictable, with its worst misprediction rate just above 1%.

5.7 PCA I: Doom, Warcraft, PBRT, MESA

Figure 7 shows the comparison of the vectors representing the six threads: one each for PBRT and MESA, and two each for Doom and Warcraft. Only the comparison in the PC1/PC2 domain space is shown here because these two components account for at least 70% of the variability according to the PCA. The coefficients of each PC are shown in Figure 8. Because two components capture such a high amount of the variability, the relative "difference" between these applications is captured by how far apart they are in the graph.

Our results indicate that Doom 3 and Warcraft III are the least similar of our traced applications. Even the way work is divided between the major threads appears to be disparate: while both Warcraft threads exhibit very similar characteristics, the two Doom threads are furthest apart when plotted in PC1/PC2 space. This dichotomy likely springs from the huge difference between the Doom 3 threads in terms of their use of vector operations as well as their widely differing working set sizes.

MESA, PBRT, and both Warcraft threads appear to be nearly identical with regard to our first principal component, and more divergent in terms of the second. This reflects a similarity between those programs' characteristics which are assigned a large coefficient value by the PCA. Compared to Doom 3 and MESA, PBRT is most similar in its behavior to WarCraft III. However, Doom 3's behavior is not similar to either PBRT or MESA. One possible explanation is that Doom 3 succeeds in offloading almost all graphical computations to the GPU, and makes heavy use of vector operations.

PCA enables us to see relative values for several characteristics of a particular thread by inspecting the coefficient graph. As an example, consider MESA. It has a very high PC2 value, which means it has low misprediction rates, high levels of ILP at larger window sizes, large numbers of input operands and a high degree of use, many floating-point operations but few vector or cache operations, small global store strides, large local strides, and small working set sizes other than the 32-byte data blocks. (All of these are relative to the other threads included in the analysis.) The sign of a coefficient indicates whether the corresponding raw data value is larger or smaller, and the magnitude of the coefficient indicates the degree of largeness or smallness.

5.8 PCA II: Including SPEC2000, Alpha-ISA Data

Because of the relative nature of PCA, we thought it useful to attempt to compare a number of other benchmarks to those used in this study. We also wanted to see how our results compare to those in the previous literature, specifically those presented in Eeckhout et al. [3]. There were several concerns when attempting this comparison. First, the data we have from Eeckhout et al. is aggregate data over many intervals of 100M instructions, whereas our data is over one large 600M interval. To compensate for this we removed characteristics that were influenced by the interval size and relied on the SimPoint methodology to have identified a representative point in our applications. Second, we are analyzing traces from a dynamic run of the applications in question, whereas Eeckhout et al. simulate large portions of the programs. Third, we are using a different instruction set architecture and compiler than Eeckhout et al. In the final analysis, we determined that these latter two issues dominate the comparison below. This determination is largely because of the great disparity between our MESA and Eeckhout's MESA in the PCA. We did compute the distance in the four PC dimensions between our game threads and all of the other data points. The open-source thread closest to three of the game threads is PBRT. The exception is Doom 3 Thread 1, which is relatively far away from all other threads, but closest to one of Eeckhout's data points for a SPEC benchmark. This leads us to believe that, if the differences that do exist are taken into consideration, PBRT has the most potential of these programs for shedding light on micro-architectural enhancements for these games (particularly Warcraft III). The lack of compatibility between the MESA results highlights the benefit of continued research into characteristics and methodologies that can be used across ISAs, compilers, and, ideally, tracing and simulation methodologies.

6 Summary and Conclusions

This paper identifies salient characteristics of two games, Doom 3 and Warcraft III, and compares them to two open-source programs, PBRT and MESA (the latter from the SPEC2000 benchmark suite).

Figure 1: Instruction Mix. (Bar chart: instruction types (Loads, Stores, Branches, Vector, Int Ops, FP Ops, Control) vs. percentage, one series per thread: Doom 3 Threads 1 and 2, WarCraft III Threads 1 and 2, MESA, PBRT.)

Figure 2: Instruction Level Parallelism. (Window size (32, 64, 128, 256) vs. ILP, one series per thread.)

Figure 3: Register Dependencies. (Register dependency distance buckets (=1, <2, <4, <8, <16, <32, <64) vs. percentage, one series per thread.)

Figure 4: Working Set Comparison. (Memory access category (Instruction 32B, Instruction 4KB, Data 32B, Data 4KB) vs. blocks touched, log base 10, one series per thread.)

Figure 5: Prediction Comparison. (Predictor type (GAg, PAg, GAs, PAs) vs. misprediction rate, one series per thread.)

Figure 6: Data Stream Stride Comparison

Figure 7: PCAnalysis I: Doom, Warcraft, PBRT, and MESA. (Each thread's representative interval plotted in the PC1/PC2 plane.)

Figure 8: PCAnalysis I: Coefficients for PC1 and PC2. (PC factor loadings for the branch prediction, ILP, register traffic, instruction mix, data stride, and working set characteristics.)

Figure 9: PCAnalysis II: Combined Meng and Eeckhout data, PC1 and PC2 only. (Our six threads plus the Eeckhout et al. MESA and SPEC points in the PC1/PC2 plane.)

Figure 10: PCAnalysis II: Combined Meng and Eeckhout data, PC3 and PC4 only.

Since the game programs are multi-threaded, we analyze two threads from each game and one thread from each of the others. Detailed comparisons are given for each of the six main categories of microarchitecturally-independent characteristics. A few highlights are: 1) The game programs use more specialized instructions than MESA and PBRT, probably because they are better optimized for the architecture. 2) All threads show a fair amount of ILP, but as window size grows the Doom threads seem to hit a limit around 10. 3) All of the threads have register dependencies that exceed 64 instructions. 4) The Doom threads are highly dissimilar, while the Warcraft threads are very similar. For example, nearly all the memory blocks touched by Doom are touched by Thread 2, in both the instruction and data accesses, while the memory blocks touched by the Warcraft threads are more comparable and much smaller in number. 5) Global load and local store strides are bigger, in general, for all threads than global store and local load strides. 6) All threads are very predictable in theory, with Doom 3 Thread 2 being the least predictable. Our Principal Component Analysis shows the most commonality between PBRT and the Warcraft III threads, and the largest disparity between MESA and the Doom 3 threads.

In the final analysis, we determine that the games are different enough from our two initial candidates to warrant analyzing more games and more open-source programs to track down similarities. There is almost infinite variety within the game genre itself, including first-person shooters vs. real-time strategy vs. sports simulations, different types of game logic, different amounts of physical simulation, different amounts of GPU offload, different amounts of optimization, etc.

The similarity between MESA/PBRT and the games, compared to the diversity among the game threads, suggests just how noisy this space is. This difference seems similar to the variance among the games themselves, and PBRT, especially, seems much more similar to these games than any of the SPEC benchmarks, suggesting that optimizing for MESA/PBRT is at least going to lead architects in the right direction. It certainly illustrates the need to understand this space better and develop better benchmarks. No one benchmark, or even a small set, will be representative of the game space: it will require many benchmarks.

Games are fundamentally important in the marketplace and a major design driver for CPUs, not just GPUs, yet academic architects are mostly unable to study them because we lack benchmarks and hence lack insight into games' requirements. This work is a first step toward addressing this shortfall and elucidating some research directions that would allow academics to play a relevant role in this expanding market.

7 Future Work

Future work in this area falls into three general categories: evaluating more games and more benchmarks to discover similarities, refining characteristics and methodology to allow comparisons across ISAs and compilers, and expanding these results by using full-speed dynamic runs with performance counters to collect pertinent data.

Acknowledgments

This work is supported in part by the National Science Foundation under grant nos. CCR-0133634 (NSF CAREER award) and CNS-0340813, and by a grant from Intel MRL. Great appreciation goes to Lieven Eeckhout for the use of his SPEC2000 data and many clarifications. We would also like to thank Yingmin Li and Karthik Sankaranarayanan for assisting us with Turandot, and Greg Humphreys for his helpful input.

8 Appendix A: Raw Data of Microarchitecturally-independent Characteristics

See Figure 11.

9 Appendix B: BBV File Format

A BBV file contains per-interval, per-basic-block statistical information for each thread.

Each line represents an interval. After a T as the beginning flag, each line is divided into several word blocks in the format :B:N; each word block gives the number of instructions executed in that basic block during the current interval. B is the basic block ID and N is the number of instructions for that basic block. The basic block information is sorted in ascending order by block ID. A typical BBV line with an interval size of 100:

T:1:34 :3:12 :4:54
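For illustration, a small parser for this line format (a hypothetical helper, not part of AmberBBV) reads one T line into a dictionary mapping basic block IDs to instruction counts.

def parse_bbv_line(line):
    # "T:1:34 :3:12 :4:54" -> {1: 34, 3: 12, 4: 54}
    assert line.startswith("T")
    counts = {}
    for field in line[1:].split():     # each field looks like ":B:N"
        _, block_id, n = field.split(":")
        counts[int(block_id)] = int(n)
    return counts

print(parse_bbv_line("T:1:34 :3:12 :4:54"))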

10 Appendix C: AmberBBV Specifications

AmberBBV enables the user to generate a BBV file on the fly, with a specified interval length. The BBV file can then be used for phase analysis to generate simulation points. AmberBBV configures and generates BBV files per thread; each BBV file is named in the same way as the corresponding trace file. Each BBV file contains lines of intervals with basic block statistics.

AmberBBV contains three modules:

The interface with the Amber kernel, which initializes AmberBBV according to the user's commands and accepts trace data from the Amber kernel.

The trace parser, which takes in the trace data from the Amber kernel, analyzes it based on the opcode, gathers statistical information for each thread, and passes the information to the BBV tracker. The statistical information and its usage are as follows: 1. The per-thread number of instructions caught by Amber: if the user specified a number of instructions to skip for a thread, the program uses this per-thread count to determine when to start writing the trace file. It is also used for printing out the tracing status on the fly. 2. The per-basic-block number of instructions caught by Amber; the BBV tracker is the consumer of this information.

The BBV tracker is based on BBTracker, released by Calder et al. [11], but is extended for multi-threaded programs. It builds a hash table entry for each basic block, with the starting PC as the key value. Each time the trace parser encounters the end of a basic block, it sends the basic-block-level instruction count to the BBV tracker, which adds it to the BBV statistics counter for the current interval. If a basic block turns out to cross two intervals, it is automatically split in the BBV tracker, so the BBV tracker is accurate.
Note that in [11], the basic block instruction count is always added to the current interval, so it is an approximation. In most cases this doesn't affect the phase analysis, but when a basic block has a significant size compared with the interval size, it may introduce more error into the phase analysis.
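The splitting refinement can be sketched as follows (hypothetical helper names): when a finished basic block would carry the running instruction count past an interval boundary, only the portion that falls inside the current interval is credited to it before the interval is emitted, and the remainder rolls into the next one.

def credit_block(counts, finished, block_id, block_len, executed, interval=600_000_000):
    # counts: Counter for the current interval; finished: list of completed BBV rows;
    # executed: instructions already counted in the current interval.
    remaining = block_len
    while executed + remaining >= interval:
        in_this = interval - executed          # portion of the block inside this interval
        counts[block_id] += in_this
        finished.append(counts.copy())
        counts.clear()
        remaining -= in_this
        executed = 0
    counts[block_id] += remaining
    return executed + remaining                # running instruction count in the new interval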

11 Appendix D: Turandot Modifications

Turandot is modified to capture microarchitecturally-independent characteristics. Four versions of Turandot are generated; each is responsible for capturing one type of information: working set, ILP, register traffic, or branch predictability. We insert our code into the main loop of Turandot's simulator. We use Turandot mainly as a parser of the PowerPC instruction set. Every clock cycle, the decoded instruction queue is checked for newly fetched instructions. Since the instructions are in order at this stage, we are able to feed our analysis code with the instructions in sequence. No-ops are ignored. Turandot provides information on instruction address, register inputs and outputs, memory access address, and instruction type. With this information, we are able to perform our analysis.

References

[1] Kyle Bennet. The official Doom 3 hardware guide. Jul. 2004.

[2] Chris Connolly. Almost cinematic graphics: NVIDIA GeForce FX 5600. Page 7, Apr. 2003.

[3] Lieven Eeckhout, John Sampson, and Brad Calder. Exploiting program microarchitecture independent characteristics and phase behavior for reduced benchmark suite simulation. In IISWC05, Oct. 2005.

[4] Apple Computers Inc. Computer hardware understanding developer tools. Jan. 2004.

[5] Greg Kasavin. Doom 3 for PC review. Aug. 2004.

[6] J. Lau, S. Schoenmackers, and B. Calder. Structures for phase classification. In Proceedings of the 2004 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS04), Mar. 2004.

[7] Jaime H. Moreno and Mayan Moudgill. Turandot user's guide. IBM Research Report RC 21968, Feb. 2001.

[8] Erez Perelman, Greg Hamerly, and Brad Calder. Picking statistically valid and early simulation points. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, Sep. 2003.

[9] A. Phansalkar, A. Joshi, L. Eeckhout, and L. K. John. Measuring program similarity: Experiments with SPEC CPU benchmark suites. In Proceedings of the 2005 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS05), pages 10–20, Mar. 2005.

[10] Matt Pharr and Greg Humphreys. Physically Based Rendering: From Theory to Implementation. Elsevier Science and Technology Books, 2004.

[11] Timothy Sherwood, Erez Perelman, and Brad Calder. Basic block distribution analysis to find periodic behavior and simulation points in applications. In International Conference on Parallel Architectures and Compilation Techniques, Sep. 2001.

Figure 11: Raw Data
