



A BENCHMARK CHARACTERIZATION OF THE EEMBC BENCHMARK SUITE

Jason A. Poovey and Thomas M. Conte, Georgia Institute of Technology
Markus Levy and Shay Gal-On, Embedded Microprocessor Benchmark Consortium


BENCHMARK CONSUMERS EXPECT BENCHMARK SUITES TO BE COMPLETE, ACCURATE, AND CONSISTENT, AND BENCHMARK SCORES SERVE AS RELATIVE MEASURES OF PERFORMANCE. HOWEVER, IT IS IMPORTANT TO UNDERSTAND HOW BENCHMARKS STRESS THE PROCESSORS THAT THEY AIM TO TEST. THIS STUDY EXPLORES THE STRESS POINTS OF THE EEMBC EMBEDDED BENCHMARK SUITE USING THE BENCHMARK CHARACTERIZATION TECHNIQUE.

Benchmarking is as much an art as it is a science, and at times involves a leap of faith for the benchmark consumer. Most people trust benchmark suites to be complete, accurate, and consistent. By default, when used to compare one processor to another, benchmark scores serve as a relative measure of performance. Rarely do benchmark consumers understand or verify the underlying stresses that the benchmarks place on the processors. This study examines the stress points of the Embedded Microprocessor Benchmark Consortium (EEMBC) suite, using the benchmark characterization technique.

Benchmark characterization involves finding a set of unique characteristics to classify benchmarks.1 These characteristics should be predictive of performance on a wide variety of platforms, and should allow benchmark consumers to find good proxies for their own applications in the benchmark set. The technique aims to find just the right characteristics to reveal the benchmarks' required hardware capacities. Thus, system designers can use benchmark characterization to classify benchmarks and to discover if two benchmarks stress the system in similar or different ways.

Core concepts

Processor performance depends on the types of workloads used, and designers typically use benchmark suites with the initial assumption that the workloads they provide are representative of user programs. EEMBC benchmarks represent a wide variety of workloads and provide an industry standard tool for understanding processor performance. However, few people truly understand what these benchmarks are doing. EEMBC benchmarks attempt to represent user programs from several embedded disciplines. By understanding these benchmarks' inherent program behavior, system developers can make comparisons to their own applications.

Designers and researchers can use benchmark characterization to parameterize a workload so that it is quantifiable by several abstract attributes. They then can use the measured attributes to indicate program similarity. Using these characteristics, along with inherent knowledge of the actual application target for a system, helps them select the most relevant benchmarks from the EEMBC suites.

Conte and Hwu argue that benchmark characterization should use metrics that reflect a benchmark's preferred architectural design point.1 Other researchers have performed similar analyses using many different techniques.2-6 Trace simulation, statistical analysis, binary instrumentation, and on-the-fly analysis all attempt to capture inherent program characteristics. For example, a designer can run simulations to find the natural cache size such that the cache encounters only compulsory misses (that is, misses that occur when the program first requests data), but never misses because of limited capacity. The instruction set also greatly affects benchmark behavior, so it is useful to determine the mix of function units in the processor pipeline that causes the benchmark to run most efficiently. In this article, we use several microarchitecture-dependent and independent attributes to characterize the EEMBC benchmark suites. Because microarchitecture-dependent and independent attributes are simply measurements of system events, our research also incorporates target-directed performance attributes. Thus, we characterize the EEMBC benchmarks not only on program behavior, but also on the hardware requirements needed to achieve specified performance targets. Using a combination of these elements better equips designers to make educated design choices when developing new systems.

The EEMBC benchmark suites analyzed

The EEMBC benchmark suites are a collection of algorithms and applications organized into various categories to target a broad range of users. We derive the details presented here from seven of these suites: automotive, consumer, digital entertainment, networking, networking v2.0, office automation, and telecommunication. Other suites not tested as of this writing include the recent office automation 2.0 as well as a recently released comprehensive suite for testing multicore platforms.

Table 1 shows examples of the types of workloads available in each suite. EEMBC has improved several suites by creating second versions (for example, the second versions of the networking and consumer suites are networking v2.0 and digital entertainment, respectively). In this study, we test both versions when possible.

Table 1. Example benchmarks from the EEMBC benchmark suites.

Automotive: finite and infinite impulse response (FIR and IIR) filters, tooth-to-spark tests, pulse-width modulation, matrix multiplication and shifting, table lookup, and fast Fourier transform (FFT).
Consumer: JPEG compression and decompression, high pass gray-scale filter, RGB to CMYK, and RGB to YIQ converter.
Digital entertainment: JPEG compression and decompression, high pass gray-scale filter, RGB to CMYK, RGB to YIQ converter, Advanced Encryption Standard (AES), and Data Encryption Standard (DES).
Networking: packet flow algorithms, Open Shortest Path First (OSPF), and route lookup.
Networking v2.0: packet check algorithms, OSPF, RSA, and network address translator (NAT).
Office automation: dithering, rotate, and text.
Telecommunications: autocorrelation, FFT, and Viterbi decoder.

Some benchmarks are included in multiple suites. Although the tasks performed in common benchmarks are similar, the code is not always identical. The benchmark suite designers place similar benchmarks in multiple categories because some algorithms are used in multiple embedded disciplines. For example, fast Fourier transforms are useful in both automotive and telecommunications applications, so both suites include the FFT algorithm. Application developers might wish to test only the suite relevant to their application and avoid scouring unrelated suites for the most compatible workload.

Developing the methodology

To generalize the benchmark characterization results, we devised a method to describe application behavior independently of the underlying platform. We collected EEMBC benchmark characteristics from the MIPS, PowerPC, ARM, PISA, and x86 architectures. This provided a mix of instruction set architectures (ISAs), different processor design methodologies (complex-instruction-set computing, reduced-instruction-set computing, low power, and small size), and several cross-compilers. For all compilers, we used the default optimization options (-O2 or equivalent) to level the field and not give an unfair advantage to the cross-compilers that support more optimizations. The resulting combination of metrics provides an accurate representation of the workload's activity. Thus, designers can determine which workloads are similar to their own, and then leverage the most relevant benchmarks as a proxy for architecture design. The characteristics themselves assist in this design by determining the minimum hardware requirements needed to achieve the target performance.

We achieved characterization primarily through trace-driven simulation. We used traces to collect data from cache design experiments and the dynamic instruction distribution. We performed all cache performance experiments using a single-pass cache simulator that simultaneously evaluated the performance of multiple cache configurations using the least recently used (LRU) stack algorithm.7 This method is similar to the cache analysis in many previous works.7-9 We gathered function unit requirements by simulating an idealized MIPS superscalar machine. To remove all effects other than true dependencies, the idealized machine assumes perfect branch prediction, a perfect cache, and infinite issue width. We tracked usage throughout execution to determine the minimum number of function units necessary to satisfy the function unit needs 85 percent of the time (85 percent utilization).
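To make the single-pass idea concrete, the sketch below builds an LRU stack-distance profile in one pass over a trace of block addresses; the miss ratio of every fully associative LRU capacity can then be read from the resulting histogram. This is a minimal illustration written for this article, not the simulator used in the study, and the trace format and function names are assumptions.

```python
from collections import defaultdict

def stack_distance_profile(block_addresses):
    """One pass over a block-address trace, recording LRU stack distances.

    Returns (histogram, cold_misses). histogram[d] counts references that
    found their block at depth d in the LRU stack (d = 0 is the most
    recently used block), so a fully associative LRU cache of N blocks
    misses on every reference with d >= N, plus all cold misses.
    """
    stack = []                    # most recently used block kept at the end
    histogram = defaultdict(int)
    cold_misses = 0
    for block in block_addresses:
        if block in stack:
            depth = len(stack) - 1 - stack.index(block)
            histogram[depth] += 1
            stack.remove(block)
        else:
            cold_misses += 1      # compulsory miss: first touch of this block
        stack.append(block)
    return histogram, cold_misses

def miss_ratio(histogram, cold_misses, cache_blocks):
    """Miss ratio of a fully associative LRU cache holding cache_blocks blocks."""
    total = sum(histogram.values()) + cold_misses
    misses = cold_misses + sum(c for d, c in histogram.items() if d >= cache_blocks)
    return misses / total
```

Feeding the same trace at different block granularities (for example, `[a >> 5 for a in byte_addresses]` for 32-byte blocks) covers the block-size dimension of the design space with the same pass structure.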

It is sometimes difficult to understand which characteristics are inherent to the workload, and which are microarchitecture dependent (affected by the compiler, ISA, or hardware design). Some researchers look at characteristics that are common, assuming a specific ISA,4 whereas others use analytical or statistical code analysis. We generated results using different compilers and architectures, and chose only characteristics that produced similar results on all four architectures and tool chains.

Figure 1 shows an example result summary from the cache simulation for a single benchmark. The y-axis denotes the minimum required cache size for a given set associativity and desired miss rate. The full results for the benchmark include any miss rate and associativity, but even this summary shows that including a set associativity of 2, 3, or 4 produces results that are clearly architecture or compiler dependent. In contrast, for this benchmark, fully associative (not shown) and direct-mapped results are similar for required miss rates of 1 or 0.1 percent. The full results therefore show that a miss rate of 1 percent or better gives the most architecture- and compiler-independent results.

Caches are defined in terms of block size, set associativity, and total size. Cache performance can vary significantly if any of these variables change. Therefore, performing an entire design space search is often costly because each cache configuration of interest requires additional simulation. However, a single-pass simulator simultaneously evaluates caches within a specified range of these three variables. Thus, rather than picking particular cache configurations, this tool isolates cache configurations that meet user-specified performance goals. We use the single-pass simulator to characterize workloads in terms of hardware requirements, rather than simply measuring the performance of a particular cache realization. We chose miss ratios of 1 and 0.1 percent as target performance goals to identify cache sizes that would achieve the desired performance for level 1 and level 2 caches, respectively.7


If cold misses cause a cache configuration to have a higher miss rate than the user-specified target, we target the intrinsic miss ratio.
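Building on the profile from the earlier sketch, the following hypothetical helper turns a performance target into a capacity requirement: it returns the smallest fully associative LRU capacity, in blocks, whose miss ratio meets a target such as 1 or 0.1 percent, falling back to the intrinsic (cold-miss) ratio when compulsory misses alone exceed the target. It is an illustration under those assumptions, not EEMBC tooling.

```python
def min_blocks_for_target(histogram, cold_misses, target_miss_ratio):
    """Smallest fully associative LRU capacity (in blocks) meeting the target.

    If compulsory misses alone exceed the target, the intrinsic miss ratio
    is the best achievable, so the target is relaxed to that floor.
    """
    total = sum(histogram.values()) + cold_misses
    max_depth = max(histogram) if histogram else 0
    intrinsic = cold_misses / total
    effective_target = max(target_miss_ratio, intrinsic)

    misses = cold_misses + sum(histogram.values())
    for blocks in range(0, max_depth + 2):
        if blocks > 0:
            misses -= histogram.get(blocks - 1, 0)  # references at depth blocks-1 now hit
        if misses / total <= effective_target:
            return blocks
    return max_depth + 1
```

Multiplying the returned block count by the block size then gives the cache sizes reported for the 1 percent (level 1) and 0.1 percent (level 2) targets.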

When designing a new system, designers must decide how many of each function unit type to include. Rather than iteratively modifying the number of function unit types and resimulating, we perform a single simulation to determine the optimal distribution. The function unit distribution simulates an idealized out-of-order machine of infinite width and a perfect branch predictor, thus making true dependencies between instructions the only bottleneck. At each clock tick, the simulator records the number of function units of each type required to service all in-flight instructions in the execute stage. In this study, the distribution results represent the number and type of function units necessary to meet workload demands for 85 percent of the execution time. Or, more simply, how many units of a given type are needed so that lack of those units will not limit workload execution 85 percent of the time. This target point allows for the best cost-benefit performance.10 For this study, we assume five function unit types: integer ALUs, load/store units, multiply/divide units, branch units, and shift units.
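A hedged sketch of that last step: given per-cycle demand counts from such an idealized simulation (the input format below is an assumption), the 85 percent requirement for each function unit type is simply the 85th percentile of that type's per-cycle demand.

```python
import math

FU_TYPES = ("ialu", "lsu", "multdiv", "branch", "shift")

def units_needed(per_cycle_demand, coverage=0.85):
    """per_cycle_demand: one dict per simulated cycle, mapping a function unit
    type to the number of units needed that cycle. Returns, per type, the
    smallest unit count that satisfies demand in at least `coverage` of the
    cycles (the 85 percent utilization target)."""
    if not per_cycle_demand:
        return {fu: 0 for fu in FU_TYPES}
    cycles = len(per_cycle_demand)
    idx = max(math.ceil(coverage * cycles) - 1, 0)   # index of the 85th-percentile cycle
    return {
        fu: sorted(cycle.get(fu, 0) for cycle in per_cycle_demand)[idx]
        for fu in FU_TYPES
    }
```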

The analysis

The function unit distributions demonstrate the hardware requirements needed to achieve near-maximum parallelism on an idealized machine. We visualize this data using Kiviat graphs. Kiviat graphs plot multiple variables on separate axes using a spider plot to provide a graphical representation of the simulation metrics. This makes it easier to visualize multivariable data in a way that reveals program behavior.11 Figure 2 summarizes the function unit experiment results and groups them by EEMBC suite to show each suite's overall average tendencies. As the figure shows, each suite's behavior is unique, which is expected because they target different user audiences. This uniqueness is also significant because it illustrates each suite's application-specific nature and suggests that more than one suite must be executed to comprehend a particular processing platform's full capabilities.

[Figure 1. Cache summary example for the packet check 1-Mbyte benchmark. The chart plots the minimum required cache size (2^C bytes) against set associativity (2^S, for S = 0 through 4) at target miss rates of 1 and 0.1 percent for the MIPS, PPC, ARM, and x86 builds.]

The networking v2.0 suite has more function unit requirements than the first version, indicating that it offers greater instruction-level parallelism. The automotive suite exhibits similar strains on function units as the networking v2.0 suite; however, networking v2.0 has slightly higher branch instruction requirements. Therefore, computationally the dependency chains are similar, but the branches occur more frequently in networking v2.0. This behavior could be due to tight loops for kernel processing, such as the tree traversal in the route lookup benchmark, as well as numerous conditional clauses, such as in the packet check benchmark.

[Figure 2. Kiviat graphs of function unit distribution for each benchmark suite at 85 percent utilization: (a) automotive, (b) consumer, (c) digital entertainment, (d) networking, (e) networking v2.0, (f) office automation, and (g) telecom. Each panel plots integer ALU, load/store, multiply/divide, branch, and shift unit requirements on axes running from 0 to 8.]

Higher numbers of load/store function units benefit all suites but telecommunications. The lower requirement is not due to a lack of memory instructions, because as the instruction distribution data in Figure 3 indicates, an average number of memory instructions occur in the telecommunications suite execution stream. Rather, this number is due to the high data interdependence in array lookups. For example, the convolutional encoding algorithm uses a central shift register, which is both updated and read within the main algorithm loop, and the fxp bit allocation (fbital00) updates a threshold variable used for bit allocation repeatedly within the main loop.

Our analysis found that the digital entertainment benchmarks, which require four execution units, are the primary stressors of shift function unit requirements. Many digital entertainment benchmarks, such as JPEG compression and decompression and the Huffman decoder algorithm, use shifting heavily. Both the consumer and digital entertainment suites contain similar JPEG manipulation algorithms. Thus, the increase in shift requirements results primarily from the Huffman decoder benchmark. This benchmark uses shifting for traversal through the large input buffer, and for simplifying some calculations into bitwise operations instead of slower multiplication or division operations. The latter optimization creates increased parallelism opportunities. Part of calculating where to look in the buffer involves a bitwise AND with a binary value that is a power of two. To calculate these values, the program shifts the constant 1 left by the appropriate variable. These calculations do not depend on one another, and many values are constants. Therefore, these shifts should have no dependencies preventing instruction parallelism, and thus the Huffman decoder benchmark requires more function units to meet utilization goals.
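As a tiny illustration of the pattern described above (not code from the EEMBC Huffman decoder itself), building the mask with a shift makes each buffer lookup independent of the others:

```python
def window_bits(buffer_word: int, start_bit: int, width: int) -> int:
    """Extract `width` bits starting at `start_bit` using shift-and-mask.

    The mask (1 << width) - 1 stands in for a multiply or divide by a power
    of two; each extraction depends only on its own inputs, so several of
    them can issue in parallel on independent shift units.
    """
    mask = (1 << width) - 1
    return (buffer_word >> start_bit) & mask
```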

[Figure 3. Dynamic instruction distribution. For each suite (automotive, consumer, digital entertainment, networking, networking v2.0, office, and telecom), a stacked bar shows the percentage of ALU, shift, multiply/divide, load, store, branch, coprocessor operation, nop, special, and unknown instructions.]

The office automation suite exhibits a unique behavior, in which only the ALU and LSU exhibit a high degree of parallelism. Office benchmarks such as bezier and dither use fixed-point math extensively, and the main loop has many calculations involving constants and a single variable. In addition, little interdependence exists among these statements, so the parallelism possibilities are higher. Therefore, we could attribute some of the ALU and LSU parallelism to the many parallel variable loads conjoined with the arithmetic calculation using a constant.

Only the telecom suite requires many multiply and divide units to achieve the utilization goals. The instruction distribution results in Figure 3 show that it is the only suite with a significant number of multiply and divide instructions for several benchmarks in the execution stream. These originate from the autocorrelation benchmark (three data sets) and the fft benchmark (three data sets). The autocorrelation benchmark, at the crux of the algorithm, consists of an integer multiplication of two array entries. Because these variables are unknown at compile time, the compiler cannot make any simplifications. In addition, because this multiplication is a crucial part of the algorithm, it results in a significant number of multiply instructions. Examination of the object file confirms the existence of a mult instruction in the autocorrelation main loop that could not be optimized. Both the fft and aifft benchmarks from the automotive suite use the calculation-heavy fft algorithm. As Figure 4 illustrates, these benchmarks are similar except for the branch misprediction ratio. The primary difference between the code in the two algorithms is the structure of the conditional clauses for the loops and if-then structures that control the algorithm's flow.
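For intuition, a generic integer autocorrelation kernel looks roughly like the following (an illustrative sketch, not the EEMBC autcor00 source); the product of two run-time array values in the inner loop is the multiply the compiler cannot strength-reduce away:

```python
def autocorrelation(samples, max_lag):
    """Integer autocorrelation: r[k] = sum over i of samples[i] * samples[i + k]."""
    n = len(samples)
    result = []
    for lag in range(max_lag):
        acc = 0
        for i in range(n - lag):
            acc += samples[i] * samples[i + lag]  # data-dependent multiply each iteration
        result.append(acc)
    return result
```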

Figure 2 illustrates workload variety between the different benchmark suites, and further analysis shows significant variety internal to the particular workload categories. Figure 4 examines each benchmark individually by plotting their workload characteristics on Kiviat plots. As mentioned earlier, these plots allow for a better visualization of each benchmark's architectural characteristics. Table 2 denotes the meaning of each axis on the Kiviat plots.

Similar workloads have similar shapes in the Kiviat graphs. Within the figure, similar workloads' Kiviat graphs are grouped by visual similarity. Within each group, the similar shapes indicate that the relative emphasis among characteristics is similar; however, the emphasis's overall weight varies among benchmarks in the same group. As the figure shows, no single EEMBC suite exhibits completely homogeneous characteristics. Even the consumer suite, which targets applications with similar algorithms (image manipulation), spans two classification categories.

The most heterogeneous suite (automotive) covers four classification categories. For example, the FFT benchmark (aifft) has large cache requirements, whereas the infinite impulse response filter benchmark (iirflt) does not. The pointer chase benchmark (pntrch) exhibits a high percentage of memory accesses, but does not require a significant cache size to obtain the desired performance goals. This is unsurprising for the automotive suite because the algorithms and applications executing in the different parts of a vehicle vary greatly and would require different attributes for specialized processors.

The office automation suite is interesting in that it contains only three benchmarks, each classified into a different category. The rotate image benchmark is in a grouping (group 1) that is insensitive to increasing block sizes, but the cache size requirements are still similar to those of the other two office benchmarks. Therefore, the rotate benchmark represents workloads with less spatial locality. The text benchmark is in a group with a low percentage of ALU computations and focuses more on variable movement. The main algorithm is simply a large nested switch statement that updates few variables, typically with constants or simple calculations. Finally, the dither benchmark is similar in instruction mix and instructions per cycle (IPC) to the rotate image benchmark but with higher sensitivity to spatial locality because increasing the block size reduces overall cache requirements. We can attribute this higher sensitivity to the algorithm repeatedly indexing into a table with address strides between lookups of only one or two entries.

Workloads from different suites exhibit similar characteristics. For example, the controller area network (CAN) remote data request benchmark (canrdr) takes a stream of three CAN messages and simulates their transmission and decoding. This process is similar to many networking benchmarks such as the packet flow (pktflow) and packet check (pktcheck) benchmarks, which deal with IP packet transmission and decoding.

[Figure 4. Kiviat plots of combined characteristics. Each benchmark is plotted on axes for memory activity (% Memory and the cache sizes required to reach 1 and 0.1 percent miss ratios with fully associative caches using 32-byte and 128-byte blocks), parallelism (% ALU and IPC), branch characterization (% Branch and branch misprediction ratio), and % Rest; Table 2 defines the axes. Suite key: A = automotive, C = consumer, D = digital entertainment, O = office, N = networking, N2 = networking v2.0, T = telecom. The plots are clustered into nine groups of visually similar benchmarks.]

Within the networking suites, we implemented the packet flow and packet check benchmarks using four different packet sizes. As expected, as packet size increases, the cache size requirements also increase because the working set is much larger. The Open Shortest Path First benchmark (ospf) demonstrates differences between the first and second versions. Networking v2.0 has greater ALU activity because it removes some of the superfluous debugging checks turned on by default in the first version. Both the algorithm and working set are identical; the only change between versions is code checking the consistency of the Dijkstra algorithm, which must be the source of the difference. Although the test harnesses of the two versions differ significantly, this did not affect our results because we collected data only within the main benchmark execution (after data setup and before teardown).

As Figure 2 shows, the networking suite of benchmarks requires the fewest overall function units to achieve 85 percent utilization. This implies that this suite's inherent ILP is lower than that of the others. The memory units still exhibit a decent amount of parallelism, suggesting that memory dependencies are not the main cause of the ILP deterioration. One limiting factor is the high dependence chains caused by ALU instructions. Although the ALU instruction distribution is not as high as in other suites, as Figure 3 shows, some benchmarks in the networking suite are still at least 40 percent ALU instructions. Thus, only one of the Kiviat graphs for the networking benchmarks resides in a group exhibiting high IPC (route lookup in group 9 in Figure 4). These benchmarks are much more memory intensive, with some benchmarks having up to 50 percent memory instructions in their overall execution stream.

Characterization group 3 consists entirely of benchmarks from the consumer and digital entertainment (that is, consumer v2.0) suites. These benchmarks show a low percentage of branch instructions and a high prediction ratio. Therefore, the average IPC is high. Additionally, the graphs exhibit sensitivity to an increased block size, especially when targeting a 1 percent miss ratio. This implies many calculations with few decision branches in large arrays. The block size effects indicate that spatial locality is significant for large portions of the data. Group 7 is similar to group 3, except the benchmarks in this group have more branch instructions, implying tighter loops and more if-then clauses. Together, the consumer and digital entertainment groups encompass all of the consumer benchmarks, showing that the earlier version contains many highly similar workloads. The digital entertainment suite encompasses a new characterization group (group 8) that emphasizes less cache stress and more ALU instructions.

Table 2. Description of axes for Figure 4.

(1% FA B=5) x 3, (0.1% FA B=5) x 3, (1% FA B=7) x 3, (0.1% FA B=7) x 3: Log2(C), where C is the cache size in bytes required for a fully associative cache with a block size of 2^B bytes targeting a 1 percent or 0.1 percent miss ratio. We multiply this value by 3 to normalize it with the other metrics.

% ALU / % Memory / % Branch / % Rest: Dynamic instruction distribution of ALU, memory operations, branches, and all other instruction types.

IPC x 20: The instructions per cycle (IPC) calculated on a 4-wide superscalar processor using a bimodal branch predictor. We multiply this value by 20 to normalize it with the other metrics.

Branch misprediction ratio x 100: Bimodal branch predictor miss ratio, which we multiply by 100 to normalize it with the other metrics.
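As a concrete reading of Table 2, this hypothetical helper (the raw measurements are assumed to exist already) computes the normalized values plotted on one benchmark's Kiviat axes; the result can be passed straight to the kiviat_plot sketch shown earlier.

```python
import math

def kiviat_axes(cache_reqs, pct_alu, pct_mem, pct_branch, pct_rest,
                ipc, mispredict_ratio):
    """Scale raw measurements onto the Figure 4 axes as defined in Table 2.

    cache_reqs maps an axis label such as "1% FA B=5" to the required fully
    associative cache size in bytes (the study uses B = 5 and B = 7 with
    1 and 0.1 percent targets, giving four cache axes); percentages are 0-100.
    """
    axes = {label: 3 * math.log2(size) for label, size in cache_reqs.items()}
    axes.update({
        "% ALU": pct_alu,
        "% Memory": pct_mem,
        "% Branch": pct_branch,
        "% Rest": pct_rest,
        "IPC x 20": 20 * ipc,
        "Branch misprediction ratio x 100": 100 * mispredict_ratio,
    })
    return axes
```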


Each grouping exhibits unique characteristics, which illustrates diversity in and among the EEMBC suites. For example, group 6 contains networking benchmarks that stress only the cache by reducing complex calculations and benchmarking large array lookups and data movement. However, all groups share a high misprediction rate using a simple bimodal predictor. Of course, this might not be an issue for most embedded applications because branch predictors often are not included or needed because the code is optimized to be fast without them.

Characterization group 8 encompasses benchmarks with higher ALU activity than other characterization metrics. Most of the benchmarks in this category also exhibit somewhat lower IPC values than other groupings, yet the branch misprediction ratio is consistently low. The cache size requirements are relatively small compared to other groupings because the working set is smaller for these benchmarks. Therefore, the implicit ILP must be limited by other factors such as a high degree of dependent instructions and calculations.

Characterization group 4 consists of a single benchmark, ospf. This benchmark is similar to the route lookup benchmark in group 9, except that the latter is less memory intensive and exhibits a higher branch misprediction ratio. Group 9 benchmarks respond to an increasing block size, whereas ospf does not. This is because ospf spends a large amount of time traversing a graph to solve Dijkstra's algorithm, in which the working set is larger and not sequentially stored, whereas benchmarks such as group 9's route lookup spend more time in computation through sequential structures.

Characterization group 5 is similar to group 1, except that it exhibits slightly different cache behavior. In this grouping, the cache requirements differ for the 1 and 0.1 percent miss ratio targets. This implies that only a few conflict misses cause most of the cache misses for benchmarks in this category. Targeting a 0.1 percent miss ratio when the block size is 32 bytes requires a larger cache to map these conflict misses into separate blocks, whereas in group 1 the cache requirements for both miss ratio targets are similar. However, as benchmarks like the viterbi decoder benchmark (viterb00) illustrate, this behavior is not identical for the larger block size. In addition to this behavior, group 5 benchmarks exhibit relatively low cache requirements compared with the other metrics.

Group 1 exhibits the most balance among metrics. Benchmarks in this classification provide an overall equal stress on the cache and balanced instruction distribution, with varying IPC results. Most of the benchmarks place high stress on the cache, particularly those in the networking suites. One benchmark that is balanced overall but still different in this category is the automotive suite's angle-to-time conversion (a2time) benchmark. The cache requirements are smaller than those of other benchmarks, and the branch misprediction ratio is very low, yet the IPC is not very high. Most benchmarks with this cache behavior trend toward category 8 or 9, with either an imbalance in the instruction distribution or a higher branch misprediction ratio. Yet, this benchmark does not exhibit a spike for either characteristic. The benchmark is essentially a series of sequential calculations building on one another and causing true dependencies between instructions to limit ILP. Also, there is a sequence of if statements not grouped as if-then-else clauses, yet each one updates the same variable, resulting in output dependencies that can cause lower IPC.

In future work, we could use other tools and methodologies4,5 to generate different characteristics and compare the results. In particular, Hoste and Eeckhout examine the delta change of measurements at fixed intervals to expose time-varying behavior.4

Our work has focused on characterizing the benchmarks, but it would also be interesting and practical to apply this methodology to real application code for validation. We could then devise a scheme that would compare workloads between the benchmarks and the real code, thereby letting the developer select benchmarks that most closely resemble their applications.

Another interesting approach is exploring the generation of workloads with specific characteristics to evaluate a particular architecture. Some work on the subject exists,12-14 but there appears to be no solution for generating portable C code with specific characteristics. Using a known set of kernels with various data sets and composing them into a benchmark based on characteristics as well as their relevance to a specific market segment can enable the creation of more accurate application-representative workloads. EEMBC is pursuing such a model for a future version of the automotive benchmarks.

You might ask, "Why go through all the effort? Why don't developers just use their own application code to benchmark?" The answer is simple: developers cannot always run their code on all platforms without considerable porting effort and, for large applications, huge time commitments. The benchmarks serve as proxies for larger workloads that are representative, portable, and quick to execute. Additionally, the value of an industry standard and the consistency of test inputs allow a more uniform comparison. Results for EEMBC benchmarks are available from any silicon provider on request, and some are available publicly at www.eembc.org. Using these results, and armed with specific benchmarks' characteristics, users can choose the results they find useful and select suitable systems or methodologies to use when designing new products, without spending time and effort to port and run the code on a platform. MICRO

References

1. T.M. Conte and W.W. Hwu, "Benchmark Characterization," Computer, vol. 24, no. 1, Jan. 1991, pp. 48-56.
2. A. Joshi et al., "Distilling the Essence of Proprietary Workloads into Miniature Benchmarks," ACM Trans. Architecture and Code Optimization (TACO), vol. 5, no. 1, 2008, article 10.
3. T.S. Karkhanis and J.E. Smith, "Automated Design of Application Specific Superscalar Processors: An Analytical Approach," Proc. 34th Ann. Int'l Symp. Computer Architecture (ISCA 07), ACM Press, 2007, pp. 402-411.
4. K. Hoste and L. Eeckhout, "Microarchitecture-Independent Workload Characterization," IEEE Micro, vol. 27, no. 3, May 2007, pp. 63-72.
5. A. Joshi et al., "Measuring Benchmark Similarity Using Inherent Program Characteristics," IEEE Trans. Computers, vol. 55, no. 6, June 2006, pp. 769-782.
6. K. Hoste et al., "Performance Prediction Based on Inherent Program Similarity," Proc. 15th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT 06), ACM Press, 2006, pp. 114-122.
7. T.M. Conte, M.A. Hirsch, and W.-M.W. Hwu, "Combining Trace Sampling with Single Pass Methods for Efficient Cache Simulation," IEEE Trans. Computers, vol. 47, no. 6, June 1998, pp. 714-720.
8. P. Viana et al., "A Table-Based Method for Single-Pass Cache Optimization," Proc. 18th ACM Great Lakes Symp. VLSI, ACM Press, 2008, pp. 71-76.
9. A. Silva et al., "Cache-Analyzer: Design Space Evaluation of Configurable-Caches in a Single-Pass," Proc. 18th IEEE/IFIP Int'l Workshop Rapid System Prototyping (RSP 07), IEEE CS Press, 2007, pp. 3-9.
10. T.M. Conte, "Architectural Resource Requirements of Contemporary Benchmarks: A Wish List," Proc. 26th Hawaii Int'l Conf. System Sciences (HICSS 93), IEEE CS Press, 1993, pp. 517-529.
11. K.W. Kolence and P.J. Kiviat, "Software Unit Profiles & Kiviat Figures," ACM SIGMETRICS Performance Evaluation Rev., vol. 2, no. 3, 1973, pp. 2-12.
12. R. Bell Jr. and L. John, "Improved Automatic Test Case Synthesis for Performance Model Validation," Proc. Int'l Conf. Supercomputing (ICS 05), ACM Press, 2005, pp. 111-120.
13. Z. Kurmas et al., "Synthesizing Representative I/O Workloads Using Iterative Distillation," Proc. Int'l Symp. Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 03), IEEE CS Press, 2003, pp. 6-15.
14. A.M. Joshi, "Automated Microprocessor Stressmark Generation," Proc. IEEE 14th Int'l Symp. High Performance Computer Architecture (HPCA 08), IEEE CS Press, 2008, pp. 229-239.

Jason A. Poovey is a PhD student in the College of Computing at the Georgia Institute of Technology. His research interests include performance modeling, workload characterization, and manycore/multicore technology. Poovey has a master's degree in computer engineering from North Carolina State University.

Markus Levy is the president and founder of the Embedded Microprocessor Benchmark Consortium (EEMBC), and founder of the Multicore Association. His research interests include microprocessor technology, processor benchmarking, and multicore programming. He received several patents while at Intel for his ideas related to flash memory architecture and usage as a disk drive alternative. Levy has a BS in electrical engineering from San Francisco State University.

Shay Gal-On is the director of software engineering at the Embedded Microprocessor Benchmark Consortium and leader of the EEMBC Technology Center. His interests include performance analysis, software security, and compilers. Gal-On received his degree from the Technion in Israel, following which he worked on compilers and performance tools for Intel, Improv Systems, PMC, and now EEMBC. Contact Gal-On at [email protected].

Thomas M. Conte is a professor in the College of Computing at the Georgia Institute of Technology. His research is in the areas of manycore/multicore architectures, microprocessor architectures, compiler code generation, architectural performance evaluation, and embedded computer systems. Conte has a PhD in electrical engineering from the University of Illinois at Urbana-Champaign. He is associate editor of ACM Transactions on Embedded Computer Systems, ACM Transactions on Architecture and Compiler Optimization, Computer, and IEEE Micro.

Direct questions and comments about this article to Jason Poovey, College of Computing, Georgia Inst. of Technology, 266 Ferst Dr., KACB 2334, Atlanta, GA 30332; [email protected].
