Can Hardware Performance Counters Be Trusted?
Vincent M. Weaver and Sally A. McKee
Cornell University
16 September 2008
Motivation
• Gather Basic Block Vectors for SimPoint
• Attempt to validate
• Found variation
Fusion
Hardware Performance Counters
• Available on all modern processors
• Used for validation
• Used for performance and workload characterization
Can they be trusted?
Related Work
• Black, Huang, Lipasti, Shen. ICCD 1996
PowerPC, Shorter benchmarks
• Korn, Teller, Castillo. IPCCC 2001
MIPS, up to 25% error compared to sim
• Maxwell, Teller, Salayandia, Moore. LACSIS 2002
Pentium III, <1% error, microbenchmarks
• Mytkowicz, Diwan, Hauswirth, Sweeney. NSF-NGS 2008
VM effects can cause up to 5% run-time variation
Retired Instruction Count
• Universally available
• Should be same on all implementations of ISA
• Used extensively with sampled execution (SimPoint, etc.)
• Part of IPC/CPI metrics
Experimental Setup
• Linux 2.6.25.4 with perfmon2 patches
• Statically linked 32-bit SPEC CPU 2000 and 2006, full
reference inputs
• Nine systems, seven runs per benchmark per system,
48 SPEC 2000, 55 SPEC 2006
• Only userspace instructions counted
• Pin, Qemu and Valgrind DBI tools also investigated
Experimental Systems
Processor      Speed     Memory   L1 I/D       L2 Cache
Pentium Pro    200MHz    256MB    8KB/8KB      512KB
Pentium II     400MHz    256MB    16KB/16KB    512KB
Pentium III    550MHz    512MB    16KB/16KB    512KB
Pentium 4      2.8GHz    2GB      12Kµ/16KB    512KB
Pentium D      3.46GHz   4GB      12Kµ/16KB    2MB
Athlon XP      1.73GHz   768MB    64KB/64KB    512KB
Phenom         2.2GHz    2GB      64KB/64KB    512KB
Core Duo       1.6GHz    1GB      32KB/32KB    1MB
Core2 Q6600    2.4GHz    2GB      32KB/32KB    4MB
Sources of Variation
• We found sources of variation:
◦ Inconsistent Instructions
◦ Virtual Memory Layout
◦ Hardware Effects
◦ System Issues
◦ DBI/Simulator Differences
• Can these be mitigated?
fldcw Instruction
• Pentium 4 instr_retired:nbogusntag counts fldcw as two instructions
• Can lead to a large overcount; 177.mesa has an additional 7 billion dynamic instructions, an overcount of 2.4%
• 12 of the benchmarks have an overcount of at least 100M instructions
Mitigation:
• Use instr_completed:nbogus (Pentium D)
• Adjust using DBI-collected fldcw count
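The second mitigation amounts to a subtraction. The following is a minimal sketch, assuming each fldcw contributes exactly one extra count and that a DBI tool supplies the dynamic fldcw count. The function names and the sample numbers are hypothetical; the numbers were chosen only to mirror the roughly 7 billion, 2.4% overcount quoted for 177.mesa.

```python
def adjust_for_fldcw(raw_count: int, fldcw_count: int) -> int:
    """Remove the one extra count the Pentium 4 event adds per fldcw."""
    # Each fldcw is counted twice, so it can account for at most half the raw count.
    if fldcw_count < 0 or 2 * fldcw_count > raw_count:
        raise ValueError("fldcw count is inconsistent with the raw count")
    return raw_count - fldcw_count


def overcount_percent(raw_count: int, fldcw_count: int) -> float:
    """Overcount relative to the adjusted instruction count, in percent."""
    adjusted = adjust_for_fldcw(raw_count, fldcw_count)
    if adjusted == 0:
        return 0.0
    return 100.0 * fldcw_count / adjusted


# Hypothetical numbers: 7 billion fldcw executed among 292 billion instructions.
raw = 299_000_000_000
fldcw = 7_000_000_000
print(adjust_for_fldcw(raw, fldcw))
print(round(overcount_percent(raw, fldcw), 1))
```

The same adjustment can be applied per benchmark once the fldcw counts have been collected.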
x86 32-bit Virtual Memory Layout
[Figure: 32-bit address space from 0x0000 0000 to 0xffff ffff. From low to high addresses: Text (Executable), Data, BSS, Heap, Stack, Operating System. Inset of the top of the stack at 0xbfff ffff: Exe Name, Cmd Line Args, Env Vars, Stack.]
Virtual Memory Randomization
[Figure: the same 32-bit address space (0x0000 0000 to 0xffff ffff, stack region below 0xbfff ffff), with Text (Executable), Data, and BSS fixed and the Heap and Stack positions shifted. Label: Stack/Heap Randomization.]
Stack Offset Changes
[Figure: the same 32-bit address space (0x0000 0000 to 0xffff ffff). The region below 0xbfff ffff holds Exe Name, Cmd Line Args, and Env Vars above the Stack. Labels: Environment Variables, Command Line, Executable Name.]
64-bit Compatibility
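[Figure: 32-bit address space in 64-bit compatibility mode. From low to high addresses: Text (Executable), Data, BSS, Heap, Stack, with the stack region reaching 0xffff ffff. Inset of the top of the stack: Exe Name, Cmd Line Args, Env Vars, Stack. Label: 64-bit Compatibility.]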
Virtual Memory Layout Impact
• Pointers as hash table keys:
◦ Heap — parser
◦ Stack — perlbench
• Optimized memory copies
Mitigation:
• linux32 -3 -R — enforce VM, disable randomization
• /proc/sys/kernel/randomize_va_space
• Enforce environment variable size
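The mitigations above can be combined when launching a benchmark. The following is a minimal sketch, assuming the `linux32` wrapper is installed; the helper names, the base environment, and the 256-byte target size are hypothetical choices for illustration. The executable name and command-line arguments also sit at the top of the stack, so they would need to be kept at a fixed length separately.

```python
import subprocess


def fixed_layout_command(benchmark_argv):
    """Prefix a command with linux32 -3 -R, as listed on the slide."""
    return ["linux32", "-3", "-R", *benchmark_argv]


def fixed_environment(pad_to=256):
    """Build a minimal environment whose strings total exactly pad_to bytes."""
    env = {"PATH": "/usr/bin:/bin"}  # hypothetical base environment
    # Each variable occupies len("KEY=VALUE") plus a terminating NUL byte.
    used = sum(len(k) + len(v) + 2 for k, v in env.items())
    filler = pad_to - used - len("PAD") - 2
    if filler < 0:
        raise ValueError("pad_to is smaller than the base environment")
    env["PAD"] = "x" * filler
    return env


def run_fixed(benchmark_argv):
    """Run the benchmark with the enforced layout and fixed-size environment."""
    return subprocess.run(
        fixed_layout_command(benchmark_argv),
        env=fixed_environment(),
        check=True,
    )
```

A call such as `run_fixed(["./benchmark", "input.ref"])` (hypothetical paths) would then run the same binary under the same layout on every invocation.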
Hardware Effects
• Processor Errata
• Hardware Interrupts — cause extra counts
Mitigation:
• Be aware of errata
• Count or estimate interrupts
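One way to estimate interrupts on Linux is to sample /proc/interrupts before and after a run and take the difference. The following is a rough sketch under that assumption; the totals are system-wide rather than per-process, and the slides do not say how many extra counts each interrupt contributes, so the difference is only an upper bound on the number of interrupts that could have affected the benchmark. The function names are hypothetical.

```python
def total_interrupts(proc_interrupts_text: str) -> int:
    """Sum the per-CPU counters of every row in /proc/interrupts text."""
    lines = proc_interrupts_text.splitlines()
    if not lines:
        return 0
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    total = 0
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue
        # Counter columns follow the "label:" field, one per CPU at most.
        for field in fields[1:1 + ncpus]:
            if not field.isdigit():
                break
            total += int(field)
    return total


def read_total_interrupts(path="/proc/interrupts") -> int:
    """Read the current system-wide interrupt total."""
    with open(path) as f:
        return total_interrupts(f.read())
```

Calling `read_total_interrupts()` immediately before and after a benchmark run and subtracting gives the estimate.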
Operating System Effects
• Non-deterministic system calls: time, PID, thread
synchronization, random numbers, network activity, IO
• Page faults
Mitigation:
• Modify benchmarks, use methods to reduce non-determinism
• Count page faults
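Page faults for a child process can be read from the operating system without modifying the benchmark. The following is a minimal sketch for Unix-like systems using Python's `resource` module; the function name is hypothetical, and the slides do not specify how the authors collected these counts.

```python
import resource
import subprocess
import sys


def run_and_count_faults(argv):
    """Run a command and return its (minor, major) page fault counts."""
    # RUSAGE_CHILDREN accumulates usage of terminated, waited-for children,
    # so the difference around subprocess.run isolates this one command.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(argv, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (
        after.ru_minflt - before.ru_minflt,
        after.ru_majflt - before.ru_majflt,
    )


if __name__ == "__main__":
    minor, major = run_and_count_faults([sys.executable, "-c", "pass"])
    print("minor faults:", minor, "major faults:", major)
```

Recording these counts alongside each instruction count makes it possible to check whether runs with more page faults also show higher counts.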
DBI Tool/Simulator Issues
• Instruction complexity: rep prefix
• Floating point rounding issues (art, dealII)
• Virtual Memory Layout
Mitigation:
• Fix simulator/DBI tool
• VM — same as with real hardware
Results
Two kinds of variation:
• Inter-machine (differences between systems)
• Intra-machine (differences on the same machine)
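Both kinds of variation can be summarized with the coefficient of variation (standard deviation divided by mean). The following is a minimal sketch, assuming the population standard deviation is used and that inter-machine variation is taken over per-machine means; the slides do not state either choice, and the function names and sample counts are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(counts):
    """Standard deviation divided by the mean of a list of counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return pstdev(counts) / m


def intra_machine_cov(runs_by_machine):
    """Variation among repeated runs on each machine."""
    return {
        machine: coefficient_of_variation(runs)
        for machine, runs in runs_by_machine.items()
    }


def inter_machine_cov(runs_by_machine):
    """Variation among the per-machine mean counts."""
    return coefficient_of_variation(
        [mean(runs) for runs in runs_by_machine.values()]
    )


# Hypothetical retired-instruction counts for one benchmark.
runs = {
    "machine_a": [1_000_000, 1_000_010, 999_990],
    "machine_b": [1_000_200, 1_000_200, 1_000_200],
}
print(intra_machine_cov(runs))
print(inter_machine_cov(runs))
```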
Coefficient of Variation – SPEC CPU 2000
[Figure: coefficient of variation (log scale, 1e-9 to 1) for each SPEC CPU 2000 benchmark and input, in two panels (integer benchmarks, floating-point benchmarks). Legend: Original Standard Deviation, Updated Standard Deviation. Annotated value: 1.07%.]
Coefficient of Variation – SPEC CPU 2006
[Figure: coefficient of variation (log scale, 1e-9 to 1) for each SPEC CPU 2006 benchmark and input, in two panels (integer benchmarks, floating-point benchmarks). Legend: Original Standard Deviation, Updated Standard Deviation. Annotated value: 0.41%.]
Same Machine Results – SPEC CPU 2000
[Figure: difference from mean (log scale, 0 to ±100M) of the SPEC CPU 2000 benchmarks on each platform: Pentium Pro, Pentium II, Pentium III, Pentium 4, Pentium D, Athlon XP, Phenom 9500, Core Duo, Core2 Q6600, Pin, Qemu, Valgrind. Legend: Original Standard Deviation, Updated Standard Deviation. Labeled outliers: applu, apsi, crafty, equake, facerec, parser, perlbmk, swim; all other points are non-outlying benchmarks.]
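The "difference from mean" on the vertical axis can be computed per benchmark from the repeated runs on one platform. The following is a minimal sketch with hypothetical counts; the function name is an assumption, and the slides do not describe how the signed values were mapped onto the log axis.

```python
from statistics import mean


def differences_from_mean(counts):
    """Signed difference of each run's count from the mean of all runs."""
    m = mean(counts)
    return [c - m for c in counts]


# Hypothetical retired-instruction counts from repeated runs of one benchmark.
runs = [1_000_000, 1_000_120, 999_880]
print(differences_from_mean(runs))
```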
Same Machine Results – SPEC CPU 2006
[Figure: difference from mean (log scale, 0 to ±100M) of the SPEC CPU 2006 benchmarks on each platform: Pentium Pro, Pentium II, Pentium III, Pentium 4, Pentium D, Athlon XP, Phenom 9500, Core Duo, Core2 Q6600, Pin, Qemu, Valgrind. Legend: Original Standard Deviation, Updated Standard Deviation. Labeled outliers: gcc.expr2, gromacs, leslie3d, milc, perlbench, povray, sjeng, soplex, zeusmp; all other points are non-outlying benchmarks.]
Cross Machine Results – SPEC CPU 2000
[Figure: difference from mean (log scale, 0 to ±10B) across machines for 256.bzip2.graphic, 252.eon.cook, 197.parser.default, 187.facerec.default, and 177.mesa.default. Legend: Original Standard Deviation, Updated Standard Deviation. Point labels: 6 = Pentium Pro, 2 = Pentium II, 3 = Pentium III, 4 = Pentium 4, D = Pentium D, A = Athlon XP, 9 = Phenom 9500, C = Core Duo, T = Core2 Q6600, P = Pin, Q = Qemu, V = Valgrind.]
Cross Machine Results – SPEC CPU 2006
[Figure: difference from mean (log scale, 0 to ±10B) across machines for 401.bzip2.liberty, 403.gcc.scilab, 456.hmmer.retro, 483.xalancbmk.default, and 482.sphinx3.default. Legend: Original Standard Deviation, Updated Standard Deviation. Point labels: 6 = Pentium Pro, 2 = Pentium II, 3 = Pentium III, 4 = Pentium 4, D = Pentium D, A = Athlon XP, 9 = Phenom 9500, C = Core Duo, T = Core2 Q6600, P = Pin, Q = Qemu, V = Valgrind.]
Conclusion
• The retired instruction counter can be trusted to have
low variation both inter- and intra-machine
• These results hold across processor generations
• For best results, precautions must be taken
Future Work
• Track down remaining sources of variation
• Non-x86 platforms
• Investigate other counter types
• Parallel workloads
Tools
All code is available from our tools page:
http://fusion.csl.cornell.edu/tools/
Questions?