HotChips 2002 - Intel® Itanium® 2 Processor
A Brief Analysis of the SPEC CPU2000 Benchmarks on the Intel® Itanium® 2 Processor
James McCormick, HP
Allan Knies, Intel
All trademarks are the property of their respective owners

Agenda
This is a data-oriented presentation, not research
– Brief performance summary
– Comparison of how the HP and Intel compilers use Itanium® architecture features, and how they compare to the best RISC
– Analysis of microarchitectural features of the Intel® Itanium® 2 processor and how they affect SPEC CPU2000 performance

Overall Performance
• SPEC{int/fp}_base2000: 810/1356 (best 0.18 µm)
• Linpack 1000: 3.5 Gflops (best overall)
• TPC-C (SQL/4P): 78K tpmC (best 4P number)
• SPECweb_SSL: 1520 connections (best of class)
→ Itanium® 2 processor is best of class on a wide range of applications

SPEC CPU2000 Results
[Figure: bar chart of SPECint_base2000, SPECint_peak2000, SPECfp_base2000, and SPECfp_peak2000 scores (y-axis 0 to 1600) for Itanium 2/HP 1 GHz, IBM Power4 1.3 GHz, Alpha 21264 1 GHz, and UltraSPARC III Cu 1.05 GHz; results from www.spec.org]

Part I: ISA and Compiler Comparisons
Allan Knies
Itanium® Architecture and Performance Team
Intel [email protected]

Instruction Mix: High Level
• Number of ‘useful instructions’ (shown in blue) is within +/- 1% of Alpha
• Total instructions (blue + green) are 20-30% higher than Alpha (due to NOPs)
• Bundling is the main cause of the extra NOPs
→ We’ll see that the extra instructions/NOPs do not substantially impact performance
[Figure: stacked bar chart of total instructions (y-axis 0.0E+00 to 4.0E+11) for Itanium 2/Intel, Itanium 2/HP, and Alpha; series: Branch or Not Predicated, Predicate True, Predicate False, Nop]
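
The difference between "extra instructions relative to Alpha" and "share of the Itanium instruction stream" can be sketched with a little arithmetic. The Python snippet below is illustrative only: it assumes the useful-instruction count equals Alpha's exactly (the slide says within 1%) and that every extra instruction is non-useful (a NOP or a predicated-off instruction). The function name is made up for this example.

```python
def non_useful_share(extra_vs_alpha: float) -> float:
    """Share of the total stream that is non-useful.

    Assumes total = (1 + extra_vs_alpha) * useful, where useful is taken
    to equal the Alpha instruction count (an assumption, see lead-in).
    """
    if extra_vs_alpha < 0:
        raise ValueError("extra_vs_alpha must be non-negative")
    return extra_vs_alpha / (1.0 + extra_vs_alpha)


if __name__ == "__main__":
    # The slide quotes 20-30% more total instructions than Alpha.
    for extra in (0.20, 0.30):
        share = non_useful_share(extra)
        print(f"{extra:.0%} more instructions -> {share:.1%} of the stream")
```

Under these assumptions, 20-30% more instructions than Alpha corresponds to roughly 17-23% of the Itanium instruction stream being non-useful.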

Instruction Mix Details: Introduction
Itanium/Alpha ISAs:
• Itanium® architecture has 40% fewer memory operations and 30% fewer branches than Alpha (some impact from no PGO on Alpha)
• Itanium architecture has about 10% more ALU/compare/shift operations than Alpha
• Itanium architecture has about 20-30% NOPs; eventually expect this to be under 20%
HP/Intel compilers:
• HP compiler uses more memory and ALU ops than Intel
• Implies HP is more conservative with registers; we’ll see the impact later
→ Itanium® architecture trades more ‘easy’ instructions (NOP, ALU, compare) for fewer ‘hard’ instructions (branch, load)
→ There is more than one way to get good performance from a compiler

Instruction Mix: Memory Instruction Details
• Alpha has 1.4x the memory references of Itanium® architecture (including RSE ops)
• RSE ops are easy to optimize in the future, if needed
• HP compiler is more conservative with registers, but has more ALU ops/reloads
→ Large register file, good compiler technology, and RSE pay off
→ HP has lower RSE costs, but more reloads/ALU ops
[Figure: stacked bar chart of memory instructions (y-axis 0.0E+00 to 1.0E+11) for Itanium2/Intel, Itanium2/HP, and Alpha; series: RSE, Stores, Loads]

Features: Register Stack
• Total time spent in RSE is only 3%-4% of overall execution
• 1½-3 cycles per call/return for RSE spill/fill activity
• 1-3 instructions per subroutine setup for RSE
• Intel compiler has fewer calls, but more cycles/call
→ Register stack provides very low overhead call/return support
[Figure: bar chart of total RSE cycles (y-axis 0.0E+00 to 1.0E+10) for the Intel and HP compilers]

Instruction Mix: Branch Detail
• HP compiler generates 13% fewer branches than Intel, 9% more mispredictions
• HP compiler generates 31% fewer branches than Alpha
• Itanium-based binaries have fewer taken branches than Alpha; data skewed by lack of PGO
→ Itanium® architecture reduces the number of branches and branch mispredicts
→ HP/Intel compilers both reduce the number of branches, but with different focus
[Figure: stacked bar chart of branch instructions (y-axis 0.0E+00 to 4.0E+10) for Itanium 2/Intel, Itanium 2/HP, and Alpha; series: Not Taken Branches, Taken Branches; data labels as extracted: 1.389E+10, 1.282E+10, 2.42E+10, 1.895E+10, 1.624E+10, 1.383E+10]

Parallelism: Region Size
• Useful instructions per taken branch is very high
• Useful instructions per mispredicted branch is slightly better for Intel
• Useful instructions per call is very high; Intel/HP compilers inline very aggressively
→ Advanced compilers reduce stress on branch prediction/I-cache hardware
→ Trading instruction-stream size for regularity improves hardware efficiency
[Figure: bar chart of useful instructions (y-axis 0 to 250) per taken branch, per mispredicted branch, and per call, for the Intel and HP compilers; data labels as extracted: 19, 159, 212, 20, 141, 158]

Features: Speculation
• About 20-30% of loads are speculative in Intel binaries
• Data shows a tiny penalty for chk.s despite its high usage rate
• Intel has 10x more chk.s than HP; HP uses the ‘no recovery model’ selectively (per-benchmark decision)
• HP has 10x more chk.a/ld.c than Intel; recovery takes less than 1% of time
→ Speculation is heavily used, but causes little overhead
[Figure: bar chart (log-scale y-axis, 1.0E+00 to 1.0E+10 instructions) of chk.s and of chk.a and ld.c for the Intel and HP compilers; series: # Checks, # Failed]

Useful IPC
• Useful IPC computed using ‘unstalled IPC’
• Compilers find 2.5-3.0 IPC in integer apps (even beyond SPEC)
• Dynamic delays reduce this to 1.3 achieved for CPU2000 integer
→ Differences in performance/heuristics show headroom for both compilers
→ Good IPC found by both compilers; room for microarchitecture improvements
[Figure: bar chart of useful IPC (y-axis 0.00 to 4.00), Intel and HP compilers side by side, for gzip, vpr, gcc, mcf, crafty, parser, eon, perl, gap, vortex, bzip2, twolf, and the average; series: Achieved, Found]
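
One way to read the "found" versus "achieved" numbers is sketched below in Python. It assumes "found" IPC means useful instructions per unstalled cycle and "achieved" IPC means useful instructions per total cycle, so that achieved = found × (unstalled cycles / total cycles). These definitions and the function name are assumptions for illustration, not taken from the slides. The 49% figure is the largest slice of the integer cycle-accounting chart in Part II; treating it as the unstalled share is likewise an assumption.

```python
def achieved_ipc(found_ipc: float, unstalled_fraction: float) -> float:
    """Achieved IPC under the assumed definitions in the lead-in.

    found_ipc          -- useful instructions per unstalled cycle
    unstalled_fraction -- unstalled cycles / total cycles, in [0, 1]
    """
    if not 0.0 <= unstalled_fraction <= 1.0:
        raise ValueError("unstalled_fraction must be between 0 and 1")
    return found_ipc * unstalled_fraction


if __name__ == "__main__":
    # Found IPC range quoted on the slide: 2.5 to 3.0 for integer apps.
    for found in (2.5, 3.0):
        print(f"found {found:.1f} -> achieved {achieved_ipc(found, 0.49):.2f}")
```

With a 49% unstalled share, the quoted 2.5-3.0 found IPC maps to roughly 1.2-1.5 achieved, which brackets the 1.3 reported for CPU2000 integer.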

Notes
• Itanium-based binaries used for these stats are older than those used for the official SPEC submission (less than 10% difference)
• The results for the Intel® Itanium® 2 processor in Part I come from two configurations: one with the Intel compiler running on a 64-bit MS OS and another with the HP compiler running under HP-UX
• Alpha ISA numbers obtained via simulation; binaries used are near peak (no profile-guided optimization), tuned for 21264. Alpha data is missing VPR and PERL, so both are left out of all averages in Part I.
• Results computed with arithmetic averages; data is thus skewed towards long-running benchmarks

Part II: Microarchitecture
James McCormick
HP Colorado VLSI Laboratory
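
Cycle Accounting: INT
[Figure: pie chart of integer cycle accounting; legend: PIPE FLUSH, DATA ACCESS, INSTR ACCESS, SCOREBOARD, RSE, UNSTALLED; slice labels as extracted: 7%, 38%, 3%, 1%, 49%, 2%]
Cycle Accounting: FP
[Figure: pie chart of floating-point cycle accounting; legend: PIPE FLUSH, DATA ACCESS, INSTR ACCESS, SCOREBOARD, RSE, UNSTALLED; slice labels as extracted: 1%, 47%, 1%, 0%, 50%, 1%]
→ Remaining: unstalled execution and data access
→ Great performance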
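
The two dominant slices suggest a simple upper bound on what reducing data-access stalls could buy. The Python sketch below applies an Amdahl's-law style calculation. It assumes the 38% (integer) and 47% (floating point) slices are the DATA ACCESS shares, which the extracted charts do not state explicitly, and the function name is hypothetical.

```python
def speedup_if_removed(stall_fraction: float, removed: float = 1.0) -> float:
    """Upper-bound speedup when part of one stall category is eliminated.

    stall_fraction -- share of all cycles spent in the category, in [0, 1)
    removed        -- portion of that category eliminated, in [0, 1]
    """
    if not 0.0 <= stall_fraction < 1.0:
        raise ValueError("stall_fraction must be in [0, 1)")
    if not 0.0 <= removed <= 1.0:
        raise ValueError("removed must be in [0, 1]")
    remaining_cycles = 1.0 - stall_fraction * removed
    return 1.0 / remaining_cycles


if __name__ == "__main__":
    # Assumed data-access shares: 38% for integer, 47% for floating point.
    for label, share in (("INT", 0.38), ("FP", 0.47)):
        print(f"{label}: all removed {speedup_if_removed(share):.2f}x, "
              f"half removed {speedup_if_removed(share, 0.5):.2f}x")
```

These are upper bounds only: they ignore any overlap between stall categories and any new stalls exposed once data-access time shrinks.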

Instructions Per Unstalled Cycle
[Figure: bar chart of IPC (y-axis 0 to 6) for Integer and Floating Point; legend: USEFUL, w/ PRED OFF, + NOPS]
→ High machine parallelism

Total Instruction Fetch Latency
[Figure: stacked bar chart of percent of total instruction fetch latency (y-axis 0 to 100) per benchmark. Integer: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, AVERAGE. Floating point: 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi, AVERAGE. Series: L1I HIT, L2 INSTR HIT, L3 INSTR HIT, MEM INSTR ACCESS]
→ Noticeable L1I misses
→ Very small I-fetch component

Instruction Fetch Ahead
[Figure: chart for INT AVERAGE and FP AVERAGE with two axes: instructions per cycle (0 to 6) and percent exposed (0 to 100); series: DECOUPLED DISP THROUGHPUT, DECOUPLED FE THROUGHPUT, FE BUBBLES EXPOSED]
→ High FE throughput rate
→ I-miss latency hidden

Branch Mispredictions
[Figure: chart per integer benchmark (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, AVERAGE) with two axes: misprediction percentage (0 to 20) and full pipelines per flush (0 to 20); series: WRONG PATH, WRONG TARGET, UNKNOWN, FULL PIPELINES PER MISPRED BR; one data label as extracted: 37.7]
→ High accuracy, low penalty
→ Helps instruction fetch

Total Data Read Latency
[Figure: stacked bar chart of percent of total data read latency (y-axis 0 to 100) per benchmark. Integer: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf, AVERAGE. Floating point: 168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi, AVERAGE. Series: L1D DATA READ, L2 DATA READ, L3 DATA READ, MEM DATA READ]
→ Large component, large opportunity

Summary
Itanium® 2 Processor Delivers Leadership Performance
– Architecture / Compilers
• Expose lots of ILP to in-order pipeline
• Replace difficult instructions with easy ones
• RSE and large register file work well together
– CPU Design
• Machine parallelism handles high ILP
• Aggressive design reduces bottlenecks

Acknowledgements
Jason Cantin (U. Wisconsin): Ran all the experiments to gather the Alpha ISA data, with changes/alterations at our request. Without Jason, we would have no Alpha data of any kind. Thanks to him for his efforts to rerun all the data for us!
Hwansoo Han, Youngsoo Choi, Geetha Vederaman (Intel): Intel data gathering, formatting, methodology, performance monitoring
Bryan Black, Ed Grochowski, Jim Callister, Carole Dulong (Intel): Extensive draft review and comments
Caliper Team (HP): Tool support
Backup/Reference Slides

Configuration Information
Intel® Itanium® 2 processor data for Intel systems/compilers:
– Binaries: Pre-production version of Intel C++ 6.0 Compiler, -O3 with interprocedural optimization and profile-guided optimization
– Run on: Pre-production stepping of Itanium 2 processor, 800 MHz/200 MHz core/bus, Intel 870 chipset; monitoring kernel and user instructions and events
Intel® Itanium® 2 processor data for HP systems/compilers:
– Binaries: Pre-production version of HP compiler
– Run on: Pre-production stepping of Itanium 2 processor, 1000 MHz/200 MHz core/bus, rx2600 prototype; monitoring kernel and user instructions and events
Alpha ISA data:
– Run on: Functional simulator, system code not simulated
– Simulation details: http://www.cs.wisc.edu/multifacet/misc/spec2000cache-data/
– All data thanks to Jason Cantin at the University of Wisconsin.
– Binaries: http://www.eecs.umich.edu/~chriswea/benchmarks/Com.cf
In general, these binaries are optimized for 21264, peak optimization. Usually -g3 -fast -O4, but NO profile feedback. Compiler: DEC C V5.9-005 and DIGITAL C++ V6.1-027
Other remarks:
– All averages in slides left out PERL and VPR due to data not available for Alpha