Toward a More Accurate Understanding of the Limits of the TLS Execution Paradigm
Nikolas Ioannou, Jeremy Singer, Salman Khan, Polychronis Xekalakis, Paraskevas Yiapanis, Adam Pocock, Gavin Brown, Mikel Lujan, Ian Watson, and Marcelo Cintra
University of Edinburgh
http://homepages.inf.ed.ac.uk/mc/Projects/VESPA
University of Manchester
http://intranet.cs.man.ac.uk/apt/projects/iTLS
Intl. Symp. on Workload Characterization - December 2010 2
Introduction
Thermal/power constraints, complexity, and time-to-market pressures lead to chip multiprocessors (CMPs)
Many simple cores = high TLP but low ILP
– OK for throughput computing and embarrassingly parallel applications
Problem:
– No benefits for sequential applications
– Parallel applications with large sequential parts are still limited by Amdahl's law
=> Thread Level Speculation (TLS)
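The Amdahl limit mentioned above can be made concrete with a short calculation. The sketch below is illustrative only: the function name `amdahl_speedup` and the sample fractions are assumptions, not values from the slides.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the run scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical sample points: how the bound moves on an 8-core CMP.
for fraction in (0.5, 0.9, 0.99):
    print(fraction, round(amdahl_speedup(fraction, 8), 2))
```

On 8 cores, a program that is 90% parallel is bounded at roughly 4.7x, which is why shrinking the sequential part matters more than adding cores.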
Motivation
Shortcomings of prior work in assessing TLS performance potential
– Evaluations often tied to a particular TLS architectural configuration
– Proposals of new extensions naturally focus on those extensions, without investigating the interplay with other features
– Workload choice often limited to one particular domain or programming style
Contributions
In-depth implementation-independent study of TLS performance potential
Evaluate TLS architectural features
Evaluate workloads from a variety of domains
Investigate load imbalance and coverage within the context of TLS
Outline
Introduction
Background
Methodology
Results
Conclusions
Thread Level Speculation
Compiler deals with:
– Task selection
– Code generation
HW deals with:
– Different context
– Spawn threads
– Detecting violations
– Replaying
– Arbitrate commit
[Figure: timeline of Thread 1 and a speculative Thread 2; axis: Time]
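The hardware duties listed above (detecting violations and replaying) can be illustrated with a toy model. This is a minimal sketch, assuming each task is summarized by the sets of addresses it reads and writes. `Task`, `find_violations`, `tasks_to_replay`, and the addresses are hypothetical. Real TLS hardware tracks accesses with timing information, so it flags only reads that actually happened too early; this sketch conservatively flags every overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One speculative task, summarized by the addresses it touches."""
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def find_violations(tasks):
    """Return (producer, consumer, address) triples for cross-task RAW overlaps.

    `tasks` is in sequential program order. Pessimistic assumption: every
    read in a later task ran before the earlier task's write was visible.
    """
    violations = []
    for i, earlier in enumerate(tasks):
        for later in tasks[i + 1:]:
            for addr in sorted(earlier.writes & later.reads):
                violations.append((earlier.name, later.name, addr))
    return violations


def tasks_to_replay(tasks):
    """Names of tasks that consumed stale data, in program order."""
    squashed = {consumer for _, consumer, _ in find_violations(tasks)}
    return [task.name for task in tasks if task.name in squashed]


# Hypothetical example: T2 reads an address that T1 writes.
tasks = [
    Task("T1", reads={0x20}, writes={0x10}),
    Task("T2", reads={0x10}, writes={0x30}),
    Task("T3", reads={0x40}, writes={0x50}),
]
print(find_violations(tasks))
print(tasks_to_replay(tasks))
```

The sketch replays only the task that read stale data. Many TLS designs also squash every more-speculative successor, which this model leaves out for brevity.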
Architectural Extensions
Multiversioned caches
Support for out-of-order spawning
Dynamic dependence synchronization
Intermediate checkpointing
Data value prediction
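Data value prediction lets a speculative thread continue past a dependence by guessing the value it will eventually receive. One of the simplest schemes is a last-value predictor, sketched below. The slides do not say which predictor the study models, so the class `LastValuePredictor` and its interface are assumptions made for illustration.

```python
class LastValuePredictor:
    """Predict that a location will produce the same value it produced last time."""

    def __init__(self):
        self._last = {}   # key (e.g. an address or instruction id) -> last value seen
        self.hits = 0
        self.misses = 0

    def predict(self, key):
        """Return the predicted value, or None if the key was never seen."""
        return self._last.get(key)

    def update(self, key, actual):
        """Record the real value and report whether the prediction was correct."""
        correct = key in self._last and self._last[key] == actual
        if correct:
            self.hits += 1
        else:
            self.misses += 1
        self._last[key] = actual
        return correct


predictor = LastValuePredictor()
for value in (5, 5, 6):           # hypothetical stream of values for one key
    predictor.update("x", value)
print(predictor.hits, predictor.misses, predictor.predict("x"))
```

A correct prediction means the consumer thread need not be squashed. A wrong one is handled like any other dependence violation.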
Methodology
Benchmarks
– Imperative: SPEC CPU 2006, Mediabench II
– Object oriented: SPEC JVM 98, DaCapo
Instrumentation
– GCC4 pass
  Annotate loop iterations and method bodies
  Mark induction, reduction variables and use of return values
  Operate after the intermediate optimizations
– Jikes RVM modification
Methodology
Trace Generation
– Simics, full-system functional simulator
– Non-intrusive trace of memory accesses
Trace-Driven Simulation
– In-house simulator tool
  Extracts threads out of loop iterations and/or method call continuations
  Simulates: multi-versioned caches, OoO spawning, dynamic dependence synchronization, and value prediction
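The thread-extraction step described above can be pictured as cutting an annotated trace at iteration boundaries. The following is a minimal sketch under stated assumptions: the trace format (tuples of `"iter_begin"`, `"load"`, `"store"`) and the function `split_into_tasks` are hypothetical and do not reflect the in-house simulator's actual input format.

```python
def split_into_tasks(trace):
    """Group memory accesses by the iteration marker that precedes them."""
    tasks = []
    current = None  # accesses before the first marker belong to no task
    for kind, value in trace:
        if kind == "iter_begin":
            current = {"loop": value, "loads": [], "stores": []}
            tasks.append(current)
        elif kind == "load" and current is not None:
            current["loads"].append(value)
        elif kind == "store" and current is not None:
            current["stores"].append(value)
    return tasks


# Hypothetical trace: one access of sequential code, then two loop iterations.
trace = [
    ("load", 0x100),
    ("iter_begin", "loop7"),
    ("load", 0x200),
    ("store", 0x204),
    ("iter_begin", "loop7"),
    ("load", 0x204),
]
for task in split_into_tasks(trace):
    print(task)
```

Each resulting task carries the load and store addresses that a trace-driven simulator would need to check for cross-task dependences.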
Methodology
Task Selection
– In-order loop-level speculation
  Innermost loops
  Best loops out of three dynamic depth levels
– In-order method and Out-of-Order speculation
  Dynamic thread spawning policy favoring safer threads
  Maximum thread size heuristic
– All loops and/or methods are candidates
Loop-level speculation - Innermost
[Figure: loop iterations Iter. 1, Iter. 2, …, Iter. n executing as speculative threads]
for (i = 0; i < m; i++) {
  outer_loop_body1
  for (j = 0; j < l; j++) {
    inner_loop_body1
    for (k = 0; k < n; k++) {
      spawn_thread();
      innermost_loop_body
    }
    inner_loop_body2
  }
  outer_loop_body2
}
Loop-level speculation - Innermost
Loop-level speculation – Best loop depth
[Figure: loop iterations Iter. 1, Iter. 2, …, Iter. n executing as speculative threads]
for (i = 0; i < m; i++) {
  outer_loop_body1
  for (j = 0; j < l; j++) {
    spawn_thread();
    inner_loop_body1
    for (k = 0; k < n; k++) {
      innermost_loop_body
    }
    inner_loop_body2
  }
  outer_loop_body2
}
Loop-level speculation – Best loop depth
Method-level speculation - In-Order
[Figure: a method and its continuation (method Cont.), with the continuation running as a speculative thread]
pid = spawn_thread();
if (pid != 0)
  method();
method_Cont
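The spawn pattern above lets the continuation start before the method has returned, which only pays off if the continuation does not depend on the return value or if that value can be guessed. The sketch below models this squash-or-commit decision sequentially. The function `run_method_level_speculation` is hypothetical, and the sketch assumes the continuation has no side effects (real TLS hardware buffers speculative state so that it can be discarded).

```python
def run_method_level_speculation(method, continuation, predicted_return):
    """Model one spawn and return (result, squashed).

    In real TLS the continuation runs concurrently with the method. Here
    the two run back to back, which is enough to show the squash logic.
    """
    # Speculative thread: run the continuation on a guessed return value.
    speculative_result = continuation(predicted_return)
    # Non-speculative thread: run the method itself.
    actual_return = method()
    if actual_return == predicted_return:
        return speculative_result, False      # speculation commits
    # Misprediction: discard the speculative work and replay.
    return continuation(actual_return), True


# Hypothetical method and continuation.
method = lambda: 3
continuation = lambda ret: ret * 2 + 1
print(run_method_level_speculation(method, continuation, predicted_return=3))
print(run_method_level_speculation(method, continuation, predicted_return=0))
```

Both calls produce the same final result. The second pays for it with a squash and a replay.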
Method-level speculation - In-Order
Method-level speculation - OoO
[Figure: timeline (axis: Time) of method1, method2 Cont., and method1 Cont., with the continuations running as speculative threads]
pid = spawn_thread();
if (pid != 0)
  method1();
method1_Cont

method1() {
  method1_body1
  pid = spawn_thread();
  if (pid != 0)
    method2();
  method2_cont
}
Method-level speculation - OoO
Mixed speculation - In-Order
Mixed speculation - OoO
Load Imbalance and Coverage
[Figure: per-benchmark bars for gcc (IO loop, IO method, mixed), lbm (loop, mixed), libq (IO loop, IO method, IO mixed), mcf (OoO loop, mixed), sphinx3 (IO loop, method, OoO mixed), cjpeg (loop, OoO method), jpg2Kd (OoO loop, OoO method), mpeg4d (OoO loop, OoO method), compress (OoO loop, mixed), and pmd (loop, OoO method, OoO mixed). Axes: Speedup, Normalized over Amdahl's Law (0 to 1) and Percentage of Program Execution (0% to 100%). Legend: Load Imbalance.]
Results – Multi-versioning to the rescue?
Conclusions
Load imbalance and limited coverage important factors in realizing TLS performance
Support for OoO spawning provides no significant benefit for the task policy employed
Multi-versioned caches unlock performance in some cases but are not a panacea
Task selection critical
Also in the paper
In-depth analysis of high coverage loops for selected benchmarks
Comparison of TLS loop-level speculation with a state-of-the-art auto-parallelizing compiler
OoO Loop-level speculation
An outline of most of the proposed architectural and compiler extensions for TLS systems
Backup slides – Auto parallelizing compiler comparison
Backup slides – OoO loop