Upload
others
View
12
Download
0
Embed Size (px)
Citation preview
HW/SW Co-designed Processors: Challenges, Design Choices and a
Simulation Infrastructure for Evaluation
Rakesh Kumar1, José Cano1, Aleksandar Brankovic2, Demos Pavlou3,
Kyriakos Stavrou3, Enric Gibert4, Alejandro Martínez5, Antonio González6
1University of Edinburgh, UK 2Intel 311pets
4Pharmacelera 5ARM 6Universitat Politècnica de Catalunya, Spain
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)Santa Rosa, California, USA - April 24-25, 2017
Outline
• HW/SW co-designed processors
• Building a simulation infrastructure
• DARCO
• Evaluation
• Conclusions
2
HW/SW co-designed processors
3
Energy Efficiency
Performance
HW/SW co-designed processor
Operating System
Translation Optimization Layer (Software)
Libraries
Application Programs
Execution Hardware
Host ISA
Guest ISA
Conventional processor
Operating System
Execution Hardware
Libraries
Application Programs
ISA
• Simple Host ISA
– In-order cores; move complexity to software layer
• Dynamic Binary Optimizations in software (TOL)
– Aggressive and speculative
– Exploit application behavior at runtime
• IBM DAISY (1997)
– Targets binary compatibility from PowerPC to VLIW architectures
• IBM BOA (1999)
– Targets high frequency PowerPC through simple hardware design
• Transmeta Crusoe (2000) and Efficeon (2003)
– Execute x86 binaries on proprietary VLIW with low power consumption
– Better energy efficiency than Intel Pentium III
• Nvidia Denver (2014)
– Executes ARMv8 binaries on proprietary in-order core
– Applying dynamic optimizations matches Out-of-order Intel Haswell
HW/SW co-designed processors: History
20031997 1999 2000 2014
DAISY BOA
4???
• IBM DAISY (1997)
– Targets binary compatibility from PowerPC to VLIW architectures
• IBM BOA (1999)
– Targets high frequency PowerPC through simple hardware design
• Transmeta Crusoe (2000) and Efficeon (2003)
– Execute x86 binaries on proprietary VLIW with low power consumption
– Better energy efficiency than Intel Pentium III
• Nvidia Denver (2014)
– Executes ARMv8 binaries on proprietary in-order core
– Applying dynamic optimizations matches Out-of-order Intel Haswell
HW/SW co-designed processors: History
20031997 1999 2000 2014
DAISY BOA
5???
Anything missing?No major academic project!
Can lack of simulation infrastructure be the reason?
Outline
• Introduction
• HW/SW co-designed processors
• Building a simulation infrastructure
• DARCO
• Evaluation
• Conclusions
6
What will a simulation infrastructure enable?
• Where to implement (HW or SW?) microarchitectural features like
– Instruction decoding/reordering, memory disambiguation, register renaming, …
7
Operating System
Translation Optimization Layer (Software)
Libraries
Application Programs
Execution Hardware
Host ISA
Guest ISA
What will a simulation infrastructure enable?
• Where to implement (HW or SW?) microarchitectural features like
– Instruction decoding/reordering, memory disambiguation, register renaming, …
• How to reduce “startup delay”
– One of the major problems of Transmeta products
• When and where to translate/optimize the guest binaries
– As soon as code becomes “hot”?, wait for a core to become idle?,…
• How to address speculative execution (memory, control)
– Checkpointing granularity?, find susceptible speculation failure points?
• When and how to profile the execution
– Overhead vs opportunity for improvement
8
What will a simulation infrastructure enable?
• Where to implement (HW or SW?) microarchitectural features like
– Instruction decoding/reordering, memory disambiguation, register renaming, …
• How to reduce “startup delay”
– One of the major problems of Transmeta products
• When and where to translate/optimize the guest binaries
– As soon as code becomes “hot”?, wait for a core to become idle?,…
• How to address speculative execution (memory, control)
– Checkpointing granularity?, find susceptible speculation failure points?
• When and how to profile the execution
– Overhead vs opportunity for improvement
9
A simulation infrastructure can help
evaluate trade-offs and design choices
Simulation infrastructure: Complexity
10
• Compilation framework
– Code analysis/translation
– Optimizations
– Code generation
• Runtime system
– Profiling and instrumentation
– Profile-guided optimizations
• Microarchitectural simulator
– Model components like pipeline, caches, …
– Allow sampling
Simulation infrastructure: Complexity
11
• Compilation framework
– Code analysis/translation
– Optimizations
– Code generation
• Runtime system
– Profiling and instrumentation
– Profile-guided optimizations
• Microarchitectural simulator
– Model components like pipeline, caches, …
– Allow sampling
Simulation infrastructure complexity =
Compilation framework + Runtime system +
Microarchitectural simulator
• Correctness
– It should not change program behavior
• Minimum software layer (TOL) overhead
– TOL execution time must be small
• Minimum emulation cost
– Host to guest instruction ratio must be low
• Support for multiple guest ISAs (front-ends)
– Enables wider applicability
• Plug and play support
– Easy to include/evaluate new features
• Debugging
– Strong debug toolchain
Simulation infrastructure: Requirements
12
TOL
Hardware
x86 ARM Power
Outline
• Introduction
• HW/SW co-designed processors
• Building a simulation infrastructure
• DARCO: Infrastructure for Research on HW/SW Co-designed Processors
• Evaluation
• Conclusions
13
DARCO: The big picture
14
x86 Componentx86 Binary
Commands Path Commands Path
Timing Simulator
Process Tracker
Data and Instruction Path
Data and Instruction Path
Co-designed Component
Controller
x86 Functional Emulator
RISCFunctional Emulator
x86 OSTranslation Optimization Layer (TOL)
Emulated x86 Memory State
Authoritative x86 Register State
Authoritative x86 Memory State
State Checker
Emulated x86 Register State
• Models a processor that executes x86 code on a RISC host architecture
• Four main components: Co-designed, x86, Timing Simulator, Controller
DARCO: Co-designed Component
• Models the functionality of a HW/SW co-designed processor
• Composed of TOL and host ISA functional emulator (user code)
• Maintains emulated x86 architectural and memory states
15
x86 Componentx86 Binary
Commands Path Commands Path
Timing Simulator
Process Tracker
Data and Instruction Path
Data and Instruction Path
Co-designed Component
Controller
x86 Functional Emulator
RISCFunctional Emulator
x86 OSTranslation Optimization Layer (TOL)
Emulated x86 Memory State
Authoritative x86 Register State
Authoritative x86 Memory State
State Checker
Emulated x86 Register State
DARCO: x86 Component
• Provides a full-system functional emulator for the guest x86 ISA
• Maintains authoritative x86 architectural and memory states
• Filters instruction stream and passes user code to co-designed component
16
x86 Componentx86 Binary
Commands Path Commands Path
Timing Simulator
Process Tracker
Data and Instruction Path
Data and Instruction Path
Co-designed Component
Controller
x86 Functional Emulator
RISCFunctional Emulator
x86 OSTranslation Optimization Layer (TOL)
Emulated x86 Memory State
Authoritative x86 Register State
Authoritative x86 Memory State
State Checker
Emulated x86 Register State
DARCO: Timing Simulator
• Models a parameterized in-order core
• Can distinguish application and TOL code
• Includes power and energy modelling (McPAT)
17
x86 Componentx86 Binary
Commands Path Commands Path
Timing Simulator
Process Tracker
Data and Instruction Path
Data and Instruction Path
Co-designed Component
Controller
x86 Functional Emulator
RISCFunctional Emulator
x86 OSTranslation Optimization Layer (TOL)
Emulated x86 Memory State
Authoritative x86 Register State
Authoritative x86 Memory State
State Checker
Emulated x86 Register State
DARCO: Controller
• Provides full control over the app execution and debugging utilities
• Compares authoritative and emulated x86 states to ensure correctness
18
User
x86 Componentx86 Binary
Commands Path Commands Path
Timing Simulator
Process Tracker
Data and Instruction Path
Data and Instruction Path
Co-designed Component
Controller
x86 Functional Emulator
RISCFunctional Emulator
x86 OSTranslation Optimization Layer (TOL)
Emulated x86 Memory State
Authoritative x86 Register State
Authoritative x86 Memory State
State Checker
Emulated x86 Register State
19
DARCO: Starting execution
x86 Component Tracker Co-designed ComponentController
User
CommandsUser : Execute application XYZ
Controller : Send proper command to x86 component
x86 OS : Starts application XYZ
Tracker : Identifies application XYZ
Controller : Request 1st code page from x86 component and send to TOL with initial state
TOL : Load the code page and starts emulating
TOLXYZ Exec Space
Code Code Reg File Data Reg File
Tracks the emulated
application. Passes
user-level code to co-
designed component
DataXYZ
20
DARCO: Handling system calls
Co-designed Componentx86 ComponentController
User
Events and CommandsTOL : Execution sequence reaches a system call
TOL : Flush state and send request to x86 component
x86 Comp : Reach the same execution point and make the system call
Controller : Send new state to TOL
TOL : Continue emulation
TOLXYZ Exec Space
Code Code
Tracker
Reg File Data Reg File
Sys CallSys Call
DataHandler
The system call is faked – it is
NOT emulated in the co-
designed component
21
x86 Componentx86 Binary
Commands Path Commands Path
Timing Simulator
Process Tracker
Data and Instruction Path
Data and Instruction Path
Co-designed Component
Controller
x86 Functional Emulator
RISCFunctional Emulator
x86 OSTranslation Optimization Layer (TOL)
Emulated x86 Memory State
Authoritative x86 Register State
Authoritative x86 Memory State
State Checker
Emulated x86 Register State
DARCO: Translation Optimization Layer (TOL)
• Interpretation (IM)
– x86 instructions interpreted sequentially
– Profiling for execution frequency of basic blocks
• Basic block translation (BBM)
– x86 basic blocks translated to intermediate representation
– Lightweight optimizations (dead code, constants)
– Profiling for branch directions and execution frequency of basic blocks
• Superblock optimization (SBM)
– Bigger optimization regions across multiple x86 basic blocks
– Aggressive and speculative optimizations
– No profiling (reduces overhead)
Cold Code
22
Warm Code
Hot Code
DARCO: TOL execution modes
23
x86 eip
Interpret
> BBth
No
BB translateYes
Store in Code $
Chain
Execute from Code $
In Code
$?
No Yes
From Code $
Create SB
Optimize SB
> SBth
Yes
No
Already optimized
DARCO: TOL execution flow
24
x86 eip
Interpret
> BBth
No
BB translateYes
Store in Code $
Chain
Execute from Code $
In Code
$?
No Yes
From Code $
Create SB
Optimize SB
> SBth
Yes
No
Already optimized
- x86 to Intermediate Representation (IR)- Control Speculation- Loop Unrolling
- Static Single Assignment (SSA)- Forward Pass
- Constant Folding - Constant/Copy Propagation- Common Subexpression Elimination
- Backward Pass- Dead Code Elimination
- Data Dependence Graph (DDG)- Memory Alias Analysis- Redundant Load Removal- Store Forwarding
-Instruction Scheduling- Data Speculation
- Register Allocation- Code Generation (opt. IR to Host ISA)
DARCO: TOL execution flow
ACD
DARCO: Building a superblock
25
A
branch
B
branch
C
branch
D
branch
E
branch
5% 95%
2%98%
30% 70%
A
branch
C
branch
D
branch
Exec counter > SBth
assert
assert
branch
Unbiased Branch
SuperBlock
ACD
DARCO: Building a superblock
26
A
branch
B
branch
C
branch
D
branch
E
branch
5% 95%
2%98%
30% 70%
A
branch
C
branch
D
branch
Exec counter > SBth
assert
assert
branch
Unbiased Branch
Impact: Bigger optimization regions
SuperBlock
DARCO: Timing Simulator
27
x86 Componentx86 Binary
Commands Path Commands Path
Timing Simulator
Process Tracker
Data and Instruction Path
Data and Instruction Path
Co-designed Component
Controller
x86 Functional Emulator
RISCFunctional Emulator
x86 OSTranslation Optimization Layer (TOL)
Emulated x86 Memory State
Authoritative x86 Register State
Authoritative x86 Memory State
State Checker
Emulated x86 Register State
DARCO: Timing Simulator
• Models a configurable RISC superscalar in-order processor
• Pipeline decoupled into: Front-End, Instruction Queue, Back-End
28
Front End
AC IF DEC
In. Queue Back End
ISSUE RR EXE WB
TLB L1
TLB L2
L1-I$ L2$ L1-D$
Main Memory
PrefetcherBPBTB
DARCO: Meeting the requirements
• Correctness
– Architectural/memory states compared periodically
• Minimum TOL overhead
– 3-stages for translation/optimization; Chaining
• Minimum emulation cost
– Aggressive/speculative optimizations
• Support for multiple guest ISAs (front-ends)
– Incorporating additional front-ends is straightforward
• Plug and play support
– Modular design
• Debugging
– Powerful debug toolchain
29
TOL
x86 ARM? Power?
x86 Componentx86
Binary
Commands Path Commands Path
Timin
g
Simulator
Process Tracker
Data and Instruction Path
Data and Instruction Path
Co-design Component
Controller
x86 Functional Emulator
RISCFunctional Emulator
x86 OSTranslation Optimization Layer (TOL)
Emulated x86
Memory State
Authoritative x86 Register
State
Authoritative x86 Memory
State
State Checker
Emulated x86
Register State
SBMOptim.
x86 eip
Interpret
> BBt
h
No
BB translate
Yes
Store in Code $
Chain
Execute from Code $
In Code $?
No
Yes
From Code $
Create SB
Optimize SB
> SBt
h
Yes
No
Already optimized
Controller
Controller
Outline
• Introduction
• HW/SW co-designed processors
• Building a simulation infrastructure
• DARCO
• Evaluation
• Conclusions
30
Evaluation methodology
• Benchmarks analyzed
– SPEC CPU2006: Integer, Floating point
– Physicsbench
• We simulate 4B (billion) x86 instructions
– Some benchmarks have fewer instructions and run to completion
• DARCO speed (x86 instructions)
– 3.4 MIPS and ~0.4 MIPS enabling the Timing Simulator
– Intel Xeon E5-2630L (2.40 GHz) and L5630 Dual (2.13 GHz) with 24-128 GB RAM
• Highlights
– x86 dynamic code distribution
– Execution time distribution
– Emulation cost
31
0%10%20%30%40%50%60%70%80%90%
100%
40
0.p
erlb
ench
40
1.b
zip
2
40
3.g
cc
42
9.m
cf
44
5.g
ob
mk
45
8.s
jen
g
46
2.li
bq
uan
tum
46
4.h
26
4re
f
47
1.o
mn
etp
p
47
3.a
star
48
3.x
alan
cbm
k
41
0.b
wav
es
43
3.m
ilc
43
4.z
eusm
p
43
5.g
rom
acs
43
6.c
actu
sAD
M
43
7.le
slie
3d
44
4.n
amd
45
0.s
op
lex
45
3.p
ovr
ay
45
4.c
alcu
lix
45
9.G
emsF
DTD
47
0.lb
m
48
2.s
ph
inx3
10
0.b
reak
able
10
1.c
on
tin
uo
us
10
2.d
efo
rmab
le
10
4.e
xplo
sio
ns
10
5.h
igh
spee
d
10
6.p
erio
dic
10
7.r
agd
oll
SPEC
INT2
00
6
SPEC
FP2
00
6
Ph
ysic
sben
ch
SPECINT2006 SPECFP2006 Physicsbench Averages
X8
6 d
ynam
ic in
stru
ctio
n p
erc
en
tage
Insn executed in IM Insn execute in BBM Insn executed in SBM
Evaluation: x86 dynamic code distribution
32
• ~90% of the dynamic instruction stream comes from SBM (optimized code)
• Represents ~14% of the static code (not shown)
90%
Lower repetition factor of the SuperBlocks
Evaluation: Execution time distribution
• Application: cycles executing RISC instructions that emulate the x86 code
• TOL (overhead): cycles translating/optimizing the guest code, etc
33
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
40
0.p
erlb
ench
40
1.b
zip
2
40
3.g
cc
42
9.m
cf
44
5.g
ob
mk
45
8.s
jen
g
46
2.li
bq
uan
tum
46
4.h
26
4re
f
47
1.o
mn
etp
p
47
3.a
star
48
3.x
alan
cbm
k
41
0.b
wav
es
43
3.m
ilc
43
4.z
eusm
p
43
5.g
rom
acs
43
6.c
actu
sAD
M
43
7.le
slie
3d
44
4.n
amd
45
0.s
op
lex
45
3.p
ovr
ay
45
4.c
alcu
lix
45
9.G
emsF
DTD
47
0.lb
m
48
2.s
ph
inx3
10
0.b
reak
able
10
1.c
on
tin
uo
us
10
2.d
efo
rmab
le
10
4.e
xplo
sio
ns
10
5.h
igh
spee
d
10
6.p
erio
dic
10
7.r
agd
oll
SPEC
INT2
00
6
SPEC
FP2
00
6
Ph
ysic
sben
ch
SPECINT2006 SPECFP2006 Physicsbench Averages
Pe
rce
nta
ge o
f ex
ecu
tio
n t
ime
TOL (overhead) Application 20%
0
1
2
3
4
5
6
40
0.p
erlb
ench
40
1.b
zip
2
40
3.g
cc
42
9.m
cf
44
5.g
ob
mk
45
8.s
jen
g
46
2.li
bq
uan
tum
46
4.h
26
4re
f
47
1.o
mn
etp
p
47
3.a
star
48
3.x
alan
cbm
k
41
0.b
wav
es
43
3.m
ilc
43
4.z
eusm
p
43
5.g
rom
acs
43
6.c
actu
sAD
M
43
7.le
slie
3d
44
4.n
amd
45
0.s
op
lex
45
3.p
ovr
ay
45
4.c
alcu
lix
45
9.G
emsF
DTD
47
0.lb
m
48
2.s
ph
inx3
10
0.b
reak
able
10
1.c
on
tin
uo
us
10
2.d
efo
rmab
le
10
4.e
xplo
sio
ns
10
5.h
igh
spee
d
10
6.p
erio
dic
10
7.r
agd
oll
SPEC
INT2
00
6
SPEC
FP2
00
6
Ph
ysic
sben
ch
SPECINT2006 SPECFP2006 Physicsbench Averages
Ho
st In
stru
ctio
ns
pe
r x8
6 In
stru
ctio
n
Evaluation: Emulation cost
• Number of host instructions generated per x86 instruction ~3
• Reasonable for translating from CISC to RISC
34
3.2
Conclusions
• HW/SW co-designed processors
– Huge potential to improve energy efficiency and performance
– Several industrial projects
– No major project in academia (lack of simulation infrastructures)
• Challenges
– To become mainstream (e.g. startup delay)
– In building a simulation infrastructure (e.g. software layer overhead)
• DARCO
– May enable academic research in HW/SW co-designed domain
– Provides a modular simulation infrastructure
– Easy to add new components/optimizations
35
HW/SW Co-designed Processors: Challenges, Design Choices and a
Simulation Infrastructure for Evaluation
Rakesh Kumar1, José Cano1, Aleksandar Brankovic2, Demos Pavlou3,
Kyriakos Stavrou3, Enric Gibert4, Alejandro Martínez5, Antonio González6
1University of Edinburgh, UK 2Intel 311pets
4Pharmacelera 5ARM 6Universitat Politècnica de Catalunya, Spain
THANK YOU !!!