Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
1
Constructing Virtual Architectures
on Tiled Processors
David Wentzlaff
Anant Agarwal
MIT
2
Emulators and JITs
for Multi-Core
David Wentzlaff
Anant Agarwal
MIT
3
Why Multi-Core?
4
Why Multi-Core?
Future architectures will be on-chip parallel machines
Moore’s Law provides more parallel silicon resources
Diminishing sequential returns
Growth applications are parallel
5
Why Multi-Core?
Future architectures will be on-chip parallel machines
Moore’s Law provides more parallel silicon resources
Diminishing sequential returns
Growth applications are parallel
Future architectures will be optimized for parallel applications
Hardware compatibility will be broken
6
Why Emulators and JITs on Multi-Core?
Future architectures will be on-chip parallel machines
Moore’s Law provides more parallel silicon resources
Diminishing sequential returns
Growth applications are parallel
Future architectures will be optimized for parallel applications
Hardware compatibility will be broken
7
Why Emulators and JITs on Multi-Core?
Future architectures will be on-chip parallel machines
Moore’s Law provides more parallel silicon resources
Diminishing sequential returns
Growth applications are parallel
Future architectures will be optimized for parallel applications
Hardware compatibility will be broken
Future architectures will need to run legacy applications
Market forces will require future chips to run 1983 “Frogger” for DOS
Software re-verification on new architectures too costly
8
Are Emulators and JITs for Multi-Core Different?
9
Are Emulators and JITs for Multi-Core Different?
Yes
10
Are Emulators and JITs for Multi-Core Different?
Yes
Bountiful parallel resources
Example: Code optimization cost is reduced
“hot spot” analysis may miss sequential performance
Parallelize client application
11
Are Emulators and JITs for Multi-Core Different?
Yes
Bountiful parallel resources
Example: Code optimization cost is reduced
“hot spot” analysis may miss sequential performance
Parallelize client application
Parameters for Multi-Core are different
Core-to-Core latencies reduced
12
Road Map
Exploit on-chip parallel resources to accelerate emulation
13
Road Map
Exploit on-chip parallel resources to accelerate emulation
14
Road Map
Exploit on-chip parallel resources to accelerate emulation
Focused on performance
15
Road Map
Exploit on-chip parallel resources to accelerate emulation
Focused on performance
Acceleration Mechanisms
16
Road Map
Exploit on-chip parallel resources to accelerate emulation
Focused on performance
Acceleration Mechanisms
1. Pipelining Virtual Architectures
17
Road Map
Exploit on-chip parallel resources to accelerate emulation
Focused on performance
Acceleration Mechanisms
1. Pipelining Virtual Architectures
2. Speculative Parallel Translation
18
Road Map
Exploit on-chip parallel resources to accelerate emulation
Focused on performance
Acceleration Mechanisms
1. Pipelining Virtual Architectures
2. Speculative Parallel Translation
3. Static & Dynamic Architecture Reconfiguration
19
Road Map
Exploit on-chip parallel resources to accelerate emulation
Focused on performance
Acceleration Mechanisms
1. Pipelining Virtual Architectures
2. Speculative Parallel Translation
3. Static & Dynamic Architecture Reconfiguration
Proof of concept system
All software parallel translator: x86 on Raw
20
Data RAM
Disk
Background: Translation
x86
Binary
Runtime -- Execution
x86
Binary
Code Cache Code Cache
Tags
Translator
x86 Parser &
High Level
Translator
High Level
Optimization
Low Level
Code Generation
Low Level
Optimization and
Scheduling
21
Background: “Old” Parallel Translation
22
1. Pipelining Virtual Architectures
23
1. Pipelining Virtual Architectures
Translator
Tile
Code Cache
Tile
Execution
Tile
MMU
Tile
Data Cache
Tile
24
1. Pipelining Virtual Architectures
Utilize a Tiled Processor as fabric to construct virtual processor
Translator
Tile
Code Cache
Tile
Execution
Tile
MMU
Tile
Data Cache
Tile
25
1. Pipelining Virtual Architectures
Utilize a Tiled Processor as fabric to construct virtual processor
Coarse grain pipelining to exploit parallelism
Translator
Tile
Code Cache
Tile
Execution
Tile
MMU
Tile
Data Cache
Tile
26
Sequential TranslationExecution
Time
Translation
Time
Optimization
Time
Tim
e
27
2. Speculative Parallel TranslationExecution
Time
Translation
Time
Optimization
Time
Tim
e
28
3. Reconfiguration
Different programs have different characteristics
Processor Architect uses benchmarks to choose “compromise”
processor
29
3. Reconfiguration
Different programs have different characteristics
Processor Architect uses benchmarks to choose “compromise”
processor
Static Reconfiguration
Choose different virtual machine configuration based off application
Dynamic Reconfiguration
Detect phases/programs dynamic needs and reconfigure at runtime
Cost to reconfiguration
30
Reconfiguration
Photo
Courtesy
Intel Corp.
31
Prototype System and Evaluation
32
Background: Architectures
x86 (Pentium III)
CISC instruction set
Hardware Virtual Memory (VM)
Hardware Memory Protection
Condition Codes used for branching
Hardware instruction cache
1 superscalar processor core
3-way parallelism
Out-of-order processor
Raw
RISC instruction set
No VM
No Memory Protection
No Condition Codes
Software managed instruction memory
16 Processors arranged in 4x4 mesh
4 low latency networks
In-order processors
ISA
Impl
emen
tatio
n
* Photo
Courtesy
Intel Corp.
*
33
System Design
Manager
L2 Code
Cache
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Banked L1.5 Code Cache
Runtime -- Execution
L1 Code
Cache
L1 Data
Cache
MMU
TLB
System
Functionality
& Loader
L2 Data $
Bank 0
L2 Data $
Bank 1
L2 Data $
Bank 2
L2 Data $
Bank 3
34
System Design
Manager
L2 Code
Cache
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Banked L1.5 Code Cache
Runtime -- Execution
L1 Code
Cache
L1 Data
Cache
MMU
TLB
System
Functionality
& Loader
L2 Data $
Bank 0
L2 Data $
Bank 1
L2 Data $
Bank 2
L2 Data $
Bank 3
Runtime-ExecL1
I-$
L1
D-$
35
Methodology
Cycle comparison of Raw vs. Pentium III
All results collected on real hardware
No hardware added
Same binaries (unmodified)
Metric: Slowdown
Raw executing x86 code compared by cycle against Pentium III
36
Baseline Performance
37
2. Speculative Parallel Translation
38
L2 Code Cache Miss Rate
2. Speculative Parallel TranslationC
ode
Cac
he
39
3. Static Reconfiguration
Manager
L2 Code
Cache
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Banked L1.5 Code CacheRuntime-Exec
L1
I-$
L1
D-$
MMU
TLB
System
Functionality
& Loader
L2 Data $
Bank 0
L2 Data $
Bank 1
L2 Data $
Bank 2
L2 Data $
Bank 3
40
3. Static Reconfiguration
Manager
L2 Code
Cache
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Banked L1.5 Code CacheRuntime-Exec
L1
I-$
L1
D-$
MMU
TLB
System
Functionality
& Loader
L2 Data $
Bank 0
L2 Data $
Bank 1
L2 Data $
Bank 2
L2 Data $
Bank 3
41
3. Static Reconfiguration
Manager
L2 Code
Cache
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Banked L1.5 Code CacheRuntime-Exec
L1
I-$
L1
D-$
MMU
TLB
System
Functionality
& Loader
L2 Data $
Bank 0
Translation
Slave
Translation
Slave
Translation
Slave
42
3. Dynamic Reconfiguration
Manager
L2 Code
Cache
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Banked L1.5 Code CacheRuntime-Exec
L1
I-$
L1
D-$
MMU
TLB
System
Functionality
& Loader
L2 Data $
Bank 0
L2 Data $
Bank 1
L2 Data $
Bank 2
L2 Data $
Bank 3
Translation
Slave
Translation
Slave
Translation
Slave
43
3. Dynamic Reconfiguration
Manager
L2 Code
Cache
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Translation
Slave
Banked L1.5 Code CacheRuntime-Exec
L1
I-$
L1
D-$
MMU
TLB
System
Functionality
& Loader
L2 Data $
Bank 0
L2 Data $
Bank 1
L2 Data $
Bank 2
L2 Data $
Bank 3
Translation
Slave
Translation
Slave
Translation
Slave
44
3. Reconfiguration Zoom
45
Baseline Performance Analysis
46
Baseline Performance Analysis
47
Baseline Performance Analysis
Intrinsic Raw Emulator Pentium III
latency occupancy latency occupancy
L1 Cache Hit 6 4 3 1
L2 Cache Hit 87 87 7 1
L2 Cache Miss 151 87 79 1
48
Baseline Performance Analysis
Intrinsic Raw Emulator Pentium III
latency occupancy latency occupancy
L1 Cache Hit 6 4 3 1
L2 Cache Hit 87 87 7 1
L2 Cache Miss 151 87 79 1
CPI = (memory_access_rate * (((1 – L1_miss_rate) * L1_hit_occupancy) +
(L1_miss_rate * (((1 – L2_miss_rate * L2_miss_occupancy))))) + ((1 –
memory_access_rate) * non_memory_CPI)
49
Baseline Performance Analysis
Intrinsic Raw Emulator Pentium III
latency occupancy latency occupancy
L1 Cache Hit 6 4 3 1
L2 Cache Hit 87 87 7 1
L2 Cache Miss 151 87 79 1
CPI = (memory_access_rate * (((1 – L1_miss_rate) * L1_hit_occupancy) +
(L1_miss_rate * (((1 – L2_miss_rate * L2_miss_occupancy))))) + ((1 –
memory_access_rate) * non_memory_CPI)
Memory CPI 3.9 1
50
Baseline Performance Analysis
Memory System 3.9x slowdown
Lack of ILP 1.3x slowdown
Condition Codes (Flags) 1.1x slowdown
51
Baseline Performance Analysis
Memory System 3.9x slowdown
Lack of ILP 1.3x slowdown
Condition Codes (Flags) x 1.1x slowdown
5.5x slowdown
52
Baseline Performance Analysis
Memory System 3.9x slowdown
Lack of ILP 1.3x slowdown
Condition Codes (Flags) x 1.1x slowdown
5.5x slowdown
Code Cache Misses 1 – 20x slowdown
53
Baseline Performance Analysis
54
Future Work
Hardware additions to facilitate parallel emulation
x86 Server farm on a chip
Inter-Virtual Processor dynamic load balancing
Differing forms of Dynamic Reconfiguration
Vary number of functional units
55
Questions ?
56
Extras