Enabling Hybrid Parallel Runtimes Through Kernel and Virtualization Support
Kyle C. Hale and Peter Dinda
HPDC '15

Hybrid Runtimes
the runtime IS the kernel
the runtime is not limited to the abstractions exposed by the syscall interface
opportunity to leverage privileged HW features
CURRENT OS/R MODEL
[diagram: PARALLEL APP atop PARALLEL RUNTIME in user-mode, GENERAL PURPOSE OS in kernel-mode, HARDWARE below]
THE HYBRID RUNTIME
[diagram: alongside the current model, the PARALLEL APP runs directly on the HRT, a kernel mashup of OS and runtime, in kernel-mode on HARDWARE]
Racket
Legion
NDPC
NESL
[plot: Speedup over Linux vs. Legion Processor Count (Cores), 1–220 cores, on Xeon Phi and x64; up to 1.19x and 1.37x speedup over Linux]
TWO ENABLING TOOLS
NAUTILUS: kernel framework
HVM: THE HYBRID VIRTUAL MACHINE
bridge HRT with legacy OS
OUTLINE
Background/Overview
Nautilus
Deployment Models
Hybrid Virtual Machine
Multiverse & Future Work
NAUTILUS (aerokernel)
[diagram: PARALLEL APP and RUNTIME run in kernel-mode directly on HARDWARE; Nautilus primitives & utilities (THREADS, SYNC, PAGING, EVENTS, HW INFO, BOOTSTRAP, TIMERS, IRQS, CONSOLE) sit alongside, and the HRT can use or not use any of them]
Nautilus under the hood
Nautilus
Parallel Runtime System
kernel primitives should be SIMPLE and FAST
so the runtime developer can easily reason about them!
start with familiar interfaces
condition variables
threads
mutexes/locks
fork/join parallelism
memory management
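To make that concrete, here is a minimal sketch of what such a kernel-mode interface could look like. The names (kt_create, cv_wait, etc.) are illustrative assumptions, not Nautilus's actual API; the point is that the primitives mirror what runtime authors already know, minus the user/kernel boundary.

/* Illustrative aerokernel-style interface sketch (hypothetical names,
 * not Nautilus's actual API). Everything runs in kernel mode within
 * one shared address space. */
#include <stddef.h>

typedef struct kthread kthread_t;
typedef struct klock   klock_t;
typedef struct kcv     kcv_t;

kthread_t *kt_create(void (*fn)(void *), void *arg, int cpu); /* bind to a CPU */
void       kt_join(kthread_t *t);
void       kt_yield(void);              /* cooperative scheduling point */

void lock_acquire(klock_t *l);
void lock_release(klock_t *l);
void cv_wait(kcv_t *cv, klock_t *l);    /* familiar condvar semantics */
void cv_signal(kcv_t *cv);

/* fork/join-style parallelism a runtime might layer on top */
static void worker(void *arg) { (void)arg; /* runtime task body */ }

void fork_join(int ncpus) {
    kthread_t *ts[64];
    for (int i = 0; i < ncpus && i < 64; i++)
        ts[i] = kt_create(worker, NULL, i);   /* one thread per core */
    for (int i = 0; i < ncpus && i < 64; i++)
        kt_join(ts[i]);
}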
Parallel Runtime System on logical CPUs: map the runtime's logical view of the machine onto the physical HW
threads: unified, shared address space
threads can operate preemptively or cooperatively*
*for runtimes that require more determinism
thread creation is FAST
[plot: thread creation cost in cycles vs. thread count (2–64): Linux (pthreads) ~3ms vs. Nautilus ~90µs; x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM]
user-mode software events are SLOW
[plot: trigger latency in cycles (0–35,000; lower is better): software events (cond. var., futex) vs. hardware (IPI); x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM]
Nautilus event triggers are FASTER
[plot: trigger latency in cycles (0–35,000): cond. var., futex, naut. cond. var., naut. cond. var. w/IPI, hardware (IPI); x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM]
[plot: trigger latency in cycles (0–1800): IPI vs. Nemo; Nemo within ~100 cycles of a raw IPI]
memory
static identity map at bootup
[diagram: physical memory identity-mapped with large (2MB / 1GB) page sizes]
eliminates expensive page faults
reduces TLB misses + shootdowns
increases performance under virtualization
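A minimal sketch of how such a static identity map can be built with 2MB pages on x86_64 4-level paging, assuming the code runs at boot where physical and virtual addresses coincide (table and constant names are illustrative). Mapping only the first 1GB keeps the example short; a real kernel would loop over all of physical memory.

#include <stdint.h>

#define PTE_P  (1ULL << 0)   /* present */
#define PTE_W  (1ULL << 1)   /* writable */
#define PTE_PS (1ULL << 7)   /* in a PD entry: this is a 2MB page */

static uint64_t pml4[512] __attribute__((aligned(4096)));
static uint64_t pdpt[512] __attribute__((aligned(4096)));
static uint64_t pd[512]   __attribute__((aligned(4096)));

void build_identity_map_1gb(void) {
    /* 512 consecutive 2MB pages = the first 1GB, identity mapped */
    for (uint64_t i = 0; i < 512; i++)
        pd[i] = (i << 21) | PTE_P | PTE_W | PTE_PS;
    pdpt[0] = (uint64_t)pd   | PTE_P | PTE_W;
    pml4[0] = (uint64_t)pdpt | PTE_P | PTE_W;
    /* install: physical == virtual here, so the table address works as-is */
    asm volatile("mov %0, %%cr3" :: "r"((uint64_t)pml4) : "memory");
}

With the whole map fixed at bootup there is nothing to fault in later, and large pages keep the TLB footprint small.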
NUMA: first-touch
[diagram: thread stacks placed within NUMA domains]
runtime has FULL control over thread placement and memory layout
philix
[diagram: the philix host utility uses the Intel MPSS stack to boot a custom OS on the Xeon Phi and exposes a virtual console on the host]
Nautilus is fairly small:
Nautilus                  25,000 lines (mostly C)
Legion                    43,000 lines (C++)
Additions for Legion         800 lines (C/C++)
Additions for Xeon Phi     1,350 lines (C)
What do we lose?
familiar environment
driver ecosystem
protection/isolation
OUTLINE
Background/Overview
Nautilus
Deployment Models
Hybrid Virtual Machine
Multiverse & Future Work
DEDICATED
[diagram: APPLICATION on HRT with full HW access to all of the HARDWARE]
PARTITIONED
[diagram: LINUX STACK on cores 0…K-1; APPLICATION on HRT with full HW access on cores K…N-1]
[diagram: VMM hosts two VMs — VIRTUAL MACHINE 0 runs the LINUX STACK; VIRTUAL MACHINE 1 runs the PARALLEL APP on the HRT]
Hybrid Virtual Machine (HVM)
[diagram: within a single VM on the VMM, regular OS vcores run the LINUX STACK while HRT vcores run the PARALLEL APP on the HRT]
OUTLINE
Background/Overview
Nautilus
Deployment Models
Hybrid Virtual Machine
Multiverse & Future Work
GENERAL VIRTUALIZATION MODEL vs. SPECIALIZED VIRTUALIZATION MODEL
[diagram: one node runs the regular OS (ROS) stack — PARALLEL APP atop PARALLEL RUNTIME in user-mode on a GENERAL PURPOSE OS; the other runs the PARALLEL APP directly on the HRT in kernel-mode; the HVM combines both on one node]
performance path through the HRT; legacy functionality from the ROS via the HVM
ROS/HRT setup
[diagram: ROS vcores and HRT vcores within one Virtual Machine exchange IPIs]
ROS/HRT setup
[diagram: ROS vcores see only the ROS-visible portion of VM memory; HRT vcores see all of it]
merged address space
[diagram: the ROS vaddr space holds app + runtime code & data in the lower half and the ROS kernel in the higher half; the HRT vaddr space holds the Aerokernel in the higher half; the merged space pairs the ROS's app + runtime code & data with the Aerokernel higher half]
ROS/HRT communication
[diagram: LINUX and the HRT communicate through the HYPERVISOR + HVM via a shared data page: HRT reboot request, address space merger, HRT function invocation]
ROS/HRT communication
[diagram: synchronous operation requests flow between LINUX and the HRT through a comm. area in the shared data page]
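A sketch of how a synchronous request over such a shared comm. area might work, assuming a simple poll-based protocol (the struct layout and names are invented for illustration, not the actual HVM interface): the ROS posts a request and spins; a dedicated HRT core polls, executes the function, and publishes the result.

#include <stdint.h>

struct comm_area {                 /* lives in the shared data page */
    volatile uint64_t seq;         /* bumped by the ROS to post a request */
    volatile uint64_t done;        /* set by the HRT when the result is ready */
    uint64_t func_id;
    uint64_t args[4];
    uint64_t result;
};

/* ROS side: post one request and spin until the HRT answers */
uint64_t hrt_call_sync(struct comm_area *c, uint64_t func, uint64_t arg0) {
    c->func_id = func;
    c->args[0] = arg0;
    c->done    = 0;
    __atomic_store_n(&c->seq, c->seq + 1, __ATOMIC_RELEASE);
    while (!__atomic_load_n(&c->done, __ATOMIC_ACQUIRE))
        ;                          /* spin on the cache-coherent shared page */
    return c->result;
}

/* HRT side: a core dedicated to servicing synchronous requests */
void hrt_service_loop(struct comm_area *c,
                      uint64_t (*dispatch)(uint64_t, volatile uint64_t *)) {
    uint64_t last = c->seq;
    for (;;) {
        while (__atomic_load_n(&c->seq, __ATOMIC_ACQUIRE) == last)
            ;                      /* poll for the next request */
        last = c->seq;
        c->result = dispatch(c->func_id, c->args);
        __atomic_store_n(&c->done, 1, __ATOMIC_RELEASE);
    }
}

Polling on the shared page is what makes sub-microsecond synchronous invocations (table below) plausible: the handoff rides the cache coherence network rather than an interrupt or a scheduler wakeup.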
ROS/HRT communication
                                           Cycles   Time
Addr space merge                           ~33K     15µs
Async function invocation                  ~25K     11µs
Sync function invocation (remote socket)   ~1060    482ns
Sync function invocation (same socket)     ~790     359ns
ROS boot
[diagram: the ROS BSP sends an INIT-SIPI-SIPI sequence to each AP, which then runs its bootstrap]
HRT boot
HVM sets cores up in long mode*
*no need for the INIT-SIPI-SIPI sequence
HVM sets up registers and system tables (GDT, IDT, TSS, initial page tables)
HVM boots cores simultaneously
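As a rough illustration, the per-core startup state the HVM would have to install might look like the following struct; the exact field set is an assumption, but it conveys why no real-mode INIT-SIPI-SIPI dance is needed — each virtual core simply materializes in 64-bit mode at the HRT entry point.

#include <stdint.h>

/* Hypothetical descriptor of the state a VMM installs per HRT vcore */
struct hrt_core_state {
    uint64_t cr0, cr3, cr4, efer;    /* paging on, long mode enabled */
    uint64_t rip, rsp, rflags;       /* HRT entry point and boot stack */
    uint64_t gdtr_base, idtr_base;   /* pre-built GDT and IDT */
    uint16_t gdtr_limit, idtr_limit;
    uint16_t cs, ss, tr;             /* 64-bit code selector, stack, TSS */
};

/* The HVM fills one of these per core and launches all HRT vcores at
 * once, so "boot" is just entering the guest -- hence the ~61µs core
 * boot vs. the ~714µs Linux fork + exec path shown next. */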
LINUX FORK + EXEC ~ 714µs
HVM + HRT CORE BOOT ~ 61µs
how do we take a legacy runtime to the HRT + X model?
PORT
porting a runtime/app to a new OS environment is…
DIFFICULT  TIME-CONSUMING  ERROR-PRONE
development cycle:
do {
    ADD FUNCTION
    REBUILD
    BOOT HRT
} while (HRT falls over);
much of the functionality is
NOT ON THE CRITICAL PATH
we want to make this
easier
OUTLINE
Background/Overview
Nautilus
Deployment Models
Hybrid Virtual Machine
Multiverse & Future Work
give us your legacy runtime
we automatically transform it to run
as an HRT
bridged with a legacy OS in kernel mode
(AUTOMATIC HYBRIDIZATION)
rebuild with our toolchain
LEGACY APP/RUNTIME
[diagram: the LEGACY APP/RUNTIME plus the MULTIVERSE RUNTIME LAYER are linked into an AEROKERNEL BINARY; on LINUX it runs like a normal application, while under the HYPERVISOR + HVM the same binary runs as a BOOTED AEROKERNEL in kernel-mode]
Hybrid Virtual Machine: bridge the HRT with a legacy OS
Multiverse: automatic transformation of a legacy app + runtime into an HRT
4 parallel runtimes (Legion, Racket, NDPC, NESL)
thank you
my webpage: halek.co
lab: presciencelab.org
download nautilus: nautilus.halek.co
v3vee project: v3vee.org
backups
thread fork
interrupt-driven execution
memory and paging
static identity map at bootup
[diagram: physical memory identity-mapped with large (2MB / 1GB) page sizes]
eliminates expensive page faults
reduces TLB misses + shootdowns
increases performance under virtualization
[plot: unified TLB misses, Linux vs. Nautilus (log scale, 1–1,000,000)]
thread creation is PREDICTABLE
[plot: kernel thread creation cycles (log scale, 1000–1×10^8; lower is better), Linux (pthreads) vs. Nautilus; x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM]
binding to physical processor is GUARANTEED
RUNTIME control over virtual CPU → physical CPU mapping
interrupts & events
asynchronous notifications
EVENT COMPLETES → trigger → DEPENDENT TASK EXECUTES
trigger latency
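For reference, triggering such an event on another core is just a couple of memory-mapped writes to the local APIC's interrupt command register (xAPIC shown; the vector number is an illustrative choice). The destination core's interrupt handler runs the dependent task directly, which is where the low trigger latencies below come from.

#include <stdint.h>

#define APIC_BASE   0xFEE00000ULL   /* conventional xAPIC MMIO base */
#define APIC_ICR_LO 0x300           /* interrupt command register, low */
#define APIC_ICR_HI 0x310           /* destination field, high */
#define EVENT_VEC   0xF0            /* illustrative event vector */

static inline void apic_write(uint32_t reg, uint32_t val) {
    *(volatile uint32_t *)(APIC_BASE + reg) = val;
}

/* Fire the event IPI at one core: fixed delivery mode, edge triggered */
void trigger_event_ipi(uint8_t dest_apic_id) {
    apic_write(APIC_ICR_HI, (uint32_t)dest_apic_id << 24);
    apic_write(APIC_ICR_LO, EVENT_VEC);
}

/* On the destination, the handler registered for EVENT_VEC invokes
 * the dependent task immediately -- no scheduler involved. */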
user-mode software events are SLOW
[plot: trigger latency in cycles (0–35,000; lower is better): software events (cond. var., futex) vs. hardware (IPI); x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM]
Nautilus event triggers are FASTER
[plot: trigger latency in cycles (0–35,000): cond. var., futex, naut. cond. var., naut. cond. var. w/IPI, hardware (IPI); x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM]
[diagram: handle_event() dispatched directly on remote cores]
asynchronous, IPI-based execution
NOT POSSIBLE IN LINUX USERSPACE
[plot: trigger latency in cycles (0–1800; lower is better): IPI vs. Nemo; Nemo within ~100 cycles of a raw IPI]
this gives us an
incremental path for creating HRTs
start out with a working
HRT system
then pull functionality
into the HRT for hot spots
RACKET
most widely used Scheme implementation
downloaded ~300 times/day
complex runtime with JIT
[diagram: the LEGACY APP/RUNTIME + MULTIVERSE RUNTIME LAYER, as a BOOTED AEROKERNEL under the HYPERVISOR + HVM, forwards system calls to LINUX over ROS/HRT communication]
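A sketch of the redirection a Multiverse-style layer could perform, reusing the synchronous shared-page invocation sketched earlier (names are again illustrative, not the actual Multiverse interface): instead of executing a Linux syscall locally — which the aerokernel cannot — the HRT packages it up and lets the ROS side run the real thing.

#include <stdint.h>

struct comm_area;   /* shared data page, as in the earlier sketch */
uint64_t hrt_call_sync(struct comm_area *c, uint64_t func, uint64_t arg0);

extern struct comm_area *g_comm;   /* set up during HRT boot */

/* HRT-side syscall entry point: forward to Linux instead of failing.
 * A real layer would marshal all six syscall arguments; here the
 * earlier sketch's func_id carries the syscall number. */
long hrt_syscall(long nr, long arg0) {
    return (long)hrt_call_sync(g_comm, (uint64_t)nr, (uint64_t)arg0);
}

/* On the ROS side, the service loop's dispatch function performs the
 * native Linux system call and returns its result over the page. */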
Broadcast wakeups on x64 (cycles to wakeup; Nautilus events)
pthread condvar           µ = 995795   min = 17538  max = 2172770  σ = 544512
futex wakeup              µ = 370630   min = 16402  max = 1895530  σ = 199680
Aerokernel condvar        µ = 265820   min = 3258   max = 612959   σ = 159421
Aerokernel condvar + IPI  µ = 132417   min = 7842   max = 464015   σ = 98637
broadcast IPI             µ = 12827    min = 1252   max = 57467    σ = 2931
Wakeup deviation on x64
[plot: CDF of wakeup σ (1000–1×10^6): pthread condvar, futex broadcast, Aerokernel condvar, Aerokernel condvar + IPI, broadcast IPI]
Wakeup deviation on x64 (Nautilus events)
[plot: CDF of wakeup σ: broadcast IPI marks the HW lower bound; Nemo events (kernel-mode) sit within ~2x of it, user-mode events far above]
Broadcast wakeups on x64 (cycles to wakeup)
pthread condvar  µ = 995795  min = 17538  max = 2172770  σ = 544512
futex wakeup     µ = 370630  min = 16402  max = 1895530  σ = 199680
broadcast IPI    µ = 12827   min = 1252   max = 57467    σ = 2931
Unicast IPIs on phi
[plot: CDF of cycles measured from the BSP (core 224), 710–800; 95th percentile = 746 cycles]
Roundtrip IPIs on phi
[plot: CDF of cycles measured from the BSP (core 224), 2080–2200; 95th percentile = 2105 cycles]
Multicast IPIs on phi
[plot: CDF of σ; 95th percentile σ = 31 cycles]
Wakeup deviation on x64
[plot: CDF of σ (1000–1×10^6): pthread condvar, futex broadcast, broadcast IPI]
Wakeup deviation on x64
[plot: CDF of σ: user-mode events sit ~70x above the HW lower bound (broadcast IPI)]
thread context switch on x64 (cycles)
Linux (pthreads)    µ = 1413  min = 1352  max = 2438  σ = 114
Aerokernel threads  µ = 1270  min = 1176  max = 1391  σ = 48
thread create + launch on phi
[plot: cycles (log scale, 10^4–10^7): Linux (pthreads), Linux (kernel threads), Nautilus kernel threads]
thread create + launch (many threads) on x64
[plot: cycles (0–1.2×10^7) vs. threads (2–64): Linux pthreads, Linux kernel threads, Nautilus threads]
thread create + launch (many threads) on phi
[plot: cycles (0–1.6×10^8) vs. threads (2–228): Linux pthreads, Linux kernel threads, Nautilus threads]
Unicast IPIs on x64
[plot: CDF of cycles measured from the BSP (core 0), 1000–2400; 95th percentile = 1728 cycles]
Roundtrip IPIs on x64
[plot: CDF of cycles measured from the BSP (core 0), 3600–5000; 95th percentile = 4640 cycles]
Multicast IPIs on x64
[plot: CDF of σ, 2000–8000; 95th percentile σ = 2812 cycles]
Unicast IPI vs memory polling
[plot: CDF of cycles measured from the BSP (core 0), 0–2000: unicast IPI vs. synchronous event (mem polling) vs. projected remote syscall; polling rides the coherence network while IPIs use the interrupt network, and the gap to the projected remote syscall is the cost of destination-side handling]
Single wakeup on phi (cycles to wakeup)
pthread condvar  µ = 25723  min = 24333  max = 27834  σ = 618
futex wakeup     µ = 15637  min = 13311  max = 22343  σ = 1173
unicast IPI      µ = 741    min = 732    max = 761    σ = 5.7
Wakeup deviation on phi
[plot: CDF of σ (10–1×10^7): pthread condvar, futex broadcast, broadcast IPI]
Wakeup deviation on phi
[plot: CDF of σ: user-mode events sit ~79000x above the HW lower bound (broadcast IPI)]
Single wakeup on x64 (CDF)
[plot: CDF of cycles to wakeup (0–40000): user-mode condvar, futex wakeup, oneway IPI]
Single wakeup on phi (CDF)
[plot: CDF of cycles to wakeup (0–30000): user-mode condvar, futex wakeup, oneway IPI]
Single wakeup on phi (cycles to wakeup)
pthread condvar           µ = 25723  min = 24333  max = 27834  σ = 618
futex wakeup              µ = 15637  min = 13311  max = 22343  σ = 1173
Aerokernel condvar        µ = 9015   min = 5528   max = 14074  σ = 2507
Aerokernel condvar + IPI  µ = 6484   min = 5953   max = 7305   σ = 214
unicast IPI               µ = 741    min = 732    max = 761    σ = 5.7
Many wakeups on phi (cycles to wakeup)
pthread condvar           µ = 5648810  min = 31743  max = 11780700  σ = 3085050
futex wakeup              µ = 2250960  min = 28545  max = 4550090   σ = 1275510
Aerokernel condvar        µ = 1020750  min = 7199   max = 7255960   σ = 1141100
Aerokernel condvar + IPI  µ = 558155   min = 8267   max = 3931420   σ = 428849
broadcast IPI             µ = 1154     min = 1016   max = 1225      σ = 17
Wakeup deviation on phi
[plot: CDF of σ (10–1×10^7): pthread condvar, futex broadcast, Aerokernel condvar, Aerokernel condvar + IPI, broadcast IPI]
Wakeup deviation on phi
[plot: CDF of σ: broadcast IPI marks the HW lower bound; Nemo events (kernel-mode) sit within ~4x of it, user-mode events far above]
Broadcast wakeups on phi (cycles to wakeup)
pthread condvar  µ = 5648810  min = 31743  max = 11780700  σ = 3085050
futex wakeup     µ = 2250960  min = 28545  max = 4550090   σ = 1275510
broadcast IPI    µ = 1154     min = 1016   max = 1225      σ = 17
Single wakeup on x64 (CDF)
[plot: CDF of cycles to wakeup (0–40000): user-mode condvar, futex wakeup, kernel-mode condvar, kernel-mode condvar + IPI, oneway IPI]
Single wakeup on phi (CDF)
[plot: CDF of cycles to wakeup (0–30000): user-mode condvar, futex wakeup, kernel-mode condvar, kernel-mode condvar + IPI, oneway IPI]
racket multiverse overheads
[plot: runtime (s, 0–90) for fannkuch-redux, binary-tree-2, fasta, fasta-3, nbody, spectral-norm, mandelbrot-2: Native vs. Virtual vs. Multiverse]
system call overheads
[plot: runtime (cycles, log scale 100–1×10^8) for getpid, gettimeofday, fwrite, stat, read, getcwd, open, close, mmap: Virtual vs. Multiverse; forwarded calls cost ~40µs]