Enabling Hybrid Parallel Runtimes Through Kernel and Virtualization Support

Kyle C. Hale and Peter Dinda

Hybrid Runtimes

runtime not limited to abstractions exposed by syscall interface

HPDC ‘15

opportunity for leveraging privileged HW features

the runtime IS the kernel

CURRENT OS/R MODEL

[Diagram: PARALLEL APP and PARALLEL RUNTIME in user-mode, on a GENERAL PURPOSE OS in kernel-mode, on HARDWARE.]

THE HYBRID RUNTIME

[Diagram: the current stack (PARALLEL APP, PARALLEL RUNTIME, GENERAL PURPOSE OS, HARDWARE) next to the hybrid stack (PARALLEL APP, HYBRID RUNTIME, HARDWARE). Callout: kernel mashup of OS and runtime.]

Racket

Legion

NDPC

NESL

[Figure: Legion speedup over Linux (y-axis, 1 to 1.4) vs. Legion processor count in cores (x-axis, 1 to 220), with series for Xeon Phi and x64. Callouts: 1.19x speedup over Linux; 1.37x speedup over Linux.]

TWO ENABLING TOOLS

11

kernel framework

HVM: THE HYBRID VIRTUAL MACHINE

bridge HRT with legacy OS

OUTLINE

Background/Overview

Nautilus

Deployment Models

Hybrid Virtual Machine

Multiverse & Future Work

14

PARALLEL APP

RUNTIME

THREADS SYNC PAGING EVENTS HW INFO BOOTSTRAP TIMERS IRQS CONSOLE

user-mode

kernel-mode

HARDWARE

Nautilus primitives & utilities (HRT can use or not use any of them)

aerokernel

NAUTILUS

Nautilus under the hood

[Diagram labels: Nautilus, Parallel Runtime System.]

kernel primitives should be SIMPLE and FAST

runtime developer can easily reason about them!

start with familiar interfaces

condition variables

threads

mutexes/locks

fork/join parallelism

memory management

[Diagram labels: Parallel Runtime System, logical CPUs.]

map runtime’s logical view of machine onto physical HW

threads

unified, shared address space

threads can operate preemptively or cooperatively*

*for runtimes that require more determinism
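To make the determinism point concrete, here is a minimal sketch of cooperative scheduling using Python generators. It is not Nautilus code: the names `worker` and `run_round_robin` are hypothetical, and the only claim illustrated is that when tasks switch solely at explicit yield points, the interleaving is fixed by the program rather than by timer interrupts.

```python
from collections import deque

def worker(name, steps, log):
    """A cooperative task: it runs until it voluntarily yields."""
    for i in range(steps):
        log.append((name, i))
        yield  # explicit yield point; no switch happens anywhere else

def run_round_robin(tasks):
    """Run generator-based tasks in round-robin order until all finish."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)          # resume the task up to its next yield
            ready.append(task)  # still alive: back of the queue
        except StopIteration:
            pass                # task finished; drop it

log = []
run_round_robin([worker("A", 2, log), worker("B", 3, log)])
print(log)
```

Every run produces the same order of log entries, which is the property a runtime would be relying on when it chooses cooperative over preemptive operation.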

thread creation is FAST

[Figure: thread creation cost in cycles (y-axis, 0 to 7,000,000; x-axis ticks 2, 4, 8, 16, 32, 64), Nautilus vs. Linux (pthreads). Callouts: ~3ms; ~90µs.]

x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM

user-mode software events are SLOW

[Figure: trigger latency in cycles (y-axis, 0 to 35,000; lower is better) for hardware (IPI) vs. software events (cond. var., futex).]

Nautilus event triggers are FASTER

[Figure: same axes, with naut. cond. var. and naut. cond. var. w/IPI added alongside hardware (IPI), cond. var., and futex.]

x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM

[Figure: trigger latency in cycles (y-axis, 0 to 1800) for IPI vs. Nemo. Callout: ~100 cycles.]

memory

static identity map

[Diagram: physical memory is identity-mapped at bootup; page size labels: 1GB, 2MB.]

eliminates expensive page faults

reduces TLB misses + shootdowns

increases performance under virtualization
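A rough sketch of why large pages help: the number of mappings needed to cover memory shrinks by orders of magnitude. The 128GB figure is the RAM size of the Opteron machine quoted in these slides, and the 2MB and 1GB sizes come from the diagram; the 4KB row is added here only as a comparison baseline (the standard x86_64 base page size), and the helper name `pages_needed` is hypothetical.

```python
GIB = 1 << 30
PAGE_SIZES = {"4KB": 4 << 10, "2MB": 2 << 20, "1GB": 1 << 30}

def pages_needed(mem_bytes, page_bytes):
    """Number of pages (rounded up) needed to map mem_bytes."""
    return -(-mem_bytes // page_bytes)

mem = 128 * GIB  # RAM size of the test machine quoted in the slides
counts = {name: pages_needed(mem, size) for name, size in PAGE_SIZES.items()}
for name, n in counts.items():
    print(f"{name} pages: {n}")
```

Fewer, larger mappings mean each TLB entry covers more memory, which is consistent with the reduced TLB misses claimed above.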

NUMA

[Diagram labels: first-touch, stack, NUMA domain, NUMA domain.]

runtime has FULL control over thread placement and memory layout

32

philix

[Diagram: host and Xeon Phi. Labels: philix host utility, Intel MPSS stack, custom OS, virtual console.]

Nautilus is fairly small

Nautilus: 25,000 lines of mostly C
Legion: 43,000 lines of C++
Additions for Legion: 800 lines of C/C++
Additions for Xeon Phi: 1350 lines of C

What do we lose?

familiar environment

driver ecosystem

protection/isolation


DEDICATED

[Diagram: APPLICATION on HRT on HARDWARE; full HW access.]

PARTITIONED

[Diagram: HARDWARE cores CORE 0 … CORE K-1 and CORE K … CORE N-1, split between a LINUX STACK and APPLICATION on HRT; full HW access.]

VM

[Diagram: VMM on HARDWARE hosting VIRTUAL MACHINE 0 and VIRTUAL MACHINE 1. Labels: LINUX STACK, PARALLEL APP, HRT.]

Hybrid Virtual Machine

[Diagram: VMM with HVM on HARDWARE. Labels: LINUX STACK, PARALLEL APP, HRT, regular OS vcores, HRT vcores.]


[Diagram: GENERAL VIRTUALIZATION MODEL and SPECIALIZED VIRTUALIZATION MODEL side by side. Labels: PARALLEL APP, PARALLEL RUNTIME, GENERAL PURPOSE OS, regular OS (ROS), HRT, HVM, NODE HARDWARE, user-mode, kernel-mode. Callouts: performance path; legacy functionality from ROS via HVM.]

ROS/HRT setup

[Diagram: a Virtual Machine with two ROS vcores and two HRT vcores, connected by IPIs.]

[Diagram: the same Virtual Machine with its VM memory. Label: ROS visible memory.]

merged address space

[Diagram: ROS vaddr space (ROS kernel in the higher half; app + runtime code & data) next to HRT vaddr space (Aerokernel in the higher half; app + runtime code & data, merged).]

ROS/HRT communication

[Diagram: LINUX and HRT on HYPERVISOR + HVM, with a shared data page. Operations: HRT reboot request, address space merger, HRT function invocation.]

[Diagram: same layout with a comm. area alongside the shared data page. Operation: synchronous operation request.]

ROS/HRT communication

Operation                                    Cycles   Time
Addr space merge                             ~33K     15µs
Asynch function invocation                   ~25K     11µs
Synch function invocation (remote socket)    ~1060    482ns
Synch function invocation (same socket)      ~790     359ns
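The cycle and time columns can be cross-checked with a little arithmetic. The sketch below computes the clock rate each row implies; the slides do not state the clock frequency, so the roughly 2.2 GHz value that falls out is an inference from these numbers, and the helper names are hypothetical. The "~33K" and "~25K" entries are treated as 33,000 and 25,000 cycles.

```python
def implied_ghz(cycles, nanoseconds):
    """Clock rate in GHz implied by a (cycles, time) pair: cycles per ns."""
    return cycles / nanoseconds

def cycles_to_ns(cycles, ghz):
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles / ghz

# (cycles, nanoseconds) pairs taken from the ROS/HRT communication slide.
rows = {
    "addr space merge": (33_000, 15_000),
    "asynch function invocation": (25_000, 11_000),
    "synch function invocation (remote socket)": (1060, 482),
    "synch function invocation (same socket)": (790, 359),
}
rates = {name: implied_ghz(c, ns) for name, (c, ns) in rows.items()}
for name, ghz in rates.items():
    print(f"{name}: ~{ghz:.2f} GHz")
```

All four rows land close to the same rate, so the two columns are mutually consistent; the small spread is expected given the rounded "~" values.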

ROS boot

[Diagram: the ROS BSP sends INIT-SIPI-SIPI to each of the ROS APs, each of which then runs BOOTSTRAP.]

HRT boot

HVM sets cores up in long mode*

*no need for INIT-SIPI-SIPI sequence

HVM sets up registers, system tables (GDT, IDT, TSS, REGISTERS, INIT PAGE TABLES)

HVM boots cores simultaneously

LINUX FORK + EXEC ~ 714µs

HVM + HRT CORE BOOT ~ 61µs
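As a quick worked comparison of the two figures above, the ratio can be computed directly. Both inputs are the approximate values from the slide; the variable names are hypothetical.

```python
linux_fork_exec_us = 714  # LINUX FORK + EXEC, approximate value from the slide
hrt_core_boot_us = 61     # HVM + HRT CORE BOOT, approximate value from the slide

ratio = linux_fork_exec_us / hrt_core_boot_us
print(f"fork + exec takes about {ratio:.1f}x as long as an HRT core boot")
```

In other words, on these measurements booting an HRT core through the HVM is roughly an order of magnitude cheaper than a Linux fork + exec.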

how do we take a legacy runtime to the HRT + X model?

PORT

porting a runtime/app to a new OS environment is…

DIFFICULT, TIME-CONSUMING, ERROR-PRONE

development cycle:

do {
    ADD FUNCTION
    REBUILD
    BOOT HRT
} while (HRT falls over)

much of the functionality is NOT ON THE CRITICAL PATH

we want to make this easier


give us your legacy runtime

we automatically transform it to run in kernel mode as an HRT, bridged with a legacy OS

(AUTOMATIC HYBRIDIZATION)

rebuild with our toolchain

[Diagram: the LEGACY APP/RUNTIME, with a MULTIVERSE RUNTIME LAYER and an AEROKERNEL BINARY, runs like a normal application on LINUX over HYPERVISOR + HVM. Next to it, the BOOTED AEROKERNEL hosts the LEGACY APP/RUNTIME and MULTIVERSE RUNTIME LAYER in kernel-mode.]

Hybrid Virtual Machine: bridge HRT with legacy OS

Multiverse: automatic transformation, legacy app+runtime to HRT

4 parallel runtimes (Legion, Racket, NDPC, NESL)

67

thank you

my webpage: halek.co
lab: presciencelab.org
download nautilus: nautilus.halek.co
v3vee project: v3vee.org

69

backups

70

thread fork

71

interrupt driven execution

72

memory and paging


[Figure: unified TLB misses (log scale, 1 to 1,000,000), Linux vs. Nautilus.]

thread creation is PREDICTABLE

[Figure: thread creation cost in cycles (log scale, 1000 to 1e8; lower is better), Linux (pthreads) vs. Nautilus. Label: kernel threads.]

x86_64 Opteron: 64 cores, 4 sockets, 8 NUMA zones, 128GB RAM

binding to physical processor is GUARANTEED

RUNTIME control over virtual CPU to physical mapping

interrupts & events

asynchronous notifications

[Diagram: EVENT COMPLETES, then a trigger, then DEPENDENT TASK EXECUTES; the gap between the two is the trigger latency.]
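The quantity being measured can be sketched in user space as follows. This is only an illustration of the definition (time from the signaling side marking the event complete to the dependent task resuming), using Python's standard `threading.Condition`; it is not the benchmark behind the slides, it reports nanoseconds rather than cycles, and the numbers it produces include interpreter overhead. The function name `measure_trigger_latency_ns` is hypothetical.

```python
import threading
import time

def measure_trigger_latency_ns():
    """Return ns between signaling a condition and the waiter observing it."""
    cond = threading.Condition()
    state = {"ready": False, "waiting": False, "t_signal": 0, "t_wake": 0}

    def waiter():
        with cond:
            state["waiting"] = True
            cond.notify_all()  # tell the signaler that the waiter is in place
            while not state["ready"]:
                cond.wait()
            # DEPENDENT TASK EXECUTES: first thing it does is take a timestamp.
            state["t_wake"] = time.perf_counter_ns()

    t = threading.Thread(target=waiter)
    t.start()
    with cond:
        while not state["waiting"]:
            cond.wait()
        # EVENT COMPLETES: timestamp it, then trigger the dependent task.
        state["t_signal"] = time.perf_counter_ns()
        state["ready"] = True
        cond.notify_all()
    t.join()
    return state["t_wake"] - state["t_signal"]

latency = measure_trigger_latency_ns()
print(f"trigger latency: {latency} ns")
```

A real measurement would repeat this many times and report a distribution, which is why the plots in this deck show min, max, mean, and standard deviation rather than a single number.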


handle_event() handle_event()

asynchronous, IPI-based execution

NOT POSSIBLE IN LINUX USERSPACE


this gives us an incremental path for creating HRTs

start out with a working HRT system

then pull functionality into the HRT for hot spots

RACKET

most widely used Scheme implementation

downloaded ~300 times/day

complex runtime with JIT

ROS/HRT communication

[Diagram: the LEGACY APP/RUNTIME with its MULTIVERSE RUNTIME LAYER and AEROKERNEL BINARY on LINUX over HYPERVISOR + HVM, next to the BOOTED AEROKERNEL hosting LEGACY APP/RUNTIME and MULTIVERSE RUNTIME LAYER. Both sides carry a "system call" label.]

Broadcast wakeups on x64 (cycles to wakeup; y-axis 0 to 2.5×10^6; label: Nautilus events)

Mechanism                   min      max           µ         σ
pthread condvar             17538    2.17277e+06   995795    544512
futex wakeup                16402    1.89553e+06   370630    199680
Aerokernel condvar          3258     612959        265820    159421
Aerokernel condvar + IPI    7842     464015        132417    98637.4
broadcast IPI               1252     57467         12827.3   2931.32

Wakeup deviation on x64

[Figure: CDF of σ (log-scale x-axis, 1000 to 1×10^6) for pthread condvar, futex broadcast, Aerokernel condvar, Aerokernel condvar + IPI, and broadcast IPI. Callouts: HW lower bound; user-mode events; Nemo events (kernel-mode); 2x. Label: Nautilus events.]


Unicast IPIs on phi

[Figure: CDF of cycles measured from BSP (core 224), x-axis 710 to 800. 95th percentile = 746 cycles.]

Roundtrip IPIs on phi

[Figure: CDF of cycles measured from BSP (core 224), x-axis 2080 to 2200. 95th percentile = 2105 cycles.]

Multicast IPIs on phi

[Figure: CDF of σ, x-axis 15 to 18.5. 95th percentile σ = 31 cycles.]

Wakeup deviation on x64

[Figure: CDF of σ (log-scale x-axis, 1000 to 1×10^6) for pthread condvar, futex broadcast, and broadcast IPI. Callouts: HW lower bound; user-mode events; ~70x.]

thread context switch on x64 (cycles; y-axis 0 to 7000)

Implementation        min     max     µ         σ
Linux (pthreads)      1352    2438    1412.67   114.377
Aerokernel threads    1176    1391    1269.52   48.3902

thread create + launch on phi

[Figure: cycles (log scale, 10000 to 1×10^7) for Linux (pthreads), Linux (kernel threads), and Nautilus kernel threads.]

thread create + launch (many threads) on x64

[Figure: cycles (0 to 1.2×10^7) vs. threads (2 to 64) for Linux pthreads, Linux kernel threads, and Nautilus threads.]

thread create + launch (many threads) on phi

[Figure: cycles (0 to 1.6×10^8) vs. threads (2 to 228) for Linux pthreads, Linux kernel threads, and Nautilus threads.]

Unicast IPIs on x64

[Figure: CDF of cycles measured from BSP (core 0), x-axis 1000 to 2400. 95th percentile = 1728 cycles.]

Roundtrip IPIs on x64

[Figure: CDF of cycles measured from BSP (core 0), x-axis 3600 to 5000. 95th percentile = 4640 cycles.]

Multicast IPIs on x64

[Figure: CDF of σ, x-axis 2000 to 8000. 95th percentile σ = 2812 cycles.]

Unicast IPI vs memory polling

[Figure: CDF of cycles measured from BSP (core 0), x-axis 0 to 2000, for unicast IPI, synchronous event (mem polling), and projected remote syscall. Callouts: coherence network; interrupt network; cost of dest. handling.]


Wakeup deviation on phi

[Figure: CDF of σ (log-scale x-axis, 10 to 1×10^7) for pthread condvar, futex broadcast, and broadcast IPI. Callouts: HW lower bound; user-mode events; ~79000x.]

Single wakeup on x64 (CDF)

[Figure: CDF of cycles to wakeup (x-axis 0 to 40000) for U. mode condvar, Futex wakeup, and Oneway IPI.]

Single wakeup on phi (CDF)

[Figure: CDF of cycles to wakeup (x-axis 0 to 30000) for U. mode condvar, Futex wakeup, and Oneway IPI.]

Single wakeup on phi (cycles to wakeup; y-axis 0 to 30000)

Mechanism                   min      max      µ         σ
pthread condvar             24333    27834    25722.7   618.407
futex wakeup                13311    22343    15637.4   1173.37
Aerokernel condvar          5528     14074    9014.73   2507.14
Aerokernel condvar + IPI    5953     7305     6483.89   213.517
unicast IPI                 732      761      740.85    5.71205

Many wakeups on phi (cycles to wakeup; y-axis 0 to 1.2×10^7)

Mechanism                   min      max           µ             σ
pthread condvar             31743    1.17807e+07   5.64881e+06   3.08505e+06
futex wakeup                28545    4.55009e+06   2.25096e+06   1.27551e+06
Aerokernel condvar          7199     7.25596e+06   1.02075e+06   1.1411e+06
Aerokernel condvar + IPI    8267     3.93142e+06   558155        428849
broadcast IPI               1016     1225          1153.68       17.1824

Wakeup deviation on phi

[Figure: CDF of σ (log-scale x-axis, 10 to 1×10^7) for pthread condvar, futex broadcast, Aerokernel condvar, Aerokernel condvar + IPI, and broadcast IPI. Callouts: HW lower bound; Nemo events (kernel-mode); user-mode events; 4x.]


Single wakeup on x64 (CDF)

[Figure: CDF of cycles to wakeup (x-axis 0 to 40000) for U. mode condvar, Futex wakeup, K. mode condvar, K. mode condvar + IPI, and Oneway IPI.]

Single wakeup on phi (CDF)

[Figure: CDF of cycles to wakeup (x-axis 0 to 30000) for U. mode condvar, Futex wakeup, K. mode condvar, K. mode condvar + IPI, and Oneway IPI.]

racket multiverse overheads

[Figure: runtime in seconds (0 to 90) for the benchmarks fannkuch-redux, binary-tree-2, fasta, fasta-3, nbody, spectral-norm, and mandelbrot-2 under Native, Virtual, and Multiverse.]

system call overheads

[Figure: runtime in cycles (log scale, 100 to 1×10^8) for getpid, gettimeofday, fwrite, stat, read, getcwd, open, close, and mmap under Virtual and Multiverse. Callout: ~40µs.]
