51
CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak Hebrew University Jerusalem (HUJI) Hermann Härtig TU Dresden, Operating Systems Group (TUDOS) Wolfgang Nagel TU Dresden, Center for Information Services and HPC (ZIH) Alexander Reinefeld Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)

A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

CARSTEN WEINHOLD, TU DRESDEN

A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING

The Hebrew University of Jerusalem

Amnon Barak Hebrew University Jerusalem (HUJI) Hermann Härtig TU Dresden, Operating Systems Group (TUDOS) Wolfgang Nagel TU Dresden, Center for Information Services and HPC (ZIH) Alexander Reinefeld Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)

Page 2: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing

The ideal world assumption:

■ Identical, predictable, and reliable nodes

■ Fast and reliable interconnect

■ Balanced applications

■ Isolated partitions of fixed size

2

TRADITIONAL HPC

Page 3: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 3

TRADITIONAL HPCTim

e

Fixed-size chunks of work

One thread per core

Page 4: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing

Systems software:optimize communication latency

Message passing uses polling

Batch scheduler for start / stop

Separate servers for I/O

Small OS on each node

No OS on critical path

4

TRADITIONAL HPC

Page 5: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 5

LOAD

Application: CP2K

-100

0

100

200

300

400

500

600

-5 0 5 10 15 20 25 30 35 40 45 50

Proc

ess

ID

Timestep

tracing

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Com

puat

ion

time

(frac

tion)

0 60 120 180 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Com

puta

tion

time

(frac

tion)

1

129

257

384

512

10 20 30 40 50

Proc

ess I

D

Timesteps

Computation—communication ratio of CP2K on 512 cores

Page 6: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 6

LOAD

Application: COSMO-SPECS+FD4

0 20 40 60 80

100 120

0 60 120 180 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Com

puta

tion

time

(frac

tion)

1

33

65

96

128

30 60 90 120 150 180

Proc

ess I

D

Timesteps

Computation—communication ratio of COSMO-SPECS+FD4 on 128 cores

Page 7: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 7

REALITY CHECK

Hand balanced compute times of ranks per time step

Application: COSMO-SPECS+FD4

Unbalanced compute times of ranks per time step

Computation—communication ratio of COSMO-SPECS+FD4 on 128 cores

Page 8: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 8

REALITY CHECK

Application: PRIME

Now think of: • Composite applications • In-situ visualization, etc.

Unbalanced compute times of ranks per time step

Page 9: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 9

FFMK

SPPEXA – Findings & Goals

Massive parallelism (on- and cross-chip) requires fundamentally new concepts

Not “racks without brains”, but software is the key to this paradigm shift

Fundamental research ( DFG), in contrast to other (more application-oriented) initiatives ( German Federal Ministry of E & R)

Establish collaborative, interdisciplinary co-design of HPC applications and HPC methods

Focus on six research directions: Computational algorithms Application software Programming System software Data management and exploration Software tools

SPPEXA – Implementation

Two three-year funding phases

Overall budget of 3,7 M per year

Funded via DFG’s strategy fund

Interdisciplinary consortia of 3–5 groups

Consortia address at least two of SPPExa’s six research directions

Two-stage application process with (1) sketches and (2) full proposals

Global strategic coordination, following the established procedures of Collaborative Research Centres (SFB)

Close collaboration with respective inter-national programmes intended

w w w . s p p e x a . d e

SPPEXA – Chronology

2006: discussion in the German Research Foundation (DFG) on the necessity of a funding initiative for HPC software

2010: initiative out of German HPC commu-nity, referring to increasing activities on HPC software elsewhere (USA: NSF, DOE; Japan; China; G8)

2010: discussion with DFG’s Executive Committee, suggestion of a fl exible, strategically initiated SPP

2011: submission of the proposal, inter-national reviewing, and formal acceptance

2012: Review of project sketches and fullproposals

SPPEXA – Current Status

68 sketches handed in, overall volume of 19 M per year applied for

80 different universities, institutes and companies represented by 240 national and 15 international PIs

24 sketches invited for full proposals 13 full proposals accepted for funding Launch of programme and projects in

January 2013

German Priority Programme 1648“Software for Exascale Computing”

Boris

Leh

ner f

ür H

LRS

Image: Heiner Igel

Image: Wolfgang E. Nagel

FFMK: A Fast and Fault-Tolerant Microkernel-Based Operating System for Exascale Computing

Page 10: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 10

CHALLENGES

Dynamic applications& platforms

Increased fault rates

Power / Dark silicon

Heterogeneity (cores, memory, … )

FFMK-OS FFMK-OS FFMK-OS

FFMK-OS FFMK-OS FFMK-OS

FFMK-OS FFMK-OS FFMK-OS

FFMK-OS FFMK-OS FFMK-OS

Page 11: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 11

NODE ARCHITECTURE

Service OS

Runtime

ApplicationApplicationApplication

Light-weight Kernel (L4)

Service coresCompute cores

Monitoring

Platform Management

Runtime Support

Linux (drivers, etc.)

Page 12: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 12

3 ABSTRACTIONS

Address space

Address

space

Light-weight Kernel (L4)

ThreadsThreads

Page 13: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 13

MESSAGE PASSING

App App

Light-weight Kernel (L4)

File System

Network I/O

Device Interrupt

Page 14: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 14

BLOCK, POLL, IRET

• Intel Core i7 3770S @ 3100 MHz • No Hyperthreading, no Turboboost • No dynamic power management • Same socket

Block (Linux)Block (L4)

Polling

0 µs 1 µs 2 µs 3 µs 4 µs 5 µs 6 µs

0,1

3,2

5,3

• 64-bit Linux 3.11.6 (cpuidle.off=0, intel_idle.max_cstate=0)

• 64-bit L4/Fiasco.OC

Wake from interrupt on L4: 900 cycles, 0.3 µs

(best case, on Intel Core i7-4770 CPU @ 3.40GHz)

Page 15: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 15

LINUX ON L4

Linux OS

Linux App

File Network I/O

Light-weight Kernel (L4)

Page 16: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 16

HYBRID SYSTEM

Real-time

Security: small Trusted Computing Base

Resilience: small Reliable Computing Base

uncritical complex

critical simple

Page 17: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 17

HYBRID SYSTEM

uncritical complex

critical simple

File Network I/O

Service OS

Light-weight Kernel (L4)

Page 18: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

Service OS

A Microkernel-Based OS for Exascale Computing 18

HYBRID SYSTEM

Service Proxy

File Network I/O

MPI App

MPI Lib

Infiniband

Light-weight Kernel (L4)

Page 19: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 19

DRIVER REUSE

L4 App

libibverbsUser-space Driver

Linux Kernel

IB Core

Kernel Driver

/dev/ib0I/O

Proxy App

Light-weight Kernel (L4)

Msg Buffer 1

Msg Buffer 2

Msg Buffer 1

Msg Buffer 2

I/O

Page 20: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 20

NODE ARCHITECTURE

Service OS

Runtime ApplicationApplication

Runtime Linux Kernel

ApplicationApplicationApplication

Light-weight Kernel (L4)

ProxiesMPI Library

MPI

Infiniband Infiniband

Service coresCompute cores

Chkpt.

Checkpoint

Page 21: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 21

FAILURE RATES

Kento Sato et al., „Design and Modeling of a Non-blocking Checkpointing System“, SC’12

MTBF for Component Failures in an HPC System

Failed Component # of Nodes Affected MTBF

PFS, core switch 1408 65.10 days

Rack 32 86.90 days

Edge switch 16 17.37 days

PSU 4 28.94 days

Compute nodes 1 15.8 hours

Page 22: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing

In-memory XtreemFS volume for application-level checkpointing

RAID-5 erasure coding: recovery with 1 failed OSD

Demonstrator running BQCD code on a Cray XC30

22

XTREEMFS

DIR

MRC

OSD OSD OSD OSD

Application (BQCD)

Page 23: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 23

BANDWIDTH

1

10

100

1000

10000

100000

16

32

64

128

256

512

1024

2048

4096

8192

GB/s

Processes

memcpy CRUISE ramdisk

(a) Zin Cluster (Linux)

0.1

1

10

100

1000

10000

1K 2K 4K 8K 16K 32K 64K 96K

TB/s

Nodes

memcpy 64ppn CRUISE 32ppn CRUISE 64ppn CRUISE 16ppn ramdisk 16ppn

(b) Sequoia Cluster (IBM Blue Gene/Q)

Figure 8: Aggregate Bandwidth Scalability of CRUISE

7.4 Large-Scale EvaluationCRUISE is designed to be used with large-scale clusters that

span thousands of compute nodes. We evaluated the scalingcapacity of this framework, and we show the results in Fig-ure 8. We conducted these evaluations on Zin and Sequoia.For each of these clusters, we measured the throughput ofCRUISE with increasing number of processes. In these experi-ments, we configured CRUISE to allocate a persistent memoryblock per process. On Zin, each process writes a 128MB file;on Sequoia, each writes a 50MB file. We compare CRUISE toRAM disk and a memory-to-memory copy of data withina process’ address space using memcpy ). Since CRUISE re-quires at least one memory copy to move data from theapplication bu er to its in-memory file storage, the memcpy

performance represents an upper-bound on throughput.On Zin (Figure 8(a)), the number of processes writing to

CRUISE was increased by a factor of two up to 8,192 processesalong the x-axis. The y-axis shows the bandwidth(GB/s) inlog-scale. As the graphs indicate, a perfect-linear scalingcan be observed on this cluster. Furthermore, CRUISE takescomplete advantage of the memory system’s bandwidth (theCRUISE plot overlaps the memcpy plot). The throughput ofCRUISE at 8,192 processes is 17.6 TB/s, which is only slightlybelow the memcpy throughput of 17.7 TB/s. The through-put of RAM disk is nearly half that of CRUISE at 9.87 TB/s.These runs used 17.5% of the available compute nodes. Ex-trapolation of this linear scaling to the full 46,656 processeswould lead to a throughput for CRUISE of over 100 TB/s.

Figure 8(b) shows the scaling trends on Sequoia. BecauseSequoia is capable of 4-way simultaneous multi-threading,a total of 6,291,456 parallel tasks can be executed. The x-axis provides the node-count for each data point, and the y-axis shows the bandwidth(TB/s) in log-scale. For clarity, weonly show the configurations that deliver the best results foraligned memcpy and RAM Disk. We show the results whenusing 16, 32, and 64 processes per node for CRUISE. At thefull-system scale of 6 million processes (64 processes/node),the aggregate aligned memcpy bandwidth reaches 1.21 PB/s.As observed in Figure 7(b), CRUISE nearly saturates thisbandwidth to deliver a throughput of 1.16 PB/s when run-ning with 32 processes per node. This is 20x faster than the

system RAM disk, which provides a maximum throughputof 58.9 TB/s, and it is 1000x faster than the 1 TB/s parallelfile system provided for the system.

8. REL TEDWORKLinker support to intercept library calls has been around

for a while. Darshan [8] intercepts an HPC application’scalls to the file system using linker support to profile andcharacterize the application’s I/O behavior. Similarly, fake-chroot [2] intercepts chroot ) and open ) calls to emulatetheir functionality without privileged access to the system.

Other researchers have investigated saving files in memoryfor performance. The MemFS project from Hewlett Packard[19] dynamically allocates memory to hold files. However,there is no persistence of the files after a process dies andMemFS requires kernel support. McKusick et al. presentan in-memory file system [16]. This e ort also requires ker-nel support, and it requires copies from kernel bu ers toapplication bu ers which would cause high overhead.

MEMFS is a general purpose, distributed file system im-plemented across compute nodes on HPC systems [27]. Un-like our approach, they do not optimize for the predominantform of I/O on these systems, checkpointing. Another gen-eral purpose file system for HPC is based on a concept calledcontainers which reside in memory [15]. While this workdoes consider optimizations for checkpointing, its focus ison asynchronous movement of data from compute nodes toother storage devices in the storage hierarchy of HPC sys-tems. Our work primarily di ers from these in that CRUISE isa file system optimized for fast node-local checkpointing.

Several e orts investigated checkpointing to memory in amanner similar to that of SCR [7, 9, 13, 23, 29, 30]. Theyuse redundancy schemes with erasure encoding for higherresilience. These works di er from ours in that they usesystem-provided in-memory or node-local file systems, suchas RAM disk, to store checkpoints. Rebound checkpoints tovolatile memory but focuses on single many-core nodes andoptimizes for highly-threaded applications [6].

Raghunath Rajachandrasekar et al., A 1 PB/s File System to Checkpoint Three Million MPI Tasks, HPDC’13

Page 24: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 24

NODE ARCHITECTURE

Service OS

Runtime ApplicationApplication

Runtime Linux Kernel

ApplicationApplicationApplication

Light-weight Kernel (L4)

ProxiesMPI Library

MPI

Platform Management

Infiniband Infiniband

Service coresCompute cores

Chkpt.

Checkpoint

Page 25: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 25

OVERDECOMPOSITION

processing units

time

Barrier

Page 26: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 26

OVERDECOMPOSITION

Barrier

processing units

time

Page 27: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 27

OVERDECOMPOSITION

processing units

time

Barrier

Page 28: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 28

OVERDECOMPOSITION

Barrier

processing units

time

Page 29: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 29

OVERDECOMPOSITION

0 s

200 s

400 s

600 s

800 s

1000 s

Baseline 2x 4x 8x 16x

COSMO-SPECS+FD4 (unbalanced, HT)

Oversubscribed runs: 32—256 MPI ranks on 4 quad-core nodes (w/ polling)

Page 30: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 30

BUSY WAITING

Busy waiting = Computation

Page 31: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 31

OVERDECOMPOSITION

0 s

200 s

400 s

600 s

800 s

1000 s

Baseline 2x 4x 8x 16x

COSMO-SPECS+FD4 (unbalanced, HT)

Polling (b

usy waiting)

Oversubscribed runs: 32—256 MPI ranks on 4 quad-core nodes (w/ polling)

Page 32: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 32

OVERDECOMPOSITION

0 s

50 s

100 s

150 s

200 s

250 s

300 s

Baseline 2x 4x 8x 16x

COSMO-SPECS+FD4 (balanced, no HT)COSMO-SPECS+FD4 (unbalanced, no HT)

Oversubscribed runs: 16—256 MPI ranks on 4 quad-core nodes (w/o polling)

Page 33: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 33

OVERDECOMPOSITION

0 s50 s

100 s150 s200 s250 s300 s350 s

Baseline 2x 4x 8x 16x

CP2K (unbalanced, no HT)

Oversubscribed runs: 16—256 MPI ranks on 4 quad-core nodes (w/o polling)

Page 34: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 34

OVERDECOMPOSITION

Barrier

processing units

time

With MPI: • Do not: busy wait (except very shortly) • Do: Block in kernel • Needs: fast unblocking of threads, when message comes in • We build: shortcut from IB driver into MPI threads (no Linux!)

Page 35: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 35

CENTRALIZED

…….

..

..

..

..

..

..

..

..

…….

Page 36: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 36

DECENTRALIZED

Node 1

Node 2

Node n

…….

..

..

..

..

..

..

..

..

…….

Page 37: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 37

DECENTRALIZED

…….

..

..

..

..

..

..

..

..

…….

Node 1

Node 2

Node n

Page 38: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 38

DECENTRALIZED

…….

..

..

..

..

..

..

..

..

…….

Node 1

Node 2

Node n

Page 39: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 39

GOSSIP SCALABILITY

0 250 500 750 1000

Without Gossip

Interval = 1024 ms

Interval = 256 ms

Interval = 64 ms

Interval = 16 ms

Interval = 8 ms

Interval = 4 ms

Interval = 2 ms

1024 Nodes

994 GB/s

990 GB/s

993 GB/s

980 GB/s

990 GB/s

984 GB/s

971 GB/s

973 GB/s

0 400 800 1200 1600

2048 Nodes

1646 GB/s

1654 GB/s

1653 GB/s

1637 GB/s

1623 GB/s

1585 GB/s

1555 GB/s

0 800 1600 2400 3200

4096 Nodes

3217 GB/s

3235 GB/s

3212 GB/s

3210 GB/s

3106 GB/s

3060 GB/s

3064 GB/s

0 1400 2800 4200 5600

8192 Nodes

5652 GB/s

5638 GB/s

5641 GB/s

5594 GB/s

5417 GB/s

5395 GB/s

Figure 3: PTRANS performance in GB/s (higher is better).

0 10 20

Without Gossip

Interval = 1024 ms

Interval = 256 ms

Interval = 64 ms

Interval = 16 ms

Interval = 8 ms

Interval = 4 ms

Interval = 2 ms

1024 Nodes

19.0 s

19.0 s

19.0 s

19.0 s

19.1 s

19.2 s

19.5 s

20.0 s

12.2 s

12.2 s

12.2 s

12.2 s

12.3 s

12.4 s

12.7 s

13.2 s

0 20 40 60

2048 Nodes

50.2 s

50.2 s

50.4 s

50.5 s

51.1 s

52.0 s

54.0 s

36.4 s

36.4 s

36.5 s

36.6 s

37.2 s

38.1 s

40.0 s

0 20 40 60

4096 Nodes

40.0 s

40.0 s

40.2 s

40.7 s

42.6 s

45.3 s

32.9 s

32.9 s

33.1 s

33.6 s

35.5 s

38.1 s

0 20 40 60

8192 Nodes

27.8 s

28.0 s

28.5 s

29.9 s

35.2 s

42.2 s

24.0 s

24.2 s

24.7 s

26.1 s

31.4 s

38.4 s

Figure 4: MPI-FFT runtime (lower is better). Inner red part indicates the MPI portion.

0 10 20 30 40 50

Without Gossip

Interval = 1024 ms

Interval = 256 ms

Interval = 64 ms

Interval = 16 ms

Interval = 8 ms

Interval = 4 ms

Interval = 2 ms

Interval = 1 ms

1024 Nodes

40.6 s

40.6 s

40.6 s

40.6 s

40.6 s

40.6 s

40.7 s

40.8 s

41.1 s

8.2 s

8.2 s

8.2 s

8.2 s

8.2 s

8.2 s

8.3 s

8.4 s

8.6 s

0 10 20 30 40 50

2048 Nodes

36.7 s

36.7 s

36.7 s

36.7 s

36.8 s

36.9 s

37.2 s

38.0 s

4.3 s

4.2 s

4.2 s

4.2 s

4.3 s

4.4 s

4.7 s

5.5 s

0 10 20 30 40 50

4096 Nodes

36.7 s

36.7 s

36.7 s

36.8 s

37.3 s

37.6 s

38.7 s

4.3 s

4.3 s

4.4 s

4.4 s

4.8 s

5.1 s

6.1 s

0 10 20 30 40 50

8192 Nodes

37.3 s

37.3 s

37.3 s

37.5 s

37.9 s

38.2 s

4.8 s

4.8 s

4.9 s

5.0 s

5.3 s

5.6 s

Figure 5: COSMO-SPECS+FD4 runtime (lower is better). Inner red part indicates the MPI portion.

0 250 500 750 1000

Without Gossip

Interval = 1024 ms

Interval = 256 ms

Interval = 64 ms

Interval = 16 ms

Interval = 8 ms

Interval = 4 ms

Interval = 2 ms

1024 Nodes

994 GB/s

990 GB/s

993 GB/s

980 GB/s

990 GB/s

984 GB/s

971 GB/s

973 GB/s

0 400 800 1200 1600

2048 Nodes

1646 GB/s

1654 GB/s

1653 GB/s

1637 GB/s

1623 GB/s

1585 GB/s

1555 GB/s

0 800 1600 2400 3200

4096 Nodes

3217 GB/s

3235 GB/s

3212 GB/s

3210 GB/s

3106 GB/s

3060 GB/s

3064 GB/s

0 1400 2800 4200 5600

8192 Nodes

5652 GB/s

5638 GB/s

5641 GB/s

5594 GB/s

5417 GB/s

5395 GB/s

Figure 3: PTRANS performance in GB/s (higher is better).

0 10 20

Without Gossip

Interval = 1024 ms

Interval = 256 ms

Interval = 64 ms

Interval = 16 ms

Interval = 8 ms

Interval = 4 ms

Interval = 2 ms

1024 Nodes

19.0 s

19.0 s

19.0 s

19.0 s

19.1 s

19.2 s

19.5 s

20.0 s

12.2 s

12.2 s

12.2 s

12.2 s

12.3 s

12.4 s

12.7 s

13.2 s

0 20 40 60

2048 Nodes

50.2 s

50.2 s

50.4 s

50.5 s

51.1 s

52.0 s

54.0 s

36.4 s

36.4 s

36.5 s

36.6 s

37.2 s

38.1 s

40.0 s

0 20 40 60

4096 Nodes

40.0 s

40.0 s

40.2 s

40.7 s

42.6 s

45.3 s

32.9 s

32.9 s

33.1 s

33.6 s

35.5 s

38.1 s

0 20 40 60

8192 Nodes

27.8 s

28.0 s

28.5 s

29.9 s

35.2 s

42.2 s

24.0 s

24.2 s

24.7 s

26.1 s

31.4 s

38.4 s

Figure 4: MPI-FFT runtime (lower is better). Inner red part indicates the MPI portion.

0 10 20 30 40 50

Without Gossip

Interval = 1024 ms

Interval = 256 ms

Interval = 64 ms

Interval = 16 ms

Interval = 8 ms

Interval = 4 ms

Interval = 2 ms

Interval = 1 ms

1024 Nodes

40.6 s

40.6 s

40.6 s

40.6 s

40.6 s

40.6 s

40.7 s

40.8 s

41.1 s

8.2 s

8.2 s

8.2 s

8.2 s

8.2 s

8.2 s

8.3 s

8.4 s

8.6 s

0 10 20 30 40 50

2048 Nodes

36.7 s

36.7 s

36.7 s

36.7 s

36.8 s

36.9 s

37.2 s

38.0 s

4.3 s

4.2 s

4.2 s

4.2 s

4.3 s

4.4 s

4.7 s

5.5 s

0 10 20 30 40 50

4096 Nodes

36.7 s

36.7 s

36.7 s

36.8 s

37.3 s

37.6 s

38.7 s

4.3 s

4.3 s

4.4 s

4.4 s

4.8 s

5.1 s

6.1 s

0 10 20 30 40 50

8192 Nodes

37.3 s

37.3 s

37.3 s

37.5 s

37.9 s

38.2 s

4.8 s

4.8 s

4.9 s

5.0 s

5.3 s

5.6 s

Figure 5: COSMO-SPECS+FD4 runtime (lower is better). Inner red part indicates the MPI portion.

MPI-FFT running on BG/Q “JUQUEEN”

Low overhead: No noticeable overhead at gossip interval of 64–256 ms

0 250 500 750 1000

Without Gossip

Interval = 1024 ms

Interval = 256 ms

Interval = 64 ms

Interval = 16 ms

Interval = 8 ms

Interval = 4 ms

Interval = 2 ms

1024 Nodes

994 GB/s

990 GB/s

993 GB/s

980 GB/s

990 GB/s

984 GB/s

971 GB/s

973 GB/s

0 400 800 1200 1600

2048 Nodes

1646 GB/s

1654 GB/s

1653 GB/s

1637 GB/s

1623 GB/s

1585 GB/s

1555 GB/s

0 800 1600 2400 3200

4096 Nodes

3217 GB/s

3235 GB/s

3212 GB/s

3210 GB/s

3106 GB/s

3060 GB/s

3064 GB/s

0 1400 2800 4200 5600

8192 Nodes

5652 GB/s

5638 GB/s

5641 GB/s

5594 GB/s

5417 GB/s

5395 GB/s

Figure 3: PTRANS performance in GB/s (higher is better).

0 10 20

Without Gossip

Interval = 1024 ms

Interval = 256 ms

Interval = 64 ms

Interval = 16 ms

Interval = 8 ms

Interval = 4 ms

Interval = 2 ms

1024 Nodes

19.0 s

19.0 s

19.0 s

19.0 s

19.1 s

19.2 s

19.5 s

20.0 s

12.2 s

12.2 s

12.2 s

12.2 s

12.3 s

12.4 s

12.7 s

13.2 s

0 20 40 60

2048 Nodes

50.2 s

50.2 s

50.4 s

50.5 s

51.1 s

52.0 s

54.0 s

36.4 s

36.4 s

36.5 s

36.6 s

37.2 s

38.1 s

40.0 s

0 20 40 60

4096 Nodes

40.0 s

40.0 s

40.2 s

40.7 s

42.6 s

45.3 s

32.9 s

32.9 s

33.1 s

33.6 s

35.5 s

38.1 s

0 20 40 60

8192 Nodes

27.8 s

28.0 s

28.5 s

29.9 s

35.2 s

42.2 s

24.0 s

24.2 s

24.7 s

26.1 s

31.4 s

38.4 s

Figure 4: MPI-FFT runtime (lower is better). Inner red part indicates the MPI portion.

0 10 20 30 40 50

Without Gossip

Interval = 1024 ms

Interval = 256 ms

Interval = 64 ms

Interval = 16 ms

Interval = 8 ms

Interval = 4 ms

Interval = 2 ms

Interval = 1 ms

1024 Nodes

40.6 s

40.6 s

40.6 s

40.6 s

40.6 s

40.6 s

40.7 s

40.8 s

41.1 s

8.2 s

8.2 s

8.2 s

8.2 s

8.2 s

8.2 s

8.3 s

8.4 s

8.6 s

0 10 20 30 40 50

2048 Nodes

36.7 s

36.7 s

36.7 s

36.7 s

36.8 s

36.9 s

37.2 s

38.0 s

4.3 s

4.2 s

4.2 s

4.2 s

4.3 s

4.4 s

4.7 s

5.5 s

0 10 20 30 40 50

4096 Nodes

36.7 s

36.7 s

36.7 s

36.8 s

37.3 s

37.6 s

38.7 s

4.3 s

4.3 s

4.4 s

4.4 s

4.8 s

5.1 s

6.1 s

0 10 20 30 40 50

8192 Nodes

37.3 s

37.3 s

37.3 s

37.5 s

37.9 s

38.2 s

4.8 s

4.8 s

4.9 s

5.0 s

5.3 s

5.6 s

Figure 5: COSMO-SPECS+FD4 runtime (lower is better). Inner red part indicates the MPI portion.

0 250 500 750 1000

Without Gossip

Interval = 1024 ms

Interval = 256 ms

Interval = 64 ms

Interval = 16 ms

Interval = 8 ms

Interval = 4 ms

Interval = 2 ms

1024 Nodes

994 GB/s

990 GB/s

993 GB/s

980 GB/s

990 GB/s

984 GB/s

971 GB/s

973 GB/s

0 400 800 1200 1600

2048 Nodes

1646 GB/s

1654 GB/s

1653 GB/s

1637 GB/s

1623 GB/s

1585 GB/s

1555 GB/s

0 800 1600 2400 3200

4096 Nodes

3217 GB/s

3235 GB/s

3212 GB/s

3210 GB/s

3106 GB/s

3060 GB/s

3064 GB/s

0 1400 2800 4200 5600

8192 Nodes

5652 GB/s

5638 GB/s

5641 GB/s

5594 GB/s

5417 GB/s

5395 GB/s

Figure 3: PTRANS performance in GB/s (higher is better).

0 10 20

Without Gossip

Interval = 1024 ms

Interval = 256 ms

Interval = 64 ms

Interval = 16 ms

Interval = 8 ms

Interval = 4 ms

Interval = 2 ms

1024 Nodes

19.0 s

19.0 s

19.0 s

19.0 s

19.1 s

19.2 s

19.5 s

20.0 s

12.2 s

12.2 s

12.2 s

12.2 s

12.3 s

12.4 s

12.7 s

13.2 s

0 20 40 60

2048 Nodes

50.2 s

50.2 s

50.4 s

50.5 s

51.1 s

52.0 s

54.0 s

36.4 s

36.4 s

36.5 s

36.6 s

37.2 s

38.1 s

40.0 s

0 20 40 60

4096 Nodes

40.0 s

40.0 s

40.2 s

40.7 s

42.6 s

45.3 s

32.9 s

32.9 s

33.1 s

33.6 s

35.5 s

38.1 s

0 20 40 60

8192 Nodes

27.8 s

28.0 s

28.5 s

29.9 s

35.2 s

42.2 s

24.0 s

24.2 s

24.7 s

26.1 s

31.4 s

38.4 s

Figure 4: MPI-FFT runtime (lower is better). Inner red part indicates the MPI portion.

0 10 20 30 40 50

Without Gossip

Interval = 1024 ms

Interval = 256 ms

Interval = 64 ms

Interval = 16 ms

Interval = 8 ms

Interval = 4 ms

Interval = 2 ms

Interval = 1 ms

1024 Nodes

40.6 s

40.6 s

40.6 s

40.6 s

40.6 s

40.6 s

40.7 s

40.8 s

41.1 s

8.2 s

8.2 s

8.2 s

8.2 s

8.2 s

8.2 s

8.3 s

8.4 s

8.6 s

0 10 20 30 40 50

2048 Nodes

36.7 s

36.7 s

36.7 s

36.7 s

36.8 s

36.9 s

37.2 s

38.0 s

4.3 s

4.2 s

4.2 s

4.2 s

4.3 s

4.4 s

4.7 s

5.5 s

0 10 20 30 40 50

4096 Nodes

36.7 s

36.7 s

36.7 s

36.8 s

37.3 s

37.6 s

38.7 s

4.3 s

4.3 s

4.4 s

4.4 s

4.8 s

5.1 s

6.1 s

0 10 20 30 40 50

8192 Nodes

37.3 s

37.3 s

37.3 s

37.5 s

37.9 s

38.2 s

4.8 s

4.8 s

4.9 s

5.0 s

5.3 s

5.6 s

Figure 5: COSMO-SPECS+FD4 runtime (lower is better). Inner red part indicates the MPI portion.COSMO-SPECS+FD4 on BG/Q “JUQUEEN“

Quality of information: Average age at nodes in the order of 2–3 s with gossip interval of 256 ms

E. Levy, A. Barak, A. Shiloh, M. Lieber, C. Weinhold, and H. Härtig, „Overhead of a Decentralized Gossip Algorithm on the Performance of HPC Applications“, ROSS 2014

Page 40: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 40

GOSSIP VS FAULTSTable 6: Simulations of the average master’s age with failed nodes.

Number of Circulating localfailed nodes windows of sizeper colony 16 32 64 128 256

0 11.74 9.67 8.66 8.20 8.071 11.71 9.72 8.67 8.21 8.072 11.75 9.68 8.70 8.21 8.084 11.81 9.73 8.70 8.23 8.118 11.83 9.79 8.72 8.28 8.1716 11.95 9.90 8.79 8.34 8.2032 12.12 10.05 8.96 8.48 8.36

Standard deviation 0.49 0.42 0.37 0.36 0.36Increase rate 3.2% 3.9% 3.5% 3.4% 3.6%

5.2.2 Using multiple masters

For increased resilience, the algorithm must continue to work even when the master fails. We

therefore extended the push algorithm to have several masters, where colony nodes send their

global windows to a randomly selected master. In addition, the masters exchange their latest

information, whereby each master sends one master window to a randomly selected master each

unit of time (not necessarily the same as the unit of time of the colonies). The master window

contains a fixed number of the youngest entries per colony.

Given a cluster with m masters and assuming that each colony node sends a global window with

probability kn , on average each master receives k

m global windows from each colony each unit of time.

The analysis performed for a single master can now be used by multiplying k in the single-master

analysis by m, the number of masters.

We verified the feasibility of the above extensions by running simulations with various sets of

parameters and found that the exchange of information between the masters significantly improves

the average master-age. Both the size and the frequency of sending master windows contributed

to this improvement.

Table 7 shows the simulation results of the average master-age with different numbers of masters

and master-failures. For this table, the following parameters were selected: colony size n = 1, 024

nodes; local windows of 64 entries; global windows of quarter-colony size, i.e. 256 entries and master

windows of the same size per colony; update rate k = m; and the masters’ unit of time is identical

to that of the colonies. Each result in Table 7 shows the average of 500 simulations.

The columns of Table 7 show that the penalty when masters fail is not too high: for example

only 17.1% when 4 out of 8 masters fail. This demonstrates the resilience of the extended algorithm.

The main diagonal, which presents the average master-age when only one node is active, is almost

fixed for any number of nodes.

An interesting observation from Table 7 is that the average master-age always decreases as the

number of masters increase. We would expect the average master-age to drastically increase as

21

Average age at master (1024 nodes per colony)Gossip is fault tolerant: Only slight increase in average age when substantial number of nodes fail (up to 32 of 1024 in each colony)

A. Barak, Z. Drezner, E. Levy, M. Lieber, and A. Shiloh, „Resilient gossip algorithms for collecting online management information in exascale clusters“, Concurrency and Computation: Practice and Experience, 2015

Page 41: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 41

DECENTRALIZED

…….

..

..

..

..

..

..

..

..

…….

Node 1

Node 2

Node n

Page 42: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

When MOSIX: Load difference discovered

Where MOSIX: Memory, cycles, comm

Which MOSIX: Past predicts future

A Microkernel-Based OS for Exascale Computing 42

DECENTRALIZED

Node 1

Node 2

Node n

When

+ Anomaly anticipated

Where + Topology

Which + Application knowledge

Page 43: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 43

NODE ARCHITECTURE

Service OS

Runtime ApplicationApplication

Runtime Linux Kernel

ApplicationApplicationApplication

Light-weight Kernel (L4)

ProxiesMPI Library

MPI

Platform Management

Decision Making

Gossip

Infiniband InfinibandMonitor Mon

itor

Service coresCompute cores

Chkpt.

Checkpoint

Page 44: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 44

MANAGEMENT

Decision Making

Hardware monitoring

MigrationPlatform info (Gossip)

Resource Prediction

Application hints

Past behavior

Future behavior?

Page 45: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 45

APPLICATION HINTS

Resource prediction: CPU cycles Cache misses Memory Energy?

Page 46: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 46

THERMAL PLANNING

processing units

time

Barrier

Page 47: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 47

THERMAL PLANNING

Barrier

processing units

time

Page 48: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 48

THERMAL PLANNING

Barrier

processing units

time

Page 49: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing

Resource prediction: CPU cycles Cache misses Memory Energy?

Fault tolerance: Which data to checkpoint (and when) Ability to handle node failures

49

APPLICATION HINTS

Page 50: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing 50

NODE ARCHITECTURE

Service OS

Runtime ApplicationApplication

Runtime Linux Kernel

ApplicationApplicationApplication

Light-weight Kernel (L4)

ProxiesMPI Library

MPI

Platform Management

Decision Making

Gossip

Infiniband InfinibandMonitor Mon

itor

Service coresCompute cores

Chkpt.

Checkpoint

Page 51: A MICROKERNEL-BASED OPERATING SYSTEM FOR ...CARSTEN WEINHOLD, TU DRESDEN A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING The Hebrew University of Jerusalem Amnon Barak

The Hebrew University of Jerusalem

A Microkernel-Based OS for Exascale Computing

FFMK prototype runs on real HPC system

We build OS/R for Exascale: Load management Fault tolerance

Take burden from application developers

51

SUMMARY

ffmk.tudos.org