CARSTEN WEINHOLD, TU DRESDEN
A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING
Amnon Barak, Hebrew University of Jerusalem (HUJI)
Hermann Härtig, TU Dresden, Operating Systems Group (TUDOS)
Wolfgang Nagel, TU Dresden, Center for Information Services and HPC (ZIH)
Alexander Reinefeld, Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)
TRADITIONAL HPC
The ideal world assumption:
■ Identical, predictable, and reliable nodes
■ Fast and reliable interconnect
■ Balanced applications
■ Isolated partitions of fixed size
TRADITIONAL HPC
Fixed-size chunks of work
One thread per core
TRADITIONAL HPC
Systems software: optimize communication latency
Message passing uses polling
Batch scheduler for start / stop
Separate servers for I/O
Small OS on each node
No OS on critical path
LOAD
Application: CP2K
(Figure: trace of process IDs over time steps, and computation-time fraction per process over time steps)
Computation–communication ratio of CP2K on 512 cores
LOAD
Application: COSMO-SPECS+FD4
(Figure: computation-time fraction per process over time steps)
Computation–communication ratio of COSMO-SPECS+FD4 on 128 cores
REALITY CHECK
Application: COSMO-SPECS+FD4
Hand-balanced compute times of ranks per time step
Unbalanced compute times of ranks per time step
Computation–communication ratio of COSMO-SPECS+FD4 on 128 cores
REALITY CHECK
Application: PRIME
Unbalanced compute times of ranks per time step
Now think of: • Composite applications • In-situ visualization, etc.
FFMK
SPPEXA – Findings & Goals
Massive parallelism (on- and cross-chip) requires fundamentally new concepts
Not “racks without brains”, but software is the key to this paradigm shift
Fundamental research (DFG), in contrast to other, more application-oriented initiatives (German Federal Ministry of Education and Research)
Establish collaborative, interdisciplinary co-design of HPC applications and HPC methods
Focus on six research directions: computational algorithms; application software; programming; system software; data management and exploration; software tools
SPPEXA – Implementation
Two three-year funding phases
Overall budget of €3.7 M per year
Funded via DFG’s strategy fund
Interdisciplinary consortia of 3–5 groups
Consortia address at least two of SPPEXA's six research directions
Two-stage application process with (1) sketches and (2) full proposals
Global strategic coordination, following the established procedures of Collaborative Research Centres (SFB)
Close collaboration with respective international programmes intended
www.sppexa.de
SPPEXA – Chronology
2006: discussion in the German Research Foundation (DFG) on the necessity of a funding initiative for HPC software
2010: initiative out of the German HPC community, referring to increasing activities on HPC software elsewhere (USA: NSF, DOE; Japan; China; G8)
2010: discussion with DFG's Executive Committee, suggestion of a flexible, strategically initiated SPP
2011: submission of the proposal, international reviewing, and formal acceptance
2012: review of project sketches and full proposals
SPPEXA – Current Status
68 sketches handed in, overall volume of €19 M per year applied for
80 different universities, institutes and companies represented by 240 national and 15 international PIs
24 sketches invited for full proposals
13 full proposals accepted for funding
Launch of programme and projects in January 2013
German Priority Programme 1648 "Software for Exascale Computing"
Image: Boris Lehner für HLRS
Image: Heiner Igel
Image: Wolfgang E. Nagel
FFMK: A Fast and Fault-Tolerant Microkernel-Based Operating System for Exascale Computing
CHALLENGES
Dynamic applications & platforms
Increased fault rates
Power / Dark silicon
Heterogeneity (cores, memory, … )
(Diagram: grid of nodes, each running FFMK-OS)
NODE ARCHITECTURE
(Diagram: Applications, Runtime Support, and Monitoring on the compute cores; Service OS (Linux: drivers, etc.), Runtime, and Platform Management on the service cores; Light-weight Kernel (L4) underneath)
3 ABSTRACTIONS
Address spaces
Threads
IPC (message passing)
Light-weight Kernel (L4)
MESSAGE PASSING
App App
Light-weight Kernel (L4)
File System
Network I/O
Device Interrupt
BLOCK, POLL, IRET
• Intel Core i7 3770S @ 3100 MHz • No Hyperthreading, no Turboboost • No dynamic power management • Same socket
Measured wake-up latency (lower is better): Polling ≈ 0.1 µs · Block (L4) ≈ 3.2 µs · Block (Linux) ≈ 5.3 µs
• 64-bit Linux 3.11.6 (cpuidle.off=0, intel_idle.max_cstate=0)
• 64-bit L4/Fiasco.OC
Wake from interrupt on L4: 900 cycles, 0.3 µs
(best case, on Intel Core i7-4770 CPU @ 3.40GHz)
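To make the difference concrete, the following is a minimal sketch (plain POSIX threads, my illustration rather than the benchmark code behind the chart) of the two waiting styles being compared: a polling waiter spins on a flag and keeps its core busy, while a blocking waiter sleeps in the kernel until it is signalled.

/* block_vs_poll.c: contrast busy-wait polling with blocking on a condition
 * variable; illustrative sketch only, not the measurement code. */
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int flag;                         /* polled by the spinning waiter */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int ready;                               /* guarded by m */

static void *spin_waiter(void *arg)             /* "polling": core stays busy */
{
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                       /* lowest wake-up latency */
    return NULL;
}

static void *blocking_waiter(void *arg)         /* "block": core is freed */
{
    pthread_mutex_lock(&m);
    while (!ready)
        pthread_cond_wait(&c, &m);              /* sleeps in the kernel */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec pause = { 0, 1000000 };     /* 1 ms: let the waiter block */

    pthread_create(&t, NULL, spin_waiter, NULL);
    nanosleep(&pause, NULL);
    atomic_store_explicit(&flag, 1, memory_order_release);
    pthread_join(t, NULL);

    pthread_create(&t, NULL, blocking_waiter, NULL);
    nanosleep(&pause, NULL);
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&c);                    /* wake-up goes through the kernel */
    pthread_mutex_unlock(&m);
    pthread_join(t, NULL);
    return 0;
}

Blocking only pays off if the wake-up path is short, which is the point of the 0.3 µs interrupt wake-up figure for L4 above.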
LINUX ON L4
Linux OS
Linux App
File Network I/O
Light-weight Kernel (L4)
HYBRID SYSTEM
Real-time
Security: small Trusted Computing Base
Resilience: small Reliable Computing Base
uncritical, complex
critical, simple
HYBRID SYSTEM
uncritical, complex
critical, simple
File Network I/O
Service OS
Light-weight Kernel (L4)
HYBRID SYSTEM
Service OS
Service Proxy
File Network I/O
MPI App
MPI Lib
Infiniband
Light-weight Kernel (L4)
DRIVER REUSE
L4 App
libibverbs / User-space Driver
Linux Kernel
IB Core
Kernel Driver
/dev/ib0 / I/O
Proxy App
Light-weight Kernel (L4)
Msg Buffer 1
Msg Buffer 2
Msg Buffer 1
Msg Buffer 2
I/O
NODE ARCHITECTURE
(Diagram: Applications with MPI Library, Runtime, and Checkpointing on the compute cores; Service OS (Linux kernel) with Runtime, Proxies, Checkpoint, and InfiniBand access on the service cores; MPI traffic flows over InfiniBand; Light-weight Kernel (L4) underneath)
FAILURE RATES
Kento Sato et al., „Design and Modeling of a Non-blocking Checkpointing System“, SC’12
MTBF for Component Failures in an HPC System
Failed component     # of nodes affected   MTBF
PFS, core switch     1408                  65.10 days
Rack                 32                    86.90 days
Edge switch          16                    17.37 days
PSU                  4                     28.94 days
Compute nodes        1                     15.8 hours
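Assuming the listed failure sources are independent and each row already aggregates all instances of its component class, their failure rates simply add; a back-of-the-envelope combination (my arithmetic, not from the slide or the cited paper) shows that compute-node failures dominate:

\[
\frac{1}{\mathrm{MTBF}_{\mathrm{sys}}} = \sum_i \frac{1}{\mathrm{MTBF}_i}
\approx \frac{1}{65.10} + \frac{1}{86.90} + \frac{1}{17.37} + \frac{1}{28.94} + \frac{1}{15.8/24}
\approx 1.64\ \mathrm{day}^{-1}
\quad\Rightarrow\quad
\mathrm{MTBF}_{\mathrm{sys}} \approx 0.61\ \mathrm{days} \approx 14.7\ \mathrm{hours}
\]

With a whole-system MTBF in the range of hours, checkpoints have to be cheap enough to be taken frequently, which motivates the in-memory checkpointing on the next slide.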
In-memory XtreemFS volume for application-level checkpointing
RAID-5 erasure coding: recovery with 1 failed OSD
Demonstrator running BQCD code on a Cray XC30
XTREEMFS
(Diagram: Application (BQCD) writing checkpoints to an XtreemFS volume with one DIR and one MRC service and multiple OSDs)
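The RAID-5-style coding mentioned above survives one failed OSD because the parity stripe is the XOR of the data stripes. The sketch below shows only this principle in plain C (my illustration, not XtreemFS code; the real erasure coding works on striped objects spread across OSD servers).

/* xor_parity.c: RAID-5-style single-failure recovery via XOR parity. */
#include <stdio.h>
#include <string.h>

#define STRIPES 4          /* data stripes per parity group */
#define STRIPE_SIZE 8      /* bytes per stripe (tiny, for illustration) */

/* parity[j] = data[0][j] ^ data[1][j] ^ ... ^ data[STRIPES-1][j] */
static void encode(unsigned char data[STRIPES][STRIPE_SIZE],
                   unsigned char parity[STRIPE_SIZE])
{
    memset(parity, 0, STRIPE_SIZE);
    for (int i = 0; i < STRIPES; i++)
        for (int j = 0; j < STRIPE_SIZE; j++)
            parity[j] ^= data[i][j];
}

/* Rebuild the single lost stripe by XOR-ing the parity with the survivors. */
static void recover(unsigned char data[STRIPES][STRIPE_SIZE],
                    unsigned char parity[STRIPE_SIZE], int lost)
{
    memcpy(data[lost], parity, STRIPE_SIZE);
    for (int i = 0; i < STRIPES; i++)
        if (i != lost)
            for (int j = 0; j < STRIPE_SIZE; j++)
                data[lost][j] ^= data[i][j];
}

int main(void)
{
    unsigned char data[STRIPES][STRIPE_SIZE] = {
        "chunk-0", "chunk-1", "chunk-2", "chunk-3" };
    unsigned char parity[STRIPE_SIZE];

    encode(data, parity);
    memset(data[2], 0, STRIPE_SIZE);          /* simulate one failed OSD */
    recover(data, parity, 2);
    printf("recovered: %s\n", (char *)data[2]); /* prints "chunk-2" */
    return 0;
}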
BANDWIDTH
Figure 8: Aggregate Bandwidth Scalability of CRUISE. (a) Zin cluster (Linux): GB/s vs. number of processes, curves for memcpy, CRUISE, and ramdisk. (b) Sequoia cluster (IBM Blue Gene/Q): TB/s vs. number of nodes, curves for memcpy 64ppn, CRUISE 16/32/64ppn, and ramdisk 16ppn.
7.4 Large-Scale Evaluation
CRUISE is designed to be used with large-scale clusters that span thousands of compute nodes. We evaluated the scaling capacity of this framework, and we show the results in Figure 8. We conducted these evaluations on Zin and Sequoia. For each of these clusters, we measured the throughput of CRUISE with increasing number of processes. In these experiments, we configured CRUISE to allocate a persistent memory block per process. On Zin, each process writes a 128 MB file; on Sequoia, each writes a 50 MB file. We compare CRUISE to RAM disk and a memory-to-memory copy of data within a process' address space using memcpy(). Since CRUISE requires at least one memory copy to move data from the application buffer to its in-memory file storage, the memcpy performance represents an upper bound on throughput.

On Zin (Figure 8(a)), the number of processes writing to CRUISE was increased by a factor of two up to 8,192 processes along the x-axis. The y-axis shows the bandwidth (GB/s) in log scale. As the graphs indicate, perfect linear scaling can be observed on this cluster. Furthermore, CRUISE takes complete advantage of the memory system's bandwidth (the CRUISE plot overlaps the memcpy plot). The throughput of CRUISE at 8,192 processes is 17.6 TB/s, which is only slightly below the memcpy throughput of 17.7 TB/s. The throughput of RAM disk is nearly half that of CRUISE at 9.87 TB/s. These runs used 17.5% of the available compute nodes. Extrapolation of this linear scaling to the full 46,656 processes would lead to a throughput for CRUISE of over 100 TB/s.

Figure 8(b) shows the scaling trends on Sequoia. Because Sequoia is capable of 4-way simultaneous multi-threading, a total of 6,291,456 parallel tasks can be executed. The x-axis provides the node count for each data point, and the y-axis shows the bandwidth (TB/s) in log scale. For clarity, we only show the configurations that deliver the best results for aligned memcpy and RAM disk. We show the results when using 16, 32, and 64 processes per node for CRUISE. At the full-system scale of 6 million processes (64 processes/node), the aggregate aligned memcpy bandwidth reaches 1.21 PB/s. As observed in Figure 7(b), CRUISE nearly saturates this bandwidth to deliver a throughput of 1.16 PB/s when running with 32 processes per node. This is 20x faster than the system RAM disk, which provides a maximum throughput of 58.9 TB/s, and it is 1000x faster than the 1 TB/s parallel file system provided for the system.

8. Related Work
Linker support to intercept library calls has been around for a while. Darshan [8] intercepts an HPC application's calls to the file system using linker support to profile and characterize the application's I/O behavior. Similarly, fakechroot [2] intercepts chroot() and open() calls to emulate their functionality without privileged access to the system.

Other researchers have investigated saving files in memory for performance. The MemFS project from Hewlett Packard [19] dynamically allocates memory to hold files. However, there is no persistence of the files after a process dies and MemFS requires kernel support. McKusick et al. present an in-memory file system [16]. This effort also requires kernel support, and it requires copies from kernel buffers to application buffers which would cause high overhead.

MEMFS is a general-purpose, distributed file system implemented across compute nodes on HPC systems [27]. Unlike our approach, they do not optimize for the predominant form of I/O on these systems, checkpointing. Another general-purpose file system for HPC is based on a concept called containers which reside in memory [15]. While this work does consider optimizations for checkpointing, its focus is on asynchronous movement of data from compute nodes to other storage devices in the storage hierarchy of HPC systems. Our work primarily differs from these in that CRUISE is a file system optimized for fast node-local checkpointing.

Several efforts investigated checkpointing to memory in a manner similar to that of SCR [7, 9, 13, 23, 29, 30]. They use redundancy schemes with erasure encoding for higher resilience. These works differ from ours in that they use system-provided in-memory or node-local file systems, such as RAM disk, to store checkpoints. Rebound checkpoints to volatile memory but focuses on single many-core nodes and optimizes for highly-threaded applications [6].
Raghunath Rajachandrasekar et al., A 1 PB/s File System to Checkpoint Three Million MPI Tasks, HPDC’13
NODE ARCHITECTURE
(Diagram: Applications with MPI Library, Runtime, and Checkpointing on the compute cores; Service OS (Linux kernel) with Runtime, Proxies, Checkpoint, Platform Management, and InfiniBand access on the service cores; Light-weight Kernel (L4) underneath)
OVERDECOMPOSITION
(Diagram sequence: work chunks of unequal length on processing units over time, all meeting at a barrier; overdecomposition splits the work into more, smaller chunks per core so that idle time before the barrier shrinks)
OVERDECOMPOSITION
(Chart: total runtime of COSMO-SPECS+FD4 (unbalanced, HT) for Baseline, 2x, 4x, 8x, and 16x overdecomposition; y-axis 0–1000 s)
Oversubscribed runs: 32–256 MPI ranks on 4 quad-core nodes (w/ polling)
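An idealized model (my addition, ignoring scheduling and communication overhead) of why oversubscription can help: with one rank per core, the barrier is reached only when the slowest rank finishes, whereas with enough independent ranks per core the OS can keep every core busy until the total work is done,

\[
T_{\mathrm{1\,rank/core}} = \max_i t_i
\qquad \mathrm{vs.} \qquad
T_{\mathrm{oversubscribed}} \;\gtrsim\; \frac{1}{P} \sum_i t_i ,
\]

where t_i is the compute time of rank i and P the number of cores. The runs above still use polling, so waiting ranks keep occupying their cores and little of this potential is realized, which is the motivation for the runs without polling on the following slides.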
BUSY WAITING
Busy waiting = Computation
OVERDECOMPOSITION
(Chart: total runtime of COSMO-SPECS+FD4 (unbalanced, HT) for Baseline, 2x, 4x, 8x, and 16x overdecomposition, with the polling (busy waiting) share of each bar highlighted; y-axis 0–1000 s)
Oversubscribed runs: 32–256 MPI ranks on 4 quad-core nodes (w/ polling)
OVERDECOMPOSITION
(Chart: total runtime of COSMO-SPECS+FD4, balanced and unbalanced, no HT, for Baseline, 2x, 4x, 8x, and 16x overdecomposition; y-axis 0–300 s)
Oversubscribed runs: 16–256 MPI ranks on 4 quad-core nodes (w/o polling)
OVERDECOMPOSITION
(Chart: total runtime of CP2K (unbalanced, no HT) for Baseline, 2x, 4x, 8x, and 16x overdecomposition; y-axis 0–350 s)
Oversubscribed runs: 16–256 MPI ranks on 4 quad-core nodes (w/o polling)
OVERDECOMPOSITION
(Diagram: overdecomposed work on processing units over time, meeting at a barrier)
With MPI:
• Do not: busy wait (except very shortly)
• Do: block in the kernel
• Needs: fast unblocking of threads when a message comes in
• We build: a shortcut from the IB driver into MPI threads (no Linux!), see the sketch below
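For reference, this is how blocking instead of busy-waiting looks with stock libibverbs on Linux (a hedged sketch of the standard completion-channel mechanism, not FFMK's L4-side shortcut): the process arms the completion queue for notification and then sleeps in ibv_get_cq_event() until a completion arrives.

/* cq_wait.c: wait for InfiniBand completions without busy polling.
 * Sketch of the standard libibverbs completion-channel pattern; the FFMK
 * shortcut replaces the Linux-side wake-up path, but the idea is the same. */
#include <infiniband/verbs.h>

/* Busy-waiting variant: burns the core until a completion shows up. */
int wait_polling(struct ibv_cq *cq, struct ibv_wc *wc)
{
    int n;
    do {
        n = ibv_poll_cq(cq, 1, wc);       /* non-blocking poll */
    } while (n == 0);
    return n;                              /* <0 on error, 1 on success */
}

/* Blocking variant: the thread sleeps in the kernel until notified. */
int wait_blocking(struct ibv_comp_channel *ch, struct ibv_wc *wc)
{
    struct ibv_cq *cq;
    void *ctx;

    if (ibv_get_cq_event(ch, &cq, &ctx))   /* sleeps until an event arrives */
        return -1;
    ibv_ack_cq_events(cq, 1);
    ibv_req_notify_cq(cq, 0);              /* re-arm for the next event */
    return ibv_poll_cq(cq, 1, wc);         /* fetch the actual completion */
}

/* Setup (error handling omitted): create the channel, attach the CQ, arm it. */
struct ibv_cq *make_cq(struct ibv_context *ctx, struct ibv_comp_channel **ch)
{
    *ch = ibv_create_comp_channel(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, *ch, 0);
    ibv_req_notify_cq(cq, 0);              /* request the first notification */
    return cq;
}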
CENTRALIZED
(Diagram: a central instance collects status information from all nodes)
DECENTRALIZED
(Diagram sequence: Node 1, Node 2, …, Node n exchange information pairwise instead of reporting to a central instance)
GOSSIP SCALABILITY
(Figure 3: PTRANS performance in GB/s (higher is better); Figure 4: MPI-FFT runtime (lower is better); Figure 5: COSMO-SPECS+FD4 runtime (lower is better); each measured on 1024, 2048, 4096, and 8192 nodes, comparing runs without gossip against gossip intervals from 1–2 ms up to 1024 ms; the inner red part of the runtime bars indicates the MPI portion)
MPI-FFT running on BG/Q “JUQUEEN”
Low overhead: No noticeable overhead at gossip interval of 64–256 ms
COSMO-SPECS+FD4 on BG/Q “JUQUEEN“
Quality of information: Average age at nodes in the order of 2–3 s with gossip interval of 256 ms
E. Levy, A. Barak, A. Shiloh, M. Lieber, C. Weinhold, and H. Härtig, „Overhead of a Decentralized Gossip Algorithm on the Performance of HPC Applications“, ROSS 2014
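To make the decentralized monitoring concrete, here is a minimal sketch of one push-gossip round along the lines described in the cited papers (my simplification with hypothetical names; the real algorithm adds colonies, masters, and richer per-entry payloads): every node keeps a vector of per-node entries with an age, refreshes its own entry each round, and pushes a window of its youngest entries to one randomly chosen node, which keeps whichever copy is younger.

/* gossip_push.c: one round of a simplified push-gossip protocol for
 * decentralized cluster monitoring (illustrative sketch only). */
#include <stdlib.h>
#include <string.h>

#define N_NODES 1024          /* nodes participating in the gossip  */
#define WINDOW  64            /* youngest entries pushed per round  */

struct entry {
    int    node;              /* node this entry describes              */
    int    age;               /* rounds since that node produced it     */
    double load;              /* example payload: load of that node     */
};

struct node_state {
    int          self;
    struct entry view[N_NODES];   /* local view, indexed by node id     */
};

static int by_age(const void *a, const void *b)
{
    return ((const struct entry *)a)->age - ((const struct entry *)b)->age;
}

/* One round on node s: refresh the own entry, age all others, then push
 * the WINDOW youngest entries to one randomly chosen peer. The direct
 * write into the peer's memory stands in for a small network message. */
void gossip_round(struct node_state *s, double own_load,
                  struct node_state *peers[])
{
    struct entry tmp[N_NODES];

    for (int i = 0; i < N_NODES; i++)
        s->view[i].age++;
    s->view[s->self].age  = 0;          /* own information is always fresh */
    s->view[s->self].load = own_load;

    memcpy(tmp, s->view, sizeof(tmp));
    qsort(tmp, N_NODES, sizeof(struct entry), by_age);  /* youngest first */

    struct node_state *peer = peers[rand() % N_NODES];  /* random target */
    for (int i = 0; i < WINDOW; i++) {
        struct entry *known = &peer->view[tmp[i].node];
        if (tmp[i].age < known->age)    /* merge: keep the younger entry */
            *known = tmp[i];
    }
}

The gossip interval on the previous slide corresponds to how often gossip_round() runs; shorter intervals give fresher information at the cost of more background traffic.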
GOSSIP VS FAULTS
Table 6: Simulations of the average master's age with failed nodes.

Failed nodes         Circulating local windows of size
per colony           16      32      64      128     256
0                    11.74   9.67    8.66    8.20    8.07
1                    11.71   9.72    8.67    8.21    8.07
2                    11.75   9.68    8.70    8.21    8.08
4                    11.81   9.73    8.70    8.23    8.11
8                    11.83   9.79    8.72    8.28    8.17
16                   11.95   9.90    8.79    8.34    8.20
32                   12.12   10.05   8.96    8.48    8.36
Standard deviation   0.49    0.42    0.37    0.36    0.36
Increase rate        3.2%    3.9%    3.5%    3.4%    3.6%
5.2.2 Using multiple masters

For increased resilience, the algorithm must continue to work even when the master fails. We therefore extended the push algorithm to have several masters, where colony nodes send their global windows to a randomly selected master. In addition, the masters exchange their latest information, whereby each master sends one master window to a randomly selected master each unit of time (not necessarily the same as the unit of time of the colonies). The master window contains a fixed number of the youngest entries per colony.

Given a cluster with m masters and assuming that each colony node sends a global window with probability k/n, on average each master receives k/m global windows from each colony each unit of time. The analysis performed for a single master can now be used by multiplying k in the single-master analysis by m, the number of masters.

We verified the feasibility of the above extensions by running simulations with various sets of parameters and found that the exchange of information between the masters significantly improves the average master-age. Both the size and the frequency of sending master windows contributed to this improvement.

Table 7 shows the simulation results of the average master-age with different numbers of masters and master-failures. For this table, the following parameters were selected: colony size n = 1,024 nodes; local windows of 64 entries; global windows of quarter-colony size, i.e. 256 entries, and master windows of the same size per colony; update rate k = m; and the masters' unit of time is identical to that of the colonies. Each result in Table 7 shows the average of 500 simulations.

The columns of Table 7 show that the penalty when masters fail is not too high: for example only 17.1% when 4 out of 8 masters fail. This demonstrates the resilience of the extended algorithm. The main diagonal, which presents the average master-age when only one node is active, is almost fixed for any number of nodes.

An interesting observation from Table 7 is that the average master-age always decreases as the number of masters increase. We would expect the average master-age to drastically increase as […]
Average age at master (1024 nodes per colony)
Gossip is fault tolerant: only a slight increase in average age when a substantial number of nodes fail (up to 32 of 1024 in each colony)
A. Barak, Z. Drezner, E. Levy, M. Lieber, and A. Shiloh, „Resilient gossip algorithms for collecting online management information in exascale clusters“, Concurrency and Computation: Practice and Experience, 2015
DECENTRALIZED
(Diagram: Node 1, Node 2, …, Node n)
When (MOSIX): a load difference is discovered
Where (MOSIX): memory, cycles, communication
Which (MOSIX): past behavior predicts future behavior
DECENTRALIZED
(Diagram: Node 1, Node 2, …, Node n)
When: + anomaly anticipated
Where: + topology
Which: + application knowledge (see the sketch below)
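A minimal sketch of how such a decision step could look (hypothetical function and field names, loosely following the MOSIX-style criteria listed above; not FFMK's actual decision-making code):

/* decide_migration.c: toy "when / where / which" migration decision. */
#include <stddef.h>

struct node_info {            /* one gossip entry about a remote node */
    int    id;
    double load;              /* e.g. runnable threads per core       */
    double free_mem;          /* bytes                                */
    int    age;               /* gossip rounds since this was produced */
};

struct proc_info {            /* one migratable rank/process */
    int    id;
    double mem_needed;        /* bytes */
    double past_cpu_share;    /* "past predicts future" estimate      */
};

/* When: only act if the imbalance exceeds a threshold.
 * Where: pick the least-loaded node with enough memory and fresh data.
 * Returns the target node id, or -1 to stay put. */
int pick_target(double local_load, const struct node_info *nodes, size_t n,
                const struct proc_info *candidate)
{
    const double threshold = 1.5;   /* "when": tolerate small differences */
    const int    max_age   = 8;     /* ignore stale gossip entries        */
    int best = -1;
    double best_load = local_load / threshold;

    for (size_t i = 0; i < n; i++) {
        if (nodes[i].age > max_age)
            continue;
        if (nodes[i].free_mem < candidate->mem_needed)
            continue;                /* "where": memory matters, too      */
        if (nodes[i].load < best_load) {
            best_load = nodes[i].load;
            best = nodes[i].id;
        }
    }
    return best;
}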
NODE ARCHITECTURE
(Diagram: Applications with MPI Library, Runtime, Checkpointing, and a Monitor on the compute cores; Service OS (Linux kernel) with Runtime, Proxies, Checkpoint, Platform Management including Decision Making and Gossip, a Monitor, and InfiniBand access on the service cores; Light-weight Kernel (L4) underneath)
MANAGEMENT
Decision Making
Hardware monitoring
Migration
Platform info (Gossip)
Resource Prediction
Application hints
Past behavior
Future behavior?
APPLICATION HINTS
Resource prediction: CPU cycles, cache misses, memory, energy?
THERMAL PLANNING
(Diagram sequence: work chunks placed on processing units over time, meeting at a barrier; slack before the barrier leaves room to shift work between units, e.g. for thermal reasons)
APPLICATION HINTS
Resource prediction: CPU cycles, cache misses, memory, energy?
Fault tolerance: which data to checkpoint (and when); ability to handle node failures (see the sketch below)
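As an illustration of what such hints could look like at the source level, here is a purely hypothetical C interface (every name is invented for illustration and does not correspond to an actual FFMK API) through which an application might announce its resource needs and checkpoint-relevant data to the OS:

/* app_hints.h: hypothetical hint interface; all names are invented for
 * illustration and are not an actual FFMK API. */
#ifndef APP_HINTS_H
#define APP_HINTS_H

#include <stddef.h>

/* Resource-prediction hint for the next phase of this rank. */
struct ffmk_phase_hint {
    double expected_cpu_seconds;   /* compute time until the next barrier */
    size_t expected_peak_memory;   /* bytes                               */
    double expected_dram_bw;       /* GB/s, rough cache-miss proxy        */
};

/* Tell the platform what the next phase will need (hypothetical). */
int ffmk_hint_next_phase(const struct ffmk_phase_hint *hint);

/* Mark a memory region whose contents must survive a node failure,
 * so the OS/runtime knows what to include in the next checkpoint. */
int ffmk_hint_checkpoint_region(void *addr, size_t len);

/* Ask to be notified when a node failure affects this application,
 * so it can shrink or redistribute its work. */
typedef void (*ffmk_failure_cb)(int failed_rank, void *user_data);
int ffmk_hint_on_failure(ffmk_failure_cb cb, void *user_data);

#endif /* APP_HINTS_H */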
NODE ARCHITECTURE
(Diagram: complete node architecture as built up on the previous slides: Applications, MPI Library, Runtime, Checkpointing, and Monitor on the compute cores; Service OS (Linux kernel), Runtime, Proxies, Checkpoint, Platform Management with Decision Making and Gossip, Monitor, and InfiniBand on the service cores; Light-weight Kernel (L4) underneath)
SUMMARY
FFMK prototype runs on a real HPC system
We build an OS/R for Exascale: load management, fault tolerance
Take the burden off application developers
ffmk.tudos.org