CARSTEN WEINHOLD, TU DRESDEN
A MICROKERNEL-BASED OPERATING SYSTEM FOR EXASCALE COMPUTING
Amnon Barak, Hebrew University of Jerusalem (HUJI)
Hermann Härtig, TU Dresden, Operating Systems Group (TUDOS)
Wolfgang Nagel, TU Dresden, Center for Information Services and HPC (ZIH)
Alexander Reinefeld, Konrad-Zuse-Zentrum für Informationstechnik Berlin (ZIB)
TRADITIONAL HPC
The ideal world assumption:
■ Identical, predictable, and reliable nodes
■ Fast and reliable interconnect
■ Balanced applications
■ Isolated partitions of fixed size
TRADITIONAL HPC
Fixed-size chunks of work
One thread per core
TRADITIONAL HPC
Systems software: optimize communication latency
Message passing uses polling
Batch scheduler for start / stop
Separate servers for I/O
Small OS on each node
No OS on critical path
LOAD
Application: CP2K
(Figure: trace of process IDs over time steps, and computation-time fraction per process over time steps)
Computation–communication ratio of CP2K on 512 cores
LOAD
Application: COSMO-SPECS+FD4
(Figure: computation-time fraction per process over time steps)
Computation–communication ratio of COSMO-SPECS+FD4 on 128 cores
REALITY CHECK
Application: COSMO-SPECS+FD4
Hand-balanced compute times of ranks per time step
Unbalanced compute times of ranks per time step
Computation–communication ratio of COSMO-SPECS+FD4 on 128 cores
REALITY CHECK
Application: PRIME
Unbalanced compute times of ranks per time step
Now think of: • Composite applications • In-situ visualization, etc.
FFMK
SPPEXA – Findings & Goals
Massive parallelism (on- and cross-chip) requires fundamentally new concepts
Not “racks without brains”, but software is the key to this paradigm shift
Fundamental research (DFG), in contrast to other, more application-oriented initiatives (German Federal Ministry of Education and Research)
Establish collaborative, interdisciplinary co-design of HPC applications and HPC methods
Focus on six research directions: computational algorithms; application software; programming; system software; data management and exploration; software tools
SPPEXA – Implementation
Two three-year funding phases
Overall budget of €3.7 M per year
Funded via DFG’s strategy fund
Interdisciplinary consortia of 3–5 groups
Consortia address at least two of SPPEXA's six research directions
Two-stage application process with (1) sketches and (2) full proposals
Global strategic coordination, following the established procedures of Collaborative Research Centres (SFB)
Close collaboration with respective international programmes intended
www.sppexa.de
SPPEXA – Chronology
2006: discussion in the German Research Foundation (DFG) on the necessity of a funding initiative for HPC software
2010: initiative out of the German HPC community, referring to increasing activities on HPC software elsewhere (USA: NSF, DOE; Japan; China; G8)
2010: discussion with DFG's Executive Committee, suggestion of a flexible, strategically initiated SPP
2011: submission of the proposal, international reviewing, and formal acceptance
2012: review of project sketches and full proposals
SPPEXA – Current Status
68 sketches handed in, overall volume of €19 M per year applied for
80 different universities, institutes and companies represented by 240 national and 15 international PIs
24 sketches invited for full proposals
13 full proposals accepted for funding
Launch of programme and projects in January 2013
German Priority Programme 1648 "Software for Exascale Computing"
Image: Boris Lehner für HLRS
Image: Heiner Igel
Image: Wolfgang E. Nagel
FFMK: A Fast and Fault-Tolerant Microkernel-Based Operating System for Exascale Computing
CHALLENGES
Dynamic applications & platforms
Increased fault rates
Power / Dark silicon
Heterogeneity (cores, memory, … )
(Diagram: grid of nodes, each running FFMK-OS)
NODE ARCHITECTURE
(Diagram: Applications, Runtime Support, and Monitoring on the compute cores; Service OS (Linux: drivers, etc.), Runtime, and Platform Management on the service cores; Light-weight Kernel (L4) underneath)
3 ABSTRACTIONS
Address spaces
Threads
IPC (message passing)
Light-weight Kernel (L4)
MESSAGE PASSING
App App
Light-weight Kernel (L4)
File System
Network I/O
Device Interrupt
BLOCK, POLL, IRET
• Intel Core i7 3770S @ 3100 MHz • No Hyperthreading, no Turboboost • No dynamic power management • Same socket
Measured wake-up latency (lower is better): Polling ≈ 0.1 µs · Block (L4) ≈ 3.2 µs · Block (Linux) ≈ 5.3 µs
• 64-bit Linux 3.11.6 (cpuidle.off=0, intel_idle.max_cstate=0)
• 64-bit L4/Fiasco.OC
Wake from interrupt on L4: 900 cycles, 0.3 µs
(best case, on Intel Core i7-4770 CPU @ 3.40GHz)
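To make the difference concrete, the following is a minimal sketch (plain POSIX threads, my illustration rather than the benchmark code behind the chart) of the two waiting styles being compared: a polling waiter spins on a flag and keeps its core busy, while a blocking waiter sleeps in the kernel until it is signalled.

/* block_vs_poll.c: contrast busy-wait polling with blocking on a condition
 * variable; illustrative sketch only, not the measurement code. */
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static atomic_int flag;                         /* polled by the spinning waiter */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
static int ready;                               /* guarded by m */

static void *spin_waiter(void *arg)             /* "polling": core stays busy */
{
    while (!atomic_load_explicit(&flag, memory_order_acquire))
        ;                                       /* lowest wake-up latency */
    return NULL;
}

static void *blocking_waiter(void *arg)         /* "block": core is freed */
{
    pthread_mutex_lock(&m);
    while (!ready)
        pthread_cond_wait(&c, &m);              /* sleeps in the kernel */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec pause = { 0, 1000000 };     /* 1 ms: let the waiter block */

    pthread_create(&t, NULL, spin_waiter, NULL);
    nanosleep(&pause, NULL);
    atomic_store_explicit(&flag, 1, memory_order_release);
    pthread_join(t, NULL);

    pthread_create(&t, NULL, blocking_waiter, NULL);
    nanosleep(&pause, NULL);
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&c);                    /* wake-up goes through the kernel */
    pthread_mutex_unlock(&m);
    pthread_join(t, NULL);
    return 0;
}

Blocking only pays off if the wake-up path is short, which is the point of the 0.3 µs interrupt wake-up figure for L4 above.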
LINUX ON L4
Linux OS
Linux App
File Network I/O
Light-weight Kernel (L4)
HYBRID SYSTEM
Real-time
Security: small Trusted Computing Base
Resilience: small Reliable Computing Base
uncritical, complex
critical, simple
HYBRID SYSTEM
uncritical, complex
critical, simple
File Network I/O
Service OS
Light-weight Kernel (L4)
HYBRID SYSTEM
Service OS
Service Proxy
File Network I/O
MPI App
MPI Lib
Infiniband
Light-weight Kernel (L4)
DRIVER REUSE
L4 App
libibverbs / User-space Driver
Linux Kernel
IB Core
Kernel Driver
/dev/ib0 / I/O
Proxy App
Light-weight Kernel (L4)
Msg Buffer 1
Msg Buffer 2
Msg Buffer 1
Msg Buffer 2
I/O
NODE ARCHITECTURE
(Diagram: Applications with MPI Library, Runtime, and Checkpointing on the compute cores; Service OS (Linux kernel) with Runtime, Proxies, Checkpoint, and InfiniBand access on the service cores; MPI traffic flows over InfiniBand; Light-weight Kernel (L4) underneath)
FAILURE RATES
Kento Sato et al., „Design and Modeling of a Non-blocking Checkpointing System“, SC’12
MTBF for Component Failures in an HPC System
Failed component     # of nodes affected   MTBF
PFS, core switch     1408                  65.10 days
Rack                 32                    86.90 days
Edge switch          16                    17.37 days
PSU                  4                     28.94 days
Compute nodes        1                     15.8 hours
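Assuming the listed failure sources are independent and each row already aggregates all instances of its component class, their failure rates simply add; a back-of-the-envelope combination (my arithmetic, not from the slide or the cited paper) shows that compute-node failures dominate:

\[
\frac{1}{\mathrm{MTBF}_{\mathrm{sys}}} = \sum_i \frac{1}{\mathrm{MTBF}_i}
\approx \frac{1}{65.10} + \frac{1}{86.90} + \frac{1}{17.37} + \frac{1}{28.94} + \frac{1}{15.8/24}
\approx 1.64\ \mathrm{day}^{-1}
\quad\Rightarrow\quad
\mathrm{MTBF}_{\mathrm{sys}} \approx 0.61\ \mathrm{days} \approx 14.7\ \mathrm{hours}
\]

With a whole-system MTBF in the range of hours, checkpoints have to be cheap enough to be taken frequently, which motivates the in-memory checkpointing on the next slide.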
In-memory XtreemFS volume for application-level checkpointing
RAID-5 erasure coding: recovery with 1 failed OSD
Demonstrator running BQCD code on a Cray XC30
XTREEMFS
(Diagram: Application (BQCD) writing checkpoints to an XtreemFS volume with one DIR and one MRC service and multiple OSDs)
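The RAID-5-style coding mentioned above survives one failed OSD because the parity stripe is the XOR of the data stripes. The sketch below shows only this principle in plain C (my illustration, not XtreemFS code; the real erasure coding works on striped objects spread across OSD servers).

/* xor_parity.c: RAID-5-style single-failure recovery via XOR parity. */
#include <stdio.h>
#include <string.h>

#define STRIPES 4          /* data stripes per parity group */
#define STRIPE_SIZE 8      /* bytes per stripe (tiny, for illustration) */

/* parity[j] = data[0][j] ^ data[1][j] ^ ... ^ data[STRIPES-1][j] */
static void encode(unsigned char data[STRIPES][STRIPE_SIZE],
                   unsigned char parity[STRIPE_SIZE])
{
    memset(parity, 0, STRIPE_SIZE);
    for (int i = 0; i < STRIPES; i++)
        for (int j = 0; j < STRIPE_SIZE; j++)
            parity[j] ^= data[i][j];
}

/* Rebuild the single lost stripe by XOR-ing the parity with the survivors. */
static void recover(unsigned char data[STRIPES][STRIPE_SIZE],
                    unsigned char parity[STRIPE_SIZE], int lost)
{
    memcpy(data[lost], parity, STRIPE_SIZE);
    for (int i = 0; i < STRIPES; i++)
        if (i != lost)
            for (int j = 0; j < STRIPE_SIZE; j++)
                data[lost][j] ^= data[i][j];
}

int main(void)
{
    unsigned char data[STRIPES][STRIPE_SIZE] = {
        "chunk-0", "chunk-1", "chunk-2", "chunk-3" };
    unsigned char parity[STRIPE_SIZE];

    encode(data, parity);
    memset(data[2], 0, STRIPE_SIZE);          /* simulate one failed OSD */
    recover(data, parity, 2);
    printf("recovered: %s\n", (char *)data[2]); /* prints "chunk-2" */
    return 0;
}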
BANDWIDTH
Figure 8: Aggregate Bandwidth Scalability of CRUISE. (a) Zin cluster (Linux): GB/s vs. number of processes, curves for memcpy, CRUISE, and ramdisk. (b) Sequoia cluster (IBM Blue Gene/Q): TB/s vs. number of nodes, curves for memcpy 64ppn, CRUISE 16/32/64ppn, and ramdisk 16ppn.
7.4 Large-Scale Evaluation
CRUISE is designed to be used with large-scale clusters that span thousands of compute nodes. We evaluated the scaling capacity of this framework, and we show the results in Figure 8. We conducted these evaluations on Zin and Sequoia. For each of these clusters, we measured the throughput of CRUISE with increasing number of processes. In these experiments, we configured CRUISE to allocate a persistent memory block per process. On Zin, each process writes a 128 MB file; on Sequoia, each writes a 50 MB file. We compare CRUISE to RAM disk and a memory-to-memory copy of data within a process' address space using memcpy(). Since CRUISE requires at least one memory copy to move data from the application buffer to its in-memory file storage, the memcpy performance represents an upper bound on throughput.

On Zin (Figure 8(a)), the number of processes writing to CRUISE was increased by a factor of two up to 8,192 processes along the x-axis. The y-axis shows the bandwidth (GB/s) in log scale. As the graphs indicate, perfect linear scaling can be observed on this cluster. Furthermore, CRUISE takes complete advantage of the memory system's bandwidth (the CRUISE plot overlaps the memcpy plot). The throughput of CRUISE at 8,192 processes is 17.6 TB/s, which is only slightly below the memcpy throughput of 17.7 TB/s. The throughput of RAM disk is nearly half that of CRUISE at 9.87 TB/s. These runs used 17.5% of the available compute nodes. Extrapolation of this linear scaling to the full 46,656 processes would lead to a throughput for CRUISE of over 100 TB/s.

Figure 8(b) shows the scaling trends on Sequoia. Because Sequoia is capable of 4-way simultaneous multi-threading, a total of 6,291,456 parallel tasks can be executed. The x-axis provides the node count for each data point, and the y-axis shows the bandwidth (TB/s) in log scale. For clarity, we only show the configurations that deliver the best results for aligned memcpy and RAM disk. We show the results when using 16, 32, and 64 processes per node for CRUISE. At the full-system scale of 6 million processes (64 processes/node), the aggregate aligned memcpy bandwidth reaches 1.21 PB/s. As observed in Figure 7(b), CRUISE nearly saturates this bandwidth to deliver a throughput of 1.16 PB/s when running with 32 processes per node. This is 20x faster than the system RAM disk, which provides a maximum throughput of 58.9 TB/s, and it is 1000x faster than the 1 TB/s parallel file system provided for the system.

8. Related Work
Linker support to intercept library calls has been around for a while. Darshan [8] intercepts an HPC application's calls to the file system using linker support to profile and characterize the application's I/O behavior. Similarly, fakechroot [2] intercepts chroot() and open() calls to emulate their functionality without privileged access to the system.

Other researchers have investigated saving files in memory for performance. The MemFS project from Hewlett Packard [19] dynamically allocates memory to hold files. However, there is no persistence of the files after a process dies and MemFS requires kernel support. McKusick et al. present an in-memory file system [16]. This effort also requires kernel support, and it requires copies from kernel buffers to application buffers which would cause high overhead.

MEMFS is a general-purpose, distributed file system implemented across compute nodes on HPC systems [27]. Unlike our approach, they do not optimize for the predominant form of I/O on these systems, checkpointing. Another general-purpose file system for HPC is based on a concept called containers which reside in memory [15]. While this work does consider optimizations for checkpointing, its focus is on asynchronous movement of data from compute nodes to other storage devices in the storage hierarchy of HPC systems. Our work primarily differs from these in that CRUISE is a file system optimized for fast node-local checkpointing.

Several efforts investigated checkpointing to memory in a manner similar to that of SCR [7, 9, 13, 23, 29, 30]. They use redundancy schemes with erasure encoding for higher resilience. These works differ from ours in that they use system-provided in-memory or node-local file systems, such as RAM disk, to store checkpoints. Rebound checkpoints to volatile memory but focuses on single many-core nodes and optimizes for highly-threaded applications [6].
Raghunath Rajachandrasekar et al., A 1 PB/s File System to Checkpoint Three Million MPI Tasks, HPDC’13
NODE ARCHITECTURE
(Diagram: Applications with MPI Library, Runtime, and Checkpointing on the compute cores; Service OS (Linux kernel) with Runtime, Proxies, Checkpoint, Platform Management, and InfiniBand access on the service cores; Light-weight Kernel (L4) underneath)
OVERDECOMPOSITION
(Diagram sequence: work chunks of unequal length on processing units over time, all meeting at a barrier; overdecomposition splits the work into more, smaller chunks per core so that idle time before the barrier shrinks)
OVERDECOMPOSITION
(Chart: total runtime of COSMO-SPECS+FD4 (unbalanced, HT) for Baseline, 2x, 4x, 8x, and 16x overdecomposition; y-axis 0–1000 s)
Oversubscribed runs: 32–256 MPI ranks on 4 quad-core nodes (w/ polling)
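An idealized model (my addition, ignoring scheduling and communication overhead) of why oversubscription can help: with one rank per core, the barrier is reached only when the slowest rank finishes, whereas with enough independent ranks per core the OS can keep every core busy until the total work is done,

\[
T_{\mathrm{1\,rank/core}} = \max_i t_i
\qquad \mathrm{vs.} \qquad
T_{\mathrm{oversubscribed}} \;\gtrsim\; \frac{1}{P} \sum_i t_i ,
\]

where t_i is the compute time of rank i and P the number of cores. The runs above still use polling, so waiting ranks keep occupying their cores and little of this potential is realized, which is the motivation for the runs without polling on the following slides.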
BUSY WAITING
Busy waiting = Computation
OVERDECOMPOSITION
(Chart: total runtime of COSMO-SPECS+FD4 (unbalanced, HT) for Baseline, 2x, 4x, 8x, and 16x overdecomposition, with the polling (busy waiting) share of each bar highlighted; y-axis 0–1000 s)
Oversubscribed runs: 32–256 MPI ranks on 4 quad-core nodes (w/ polling)
OVERDECOMPOSITION
(Chart: total runtime of COSMO-SPECS+FD4, balanced and unbalanced, no HT, for Baseline, 2x, 4x, 8x, and 16x overdecomposition; y-axis 0–300 s)
Oversubscribed runs: 16–256 MPI ranks on 4 quad-core nodes (w/o polling)
OVERDECOMPOSITION
(Chart: total runtime of CP2K (unbalanced, no HT) for Baseline, 2x, 4x, 8x, and 16x overdecomposition; y-axis 0–350 s)
Oversubscribed runs: 16–256 MPI ranks on 4 quad-core nodes (w/o polling)
OVERDECOMPOSITION
(Diagram: overdecomposed work on processing units over time, meeting at a barrier)
With MPI:
• Do not: busy wait (except very shortly)
• Do: block in the kernel
• Needs: fast unblocking of threads when a message comes in
• We build: a shortcut from the IB driver into MPI threads (no Linux!), see the sketch below
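For reference, this is how blocking instead of busy-waiting looks with stock libibverbs on Linux (a hedged sketch of the standard completion-channel mechanism, not FFMK's L4-side shortcut): the process arms the completion queue for notification and then sleeps in ibv_get_cq_event() until a completion arrives.

/* cq_wait.c: wait for InfiniBand completions without busy polling.
 * Sketch of the standard libibverbs completion-channel pattern; the FFMK
 * shortcut replaces the Linux-side wake-up path, but the idea is the same. */
#include <infiniband/verbs.h>

/* Busy-waiting variant: burns the core until a completion shows up. */
int wait_polling(struct ibv_cq *cq, struct ibv_wc *wc)
{
    int n;
    do {
        n = ibv_poll_cq(cq, 1, wc);       /* non-blocking poll */
    } while (n == 0);
    return n;                              /* <0 on error, 1 on success */
}

/* Blocking variant: the thread sleeps in the kernel until notified. */
int wait_blocking(struct ibv_comp_channel *ch, struct ibv_wc *wc)
{
    struct ibv_cq *cq;
    void *ctx;

    if (ibv_get_cq_event(ch, &cq, &ctx))   /* sleeps until an event arrives */
        return -1;
    ibv_ack_cq_events(cq, 1);
    ibv_req_notify_cq(cq, 0);              /* re-arm for the next event */
    return ibv_poll_cq(cq, 1, wc);         /* fetch the actual completion */
}

/* Setup (error handling omitted): create the channel, attach the CQ, arm it. */
struct ibv_cq *make_cq(struct ibv_context *ctx, struct ibv_comp_channel **ch)
{
    *ch = ibv_create_comp_channel(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 64, NULL, *ch, 0);
    ibv_req_notify_cq(cq, 0);              /* request the first notification */
    return cq;
}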
CENTRALIZED
(Diagram: a central instance collects status information from all nodes)
DECENTRALIZED
(Diagram sequence: Node 1, Node 2, …, Node n exchange information pairwise instead of reporting to a central instance)
GOSSIP SCALABILITY
(Figure 3: PTRANS performance in GB/s (higher is better); Figure 4: MPI-FFT runtime (lower is better); Figure 5: COSMO-SPECS+FD4 runtime (lower is better); each measured on 1024, 2048, 4096, and 8192 nodes, comparing runs without gossip against gossip intervals from 1–2 ms up to 1024 ms; the inner red part of the runtime bars indicates the MPI portion)
MPI-FFT running on BG/Q “JUQUEEN”
Low overhead: No noticeable overhead at gossip interval of 64–256 ms
COSMO-SPECS+FD4 on BG/Q “JUQUEEN“
Quality of information: Average age at nodes in the order of 2–3 s with gossip interval of 256 ms
E. Levy, A. Barak, A. Shiloh, M. Lieber, C. Weinhold, and H. Härtig, „Overhead of a Decentralized Gossip Algorithm on the Performance of HPC Applications“, ROSS 2014
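To make the decentralized monitoring concrete, here is a minimal sketch of one push-gossip round along the lines described in the cited papers (my simplification with hypothetical names; the real algorithm adds colonies, masters, and richer per-entry payloads): every node keeps a vector of per-node entries with an age, refreshes its own entry each round, and pushes a window of its youngest entries to one randomly chosen node, which keeps whichever copy is younger.

/* gossip_push.c: one round of a simplified push-gossip protocol for
 * decentralized cluster monitoring (illustrative sketch only). */
#include <stdlib.h>
#include <string.h>

#define N_NODES 1024          /* nodes participating in the gossip  */
#define WINDOW  64            /* youngest entries pushed per round  */

struct entry {
    int    node;              /* node this entry describes              */
    int    age;               /* rounds since that node produced it     */
    double load;              /* example payload: load of that node     */
};

struct node_state {
    int          self;
    struct entry view[N_NODES];   /* local view, indexed by node id     */
};

static int by_age(const void *a, const void *b)
{
    return ((const struct entry *)a)->age - ((const struct entry *)b)->age;
}

/* One round on node s: refresh the own entry, age all others, then push
 * the WINDOW youngest entries to one randomly chosen peer. The direct
 * write into the peer's memory stands in for a small network message. */
void gossip_round(struct node_state *s, double own_load,
                  struct node_state *peers[])
{
    struct entry tmp[N_NODES];

    for (int i = 0; i < N_NODES; i++)
        s->view[i].age++;
    s->view[s->self].age  = 0;          /* own information is always fresh */
    s->view[s->self].load = own_load;

    memcpy(tmp, s->view, sizeof(tmp));
    qsort(tmp, N_NODES, sizeof(struct entry), by_age);  /* youngest first */

    struct node_state *peer = peers[rand() % N_NODES];  /* random target */
    for (int i = 0; i < WINDOW; i++) {
        struct entry *known = &peer->view[tmp[i].node];
        if (tmp[i].age < known->age)    /* merge: keep the younger entry */
            *known = tmp[i];
    }
}

The gossip interval on the previous slide corresponds to how often gossip_round() runs; shorter intervals give fresher information at the cost of more background traffic.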
GOSSIP VS FAULTS
Table 6: Simulations of the average master's age with failed nodes.

Failed nodes         Circulating local windows of size
per colony           16      32      64      128     256
0                    11.74   9.67    8.66    8.20    8.07
1                    11.71   9.72    8.67    8.21    8.07
2                    11.75   9.68    8.70    8.21    8.08
4                    11.81   9.73    8.70    8.23    8.11
8                    11.83   9.79    8.72    8.28    8.17
16                   11.95   9.90    8.79    8.34    8.20
32                   12.12   10.05   8.96    8.48    8.36
Standard deviation   0.49    0.42    0.37    0.36    0.36
Increase rate        3.2%    3.9%    3.5%    3.4%    3.6%
5.2.2 Using multiple masters

For increased resilience, the algorithm must continue to work even when the master fails. We therefore extended the push algorithm to have several masters, where colony nodes send their global windows to a randomly selected master. In addition, the masters exchange their latest information, whereby each master sends one master window to a randomly selected master each unit of time (not necessarily the same as the unit of time of the colonies). The master window contains a fixed number of the youngest entries per colony.

Given a cluster with m masters and assuming that each colony node sends a global window with probability k/n, on average each master receives k/m global windows from each colony each unit of time. The analysis performed for a single master can now be used by multiplying k in the single-master analysis by m, the number of masters.

We verified the feasibility of the above extensions by running simulations with various sets of parameters and found that the exchange of information between the masters significantly improves the average master-age. Both the size and the frequency of sending master windows contributed to this improvement.

Table 7 shows the simulation results of the average master-age with different numbers of masters and master-failures. For this table, the following parameters were selected: colony size n = 1,024 nodes; local windows of 64 entries; global windows of quarter-colony size, i.e. 256 entries, and master windows of the same size per colony; update rate k = m; and the masters' unit of time is identical to that of the colonies. Each result in Table 7 shows the average of 500 simulations.

The columns of Table 7 show that the penalty when masters fail is not too high: for example only 17.1% when 4 out of 8 masters fail. This demonstrates the resilience of the extended algorithm. The main diagonal, which presents the average master-age when only one node is active, is almost fixed for any number of nodes.

An interesting observation from Table 7 is that the average master-age always decreases as the number of masters increase. We would expect the average master-age to drastically increase as […]
Average age at master (1024 nodes per colony)
Gossip is fault tolerant: only a slight increase in average age when a substantial number of nodes fail (up to 32 of 1024 in each colony)
A. Barak, Z. Drezner, E. Levy, M. Lieber, and A. Shiloh, „Resilient gossip algorithms for collecting online management information in exascale clusters“, Concurrency and Computation: Practice and Experience, 2015
DECENTRALIZED
(Diagram: Node 1, Node 2, …, Node n)
When (MOSIX): a load difference is discovered
Where (MOSIX): memory, cycles, communication
Which (MOSIX): past behavior predicts future behavior
DECENTRALIZED
(Diagram: Node 1, Node 2, …, Node n)
When: + anomaly anticipated
Where: + topology
Which: + application knowledge (see the sketch below)
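A minimal sketch of how such a decision step could look (hypothetical function and field names, loosely following the MOSIX-style criteria listed above; not FFMK's actual decision-making code):

/* decide_migration.c: toy "when / where / which" migration decision. */
#include <stddef.h>

struct node_info {            /* one gossip entry about a remote node */
    int    id;
    double load;              /* e.g. runnable threads per core       */
    double free_mem;          /* bytes                                */
    int    age;               /* gossip rounds since this was produced */
};

struct proc_info {            /* one migratable rank/process */
    int    id;
    double mem_needed;        /* bytes */
    double past_cpu_share;    /* "past predicts future" estimate      */
};

/* When: only act if the imbalance exceeds a threshold.
 * Where: pick the least-loaded node with enough memory and fresh data.
 * Returns the target node id, or -1 to stay put. */
int pick_target(double local_load, const struct node_info *nodes, size_t n,
                const struct proc_info *candidate)
{
    const double threshold = 1.5;   /* "when": tolerate small differences */
    const int    max_age   = 8;     /* ignore stale gossip entries        */
    int best = -1;
    double best_load = local_load / threshold;

    for (size_t i = 0; i < n; i++) {
        if (nodes[i].age > max_age)
            continue;
        if (nodes[i].free_mem < candidate->mem_needed)
            continue;                /* "where": memory matters, too      */
        if (nodes[i].load < best_load) {
            best_load = nodes[i].load;
            best = nodes[i].id;
        }
    }
    return best;
}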
NODE ARCHITECTURE
(Diagram: Applications with MPI Library, Runtime, Checkpointing, and a Monitor on the compute cores; Service OS (Linux kernel) with Runtime, Proxies, Checkpoint, Platform Management including Decision Making and Gossip, a Monitor, and InfiniBand access on the service cores; Light-weight Kernel (L4) underneath)
MANAGEMENT
Decision Making
Hardware monitoring
Migration
Platform info (Gossip)
Resource Prediction
Application hints
Past behavior
Future behavior?
APPLICATION HINTS
Resource prediction: CPU cycles, cache misses, memory, energy?
THERMAL PLANNING
(Diagram sequence: work chunks placed on processing units over time, meeting at a barrier; slack before the barrier leaves room to shift work between units, e.g. for thermal reasons)
APPLICATION HINTS
Resource prediction: CPU cycles, cache misses, memory, energy?
Fault tolerance: which data to checkpoint (and when); ability to handle node failures (see the sketch below)
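As an illustration of what such hints could look like at the source level, here is a purely hypothetical C interface (every name is invented for illustration and does not correspond to an actual FFMK API) through which an application might announce its resource needs and checkpoint-relevant data to the OS:

/* app_hints.h: hypothetical hint interface; all names are invented for
 * illustration and are not an actual FFMK API. */
#ifndef APP_HINTS_H
#define APP_HINTS_H

#include <stddef.h>

/* Resource-prediction hint for the next phase of this rank. */
struct ffmk_phase_hint {
    double expected_cpu_seconds;   /* compute time until the next barrier */
    size_t expected_peak_memory;   /* bytes                               */
    double expected_dram_bw;       /* GB/s, rough cache-miss proxy        */
};

/* Tell the platform what the next phase will need (hypothetical). */
int ffmk_hint_next_phase(const struct ffmk_phase_hint *hint);

/* Mark a memory region whose contents must survive a node failure,
 * so the OS/runtime knows what to include in the next checkpoint. */
int ffmk_hint_checkpoint_region(void *addr, size_t len);

/* Ask to be notified when a node failure affects this application,
 * so it can shrink or redistribute its work. */
typedef void (*ffmk_failure_cb)(int failed_rank, void *user_data);
int ffmk_hint_on_failure(ffmk_failure_cb cb, void *user_data);

#endif /* APP_HINTS_H */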
NODE ARCHITECTURE
(Diagram: complete node architecture as built up on the previous slides: Applications, MPI Library, Runtime, Checkpointing, and Monitor on the compute cores; Service OS (Linux kernel), Runtime, Proxies, Checkpoint, Platform Management with Decision Making and Gossip, Monitor, and InfiniBand on the service cores; Light-weight Kernel (L4) underneath)
SUMMARY
FFMK prototype runs on a real HPC system
We build an OS/R for Exascale: load management, fault tolerance
Take the burden off application developers
ffmk.tudos.org