Taming the Beast - Some Thoughts On Exascale Resiliency
Dr. Peter Tröger, Senior Researcher, Operating Systems and Middleware Group, Hasso Plattner Institute, University of Potsdam
Dr. Peter Tröger | HPCS 2013, Helsinki
The Exascale Beast
-Projections look similar
-Millions of nodes
-Billions of concurrent activities
-Faults in the order of minutes
-Checkpointing in the order of hours (see the back-of-the-envelope sketch below)
-Logic soft errors and silent data corruption
-Software faults everywhere
-Power wall (20MW) for redundancy
-Resiliency for Exascale is a multi-dimensional beast
[C. Engelmann]
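To illustrate why minute-scale faults and hour-scale checkpoints collide, here is a back-of-the-envelope sketch using Young's approximation for the optimal checkpoint interval. The MTBF and checkpoint cost are assumed values matching the orders of magnitude above, not measured exascale data.

```python
from math import sqrt

# Assumed numbers for illustration only: a 30-minute system MTBF ("faults in the
# order of minutes") and a one-hour checkpoint cost ("checkpointing in the order
# of hours"). Young's approximation: tau_opt ~ sqrt(2 * C * MTBF).
mtbf_s = 30 * 60
checkpoint_cost_s = 60 * 60

tau_opt_s = sqrt(2 * checkpoint_cost_s * mtbf_s)
print(f"Optimal checkpoint interval: {tau_opt_s / 60:.0f} min")   # ~60 min
print(f"Checkpoint cost itself:      {checkpoint_cost_s / 60:.0f} min")
# The optimal interval is no larger than the checkpoint cost itself, i.e. the
# machine would spend essentially all of its time checkpointing or recovering.
```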
You are not alone ...
Dependability
-Umbrella term for operational requirements on a system
-„Trustworthiness of a computer system such that reliance can be placed on the service it delivers to the user“ [Laprie]
-Resiliency solution: A fault tolerance solution that allows the visibility of faults on uncommitted data [Bottomley]
[Figure: System Quality]
[Figure: dependability research over time — from hardware solutions for mission-critical single systems to software, hybrid, and combined solutions for large-scale distributed systems]
Fault Tolerance [Hanmer]
-Error recovery: Roll-forward, rollback, retry, failover, ...
-Checkpointing is a tool to implement rollback (see the sketch below)
-Spatial redundancy is a tool to implement failover
-Error mitigation: Marked data (IEEE 754-1985 NaN), error correcting codes, algorithm-based fault tolerance, ...
[Figure: fault-error-failure chain — normal operation, activation of a latent fault, error, error detection, then error recovery or error mitigation, otherwise failure]
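A minimal sketch of rollback recovery via checkpointing, purely illustrative: the names (compute_step, state.ckpt), the checkpoint interval, and the use of an exception for error detection are assumptions, not a real HPC checkpointer.

```python
import pickle

CHECKPOINT_FILE = "state.ckpt"     # assumed checkpoint location
CHECKPOINT_INTERVAL = 100          # iterations between checkpoints (assumed)

def compute_step(state):
    return state["value"] + 1.0    # placeholder for one real application iteration

def save_checkpoint(state):
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump(state, f)

def restore_checkpoint():
    with open(CHECKPOINT_FILE, "rb") as f:
        return pickle.load(f)

def run(total_steps):
    state = {"step": 0, "value": 0.0}
    save_checkpoint(state)
    while state["step"] < total_steps:
        try:
            state["value"] = compute_step(state)      # may raise on a detected error
            state["step"] += 1
            if state["step"] % CHECKPOINT_INTERVAL == 0:
                save_checkpoint(state)                # periodic checkpoint
        except RuntimeError:
            state = restore_checkpoint()              # rollback to last consistent state
    return state
```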
System View
[Figure: system layer stack — Hardware, Operating System / Hypervisor, Virtual Machine, Operating System, Cluster Framework, MPI, Library / Framework, Application — grouped into execution environment and programming model]
What is the best location for fault tolerance?
The Fancy Approach
-New resilient programming models + unreliable execution environments
-Power and cooling are hard enough already
-Delegate the fault tolerance problem to the application
-Processes need to fail fast; applications can then apply their own resiliency scheme (see the sketch after this list)
-Examples: Application-level checkpointing, User level failure mitigation (ULFM), new message passing facilities (e.g. Erlang)
-This demands active participation by HPC users
-Old codes are most likely to break completely
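As a rough illustration of the fail-fast idea (in the spirit of Erlang-style supervision, but sketched here with Python multiprocessing rather than an HPC runtime): a crashed worker is not patched up in place; an application-level supervisor restarts it or gives up. All names and the restart policy are assumptions.

```python
import multiprocessing as mp

def worker(task_id):
    # Fail fast: do not mask errors, let the process die and report via exit code.
    result = 1.0 / (task_id % 3)              # deliberately crashes when task_id % 3 == 0
    print(f"task {task_id} finished with {result}")

def supervise(task_id, max_restarts=3):
    # Application-level resiliency scheme: restart a crashed task a few times.
    for attempt in range(1, max_restarts + 1):
        p = mp.Process(target=worker, args=(task_id,))
        p.start()
        p.join()
        if p.exitcode == 0:
            return True
        print(f"task {task_id} crashed (attempt {attempt}), restarting")
    return False

if __name__ == "__main__":
    for t in range(1, 5):
        supervise(t)
```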
The Traditional Approach
-New resilient execution environments + established programming models
-Hardware level: Apart from interconnects, typically too costly; adds additional power demands
-OS / Cluster level: Page checkpointing, virtual clusters, ...
-MPI level: Coordinated checkpointing, message logging, data reliability, automatic path migration, FT-MPI, ... (sketched below)
-This approach seems to be the preferred one
-Old codes should be adoptable
-But which layer is the right one?
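As one concrete example at the MPI level mentioned above, a minimal sketch of coordinated checkpointing using mpi4py. It assumes the application has quiesced its own communication before the call; the file naming and state layout are illustrative, not a standard API.

```python
from mpi4py import MPI
import pickle

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def coordinated_checkpoint(state, step):
    # All ranks enter the checkpoint together, so the per-rank files form one
    # globally consistent snapshot (no application messages are assumed to be
    # in flight at this point).
    comm.Barrier()
    with open(f"ckpt_step{step}_rank{rank}.pkl", "wb") as f:
        pickle.dump(state, f)
    comm.Barrier()   # every rank has finished writing before computation resumes
```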
Coverage vs. Overhead
-Migration object: the entity moved between failover units at one system layer
-System layer as containment barrier
-Coverage of the layer
-Fault model from available data
-Monitoring granularity may prevent fault detection for lower levels
-Overhead of the layer
-Migration object granularity
-Prediction quality (from data) influences the false migration percentage (see the sketch below)
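A hedged sketch of how coverage and overhead could be weighed against each other for one layer. The parameters (failure rate, precision/recall of the failure predictor, migration and recovery times) are illustrative assumptions, not measurements.

```python
def expected_cost_per_hour(failure_rate, precision, recall, t_migrate, t_recover):
    """Rough expected time (minutes) lost per hour with proactive migration at one layer."""
    predicted = failure_rate * recall / precision      # alarms raised (true + false)
    false_share = 1.0 - precision                      # false migration percentage
    migration_cost = predicted * t_migrate             # time spent on (all) migrations
    missed = failure_rate * (1.0 - recall)             # failures the predictor missed
    recovery_cost = missed * t_recover                 # reactive recovery for misses
    return migration_cost + recovery_cost, false_share

# Example: 0.5 failures/hour, a mediocre predictor, fast migration, slow recovery
# (times in minutes). All numbers are assumptions for illustration.
cost, false_share = expected_cost_per_hour(0.5, precision=0.6, recall=0.7,
                                           t_migrate=2.0, t_recover=30.0)
print(f"{cost:.1f} min lost per hour, {false_share:.0%} of migrations are false alarms")
```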
Selection of a System Layer
-Strategy for identifying the right layer:
1.Choose a fault model; this excludes all lower system layers
2.Determine the average time ∆F between error and failure
3.Find the highest system layers with an average failover time below ∆F
4.Filter the remaining candidates by detection quality, failover speed and redundancy overhead (see the sketch below)
[Figure: candidate system layers — Hardware, Operating System / Hypervisor, Virtual Machine, Operating System, Cluster Framework, MPI, Library / Framework, Application]
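The selection strategy above can be read as a simple filter-and-rank procedure. The sketch below is purely illustrative: the layer list, field names, and all numbers are assumptions, not measurements.

```python
# Hypothetical layer data; failover times, detection quality and overhead are made up.
layers = [
    {"name": "Hardware",             "level": 0, "failover_s": 0.001, "detection_quality": 0.9, "overhead": 0.30},
    {"name": "OS / Hypervisor",      "level": 1, "failover_s": 5.0,   "detection_quality": 0.7, "overhead": 0.10},
    {"name": "Virtual Machine",      "level": 2, "failover_s": 30.0,  "detection_quality": 0.6, "overhead": 0.15},
    {"name": "MPI",                  "level": 3, "failover_s": 60.0,  "detection_quality": 0.5, "overhead": 0.05},
    {"name": "Application",          "level": 4, "failover_s": 300.0, "detection_quality": 0.4, "overhead": 0.02},
]

def select_layer(layers, fault_model_level, delta_f_s):
    # Step 1: the chosen fault model excludes all layers below it.
    candidates = [l for l in layers if l["level"] >= fault_model_level]
    # Steps 2-3: keep layers whose average failover time stays below delta_F.
    candidates = [l for l in candidates if l["failover_s"] < delta_f_s]
    # Step 4: rank remaining candidates by detection quality, speed and overhead.
    return sorted(candidates,
                  key=lambda l: (-l["detection_quality"], l["failover_s"], l["overhead"]))

print(select_layer(layers, fault_model_level=1, delta_f_s=120.0))
```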
The Fault Model
-The fault model determines:
-Error detection scheme
-Most effective fault tolerance scheme
-Sphere of redundancy
-Testing procedures
[Laprie & Kanoun]
„One of the main problems is that it is difficult to get fault details on very recent machines and to anticipate what kind of faults are likely to happen at Exascale.“
[Cappello]
„It is important to note that the number of failures with undetermined root cause is significant. [...] hardware and software are among the largest contributors to failures.“
[Schroeder & Gibson]
The Exascale Dilemma
-Everybody agrees that Exascale resiliency is a problem.
-The budget for solving this problem is very limited.
-Performance is the top priority.
-Your HPC supplier has this problem too.
-The key issue is uncertainty.
-Fault model, failure modes and rates, error propagation paths, ...
-But many people still try to find 'the' right answer.
-Based on incomplete knowledge.
-Let's give up on that.
[Figure: System Quality]
Our Proposal: Embrace the Uncertainty
-Create novel ways to deal with partial system knowledge
-Create incomplete fault models and start to use them
-Perform partial dependability assessments of designs
-Something in between purely qualitative and purely quantitative assessment
-Perform this iteratively
-Focus on relative comparisons
-Redundancy approach A is better than B
-Avoid the numbers game (e.g. MTTI)
-Make uncertainty explicit
[Image credit: http://amybrucker.com/]
Example 1: Anomaly Signals
-Extended version of anomaly detection approach by Oliver et al. (2010)
-Monitoring on different system levels has incompatible metrics
-Each error situation can best be identified by only one of the system layers
-Idea: Normalize and correlate health indicators across all system levels (see the sketch below)
[Figure: health indicators collected per layer — Hardware, Operating System / Hypervisor, Virtual Machine, Operating System, Cluster Framework, MPI, Library / Framework, Application — normalized and correlated into a common anomaly signal]
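A minimal sketch of the normalize-and-correlate idea: the per-layer indicator names and values are assumptions, and z-score normalization plus pairwise correlation stands in for whatever metric the actual approach uses.

```python
import numpy as np

# Illustrative per-layer health indicators on otherwise incompatible scales (assumed values).
indicators = {
    "hardware": np.array([61.0, 62.0, 80.0, 95.0, 96.0]),   # e.g. CPU temperature in °C
    "os":       np.array([0.2, 0.3, 0.7, 0.9, 0.95]),       # e.g. swap utilization
    "mpi":      np.array([5.0, 5.1, 9.0, 14.0, 15.0]),      # e.g. message latency in ms
}

def zscore(series):
    # Bring every indicator onto a common, unit-free scale.
    return (series - series.mean()) / series.std()

normalized = {layer: zscore(v) for layer, v in indicators.items()}

# Correlate every pair of layers; a jointly rising signal across layers is treated
# as one anomaly signal rather than several unrelated alarms.
names = list(normalized)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        r = np.corrcoef(normalized[a], normalized[b])[0, 1]
        print(f"{a} vs {b}: correlation {r:.2f}")
```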
Example 2: FuzzTrees
-Dependability modeling in failure space is widely established
-Fault trees, attack trees, FMEA
-Describe potential failure modes and how they may occur
-Current approaches assume a fixed design and well-known fault probabilities
-FuzzTrees: Extended version of fault tree analysis
-Make uncertainty about system configuration explicit
-Make uncertainty about failure rates explicit
-Still get some answers
[Figure: example fault tree — Server Failure caused by Primary / Secondary CPU Failure (p = 0.08 ± 0.008 each), Power Unit Failure (p = 0.15 ± 0.05, modeled as a k-out-of-N gate with N = 4-5 and k = N-2), or RAID 0 / RAID 1 Failure built from two Disc Failures (p = 0.12 ± 0.01 each)]
-'Hello World' example (interval propagation sketched below)
-Optional spare processor for failover
-Choice of RAID level
-Choice of power supply redundancy
-Consideration of cost factor
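A hedged sketch of how the interval-valued probabilities from this example could be propagated through AND/OR gates under an independence assumption. The gate structure is an illustrative reading of the figure, not the exact FuzzEd model, and the spare-CPU and RAID variants are treated as configuration choices.

```python
from math import prod

def interval(p, delta):
    return (p - delta, p + delta)

def and_gate(*events):
    # All inputs must fail; bounds multiply (inputs assumed independent).
    return (prod(l for l, _ in events), prod(h for _, h in events))

def or_gate(*events):
    # At least one input fails: 1 - prod(1 - p); monotone, so bounds map directly.
    return (1 - prod(1 - l for l, _ in events), 1 - prod(1 - h for _, h in events))

cpu   = interval(0.08, 0.008)   # probabilities taken from the example above
power = interval(0.15, 0.05)
disc  = interval(0.12, 0.01)

raid0    = or_gate(disc, disc)    # RAID 0: losing either disc breaks the array
raid1    = and_gate(disc, disc)   # RAID 1: both discs must fail
dual_cpu = and_gate(cpu, cpu)     # configuration with an optional spare CPU

for label, raid in (("RAID 0", raid0), ("RAID 1", raid1)):
    lo, hi = or_gate(dual_cpu, power, raid)
    print(f"Server failure with spare CPU and {label}: p in [{lo:.3f}, {hi:.3f}]")
```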
FuzzTrees
[FuzzEd screenshots: eight Server Failure variants — 1 or 2 CPUs, 4 or 5 power units (k/N: 2-4 or 3-5), RAID 0 or RAID 1 — fuzztrees.net/editor/42-49]
Online editor available at www.fuzztrees.net
Summary
-Exascale resiliency is about uncertainty
-Everything fails in completely unforeseeable ways
-Reactive fault tolerance does not scale
-There is more monitoring data available than smart approaches to mine it
-Surprisingly, industry agrees with most of this ...
-Promising new research directions
-Imprecise health indication with automated correlation
-Imprecise dependability modeling