1Software and Services Group 1
Execution Frontiers: CnC support for highly adaptive execution
Kath Knobe, Intel
12/07/12
Warning
• This is all high-level conceptual thinking
• Many details to be determined
• Today: just the basic idea without any concern for efficiency
• Lots of room for optimizing
Suggestions/comments more than welcome!
Motivation: Highly adaptive computing for exascale
Critical exascale issues (inspired by work on UHPC and X-Stack) require the ability to move currently executing parts of the app to another place in the platform or to a later time.
• Resilience
  − Fragile components
  − Lots of them
• Power management
  − Power components off
  − Power components down
• Self-aware computing
  − Modify mapping based on feedback
• Change of goals
  − Between power and time to solution, for example
Thesis: management of the execution frontiers in CnC is a mechanism supporting highly adaptive computing for exascale.
Checkpoint/restart
Hierarchical CnC
Hierarchical checkpoint/restart
− For faults
− For adaptive execution
2 passes
− Abstract: unlimited resources
− Actual: with resource constraints
Outline
• Abstract (platform has infinite memory and processors)
  − Semantic state
  − Checkpoint/restart
  − Hierarchical CnC
  − Hierarchical checkpoint/restart
• Actual (with resource constraints)
• Beyond faults
Semantics / execution model
[Diagram, built up over several slides: attributes of instances. Item: avail. Tag: avail. Step: controlReady, dataReady, ready, executed.]
2 levels:
• Graph level (above)
• User serial code level (below)
The primitive attributes come from below: available, executed.
The derived attributes propagate at this level: control_ready, data_ready, ready.
Execution frontier
• An execution frontier is a CnC program state:
  − The set of attributes of instances of steps, tags and items
  − The contents of available items
• CnC execution can proceed from an execution frontier
• Some examples of execution frontiers:
  − Normal program input (set of available items and tags)
  − Normal program output (set of available items and tags)
  − Any state during execution (more general)
• Perspective
  − Traditional focus:
    > Data structure is items; computation is step.
    > A step instance consumes and produces items.
  − Alternate view:
    > Data structure is execution frontier; computation is step, subgraph or full program.
    > Applying a computation to an execution frontier yields another execution frontier.
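The alternate view can be illustrated with a minimal Python sketch. It is not part of any CnC runtime: the `Frontier` class, its field names, and the `double_step` computation are all hypothetical, chosen only to show a computation that takes one execution frontier and yields another.

```python
from dataclasses import dataclass, field


@dataclass
class Frontier:
    """Hypothetical execution frontier: instance attributes plus item contents."""
    # (kind, collection, tag) -> set of attribute names, e.g. {"avail"}
    attrs: dict = field(default_factory=dict)
    # (collection, tag) -> contents of each available item
    items: dict = field(default_factory=dict)

    def with_attr(self, kind, collection, tag, attr):
        """Return a new frontier with one more attribute set on an instance."""
        attrs = {k: set(v) for k, v in self.attrs.items()}
        attrs.setdefault((kind, collection, tag), set()).add(attr)
        return Frontier(attrs, dict(self.items))

    def with_item(self, collection, tag, value):
        """Return a new frontier in which one more item is available."""
        out = self.with_attr("item", collection, tag, "avail")
        out.items[(collection, tag)] = value
        return out


def double_step(frontier, tag):
    """A computation applied to a frontier: gets item x, puts item y."""
    x = frontier.items[("x", tag)]
    out = frontier.with_item("y", tag, 2 * x)
    return out.with_attr("step", "double", tag, "executed")


# Normal program input is itself a frontier: available items and tags.
program_input = Frontier().with_item("x", 1, 21).with_attr("tag", "doubleTag", 1, "avail")
after = double_step(program_input, 1)
```

Because each operation returns a new `Frontier`, the input frontier is left untouched, so execution could proceed again from either one.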
Checkpoint/restart summary (abstract)
• Changes to the execution frontier are saved continuously as they occur
• Changes are saved in a less volatile “place”
• Asynchronous, no barriers
• No programmer involvement
• Saved state may not correspond to an actual state
• Can restart from any saved state
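One way to picture continuous saving is an append-only log of frontier changes. The sketch below is an assumption made for illustration only: the `CheckpointLog` class and its methods are hypothetical, and a Python list stands in for the less volatile place.

```python
class CheckpointLog:
    """Hypothetical append-only record of execution-frontier changes."""

    def __init__(self):
        self._changes = []  # stand-in for storage in a less volatile place

    def record(self, key, attr, value=None):
        """Save one change as it occurs; no barrier, no programmer call."""
        self._changes.append((key, attr, value))

    def restore(self, upto=None):
        """Rebuild a frontier from the first `upto` changes (all if None)."""
        attrs, items = {}, {}
        for key, attr, value in self._changes[:upto]:
            attrs.setdefault(key, set()).add(attr)
            if key[0] == "item" and attr == "avail":
                items[key[1:]] = value
        return attrs, items


log = CheckpointLog()
log.record(("item", "x", 1), "avail", 21)
log.record(("tag", "doubleTag", 1), "avail")
log.record(("item", "y", 1), "avail", 42)
log.record(("step", "double", 1), "executed")

early_attrs, early_items = log.restore(upto=2)  # restart from an earlier saved state
full_attrs, full_items = log.restore()          # restart from the latest saved state
```

In this sketch any prefix of the log is a saved state that execution could restart from, even if the running program never passed through exactly that state.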
Cholesky domain spec
• Control tags
  − CholeskyTag: iter
  − TrisolveTag: row, iter
  − UpdateTag: col, row, iter
• Compute steps
  − Cholesky: iter
  − Trisolve: row, iter
  − Update: col, row, iter
• Data item
  − Array: col, row, iter
Looks like a CnC spec at each level
[Diagram, built up over three slides. Elements shown: control tag <iterTag: iter> (labelled "iterations"); compute steps (cholesky:), (C: iter) and (TU:); control tag <rowTag: row>; compute steps (trisolve) and (U:).]
Executed semantics: leaf
COMPUTE STEP (trisolve: row)
Serial code below the step (schematic): get … get … … = .. + … * … / … = … if … put
Executed is a primitive attribute. It comes from below.
− Leaf: termination of the serial code below
Executed semantics: non-leaf
[Diagram elements: compute steps (TU:), (U:) and (trisolve); control tag <rowTag: row>.]
Executed is a primitive attribute. It comes from below.
− Leaf: termination of the serial code below
− Non-leaf: termination of the subgraph below
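The leaf and non-leaf cases can be sketched as one recursive check. This is a hedged illustration, not the CnC runtime: the `Node` class and the `serial_code_done` flag are hypothetical, and the sketch assumes the full set of children of a non-leaf node is already known.

```python
class Node:
    """Hypothetical node of a hierarchical CnC application."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])
        self.serial_code_done = False  # set from below when a leaf's code ends

    def executed(self):
        if not self.children:
            # Leaf: termination of the serial code below.
            return self.serial_code_done
        # Non-leaf: termination of the subgraph below.
        return all(child.executed() for child in self.children)


rows = [Node(f"trisolve:{r}") for r in range(3)]
tu = Node("TU", rows)

rows[0].serial_code_done = True
partly_done = tu.executed()   # one leaf finished, subgraph still running

for leaf in rows:
    leaf.serial_code_done = True
all_done = tu.executed()      # every leaf finished, so TU is executed
```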
Hierarchical CnC application: execution is at the leaves only
[Diagram labels: Cholesky, trisolve, update]
Hierarchical CnC application: intermediate nodes maintain state
− State of each iteration
− State of each row
Hierarchical view of the abstract platform tree
A node looks like a full machine at each level: a subtree of the memory hierarchy + the associated set of cores
[Diagram label: Hierarchical platform node]
Abstract platform
The depth and extent of the platform hierarchy correspond exactly to the depth and extent of the dynamic application.
The mapping is direct.
Hierarchical checkpoint/restart (abstract)
[Diagram, built up over four slides: a hierarchical application node and its checkpoint]
• The checkpoint for that application node resides at the parent place.
• Distinct checkpoints residing at a single place remain separate. We will see why later.
Abstract failure model
• The system knows if/when a node fails
  − We’re not talking about soft errors
• Abstract platform node fails temporarily then returns
Hierarchical checkpoint/restart (abstract)
1-level checkpoint
• Fault
• Fullstop
• Restart
Checkpoint in hierarchy
• Fault
• Fullstop
• Restart
From above: the step simply looks like it took longer than expected.
Checkpoint/fullstop at one node looks like checkpoint/continue for the whole program.
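The fault, fullstop, restart sequence at one node can be sketched as follows. Everything here is a hypothetical stand-in: `run_child`, `run_parent`, the squared values used as step outputs, and the `RuntimeError` used to simulate a fault are assumptions for illustration, not a description of an existing implementation.

```python
def run_child(work, checkpoint, fail_at=None):
    """Run one child node, saving each result in its checkpoint at the parent.

    Results already present in `checkpoint` are not recomputed.
    Raises RuntimeError at index `fail_at` to simulate a node fault.
    """
    for i, x in enumerate(work):
        if i in checkpoint:
            continue                      # saved before the fault
        if fail_at is not None and i == fail_at:
            raise RuntimeError("node fault")
        checkpoint[i] = x * x             # stand-in for a step's output item


def run_parent(children_work, faults):
    """Run every child; on a fault, fullstop and restart only that child."""
    checkpoints = {name: {} for name in children_work}  # separate per child
    restarts = 0
    for name, work in children_work.items():
        try:
            run_child(work, checkpoints[name], faults.get(name))
        except RuntimeError:
            restarts += 1
            run_child(work, checkpoints[name])  # restart from its checkpoint
    results = {
        name: [saved[i] for i in range(len(children_work[name]))]
        for name, saved in checkpoints.items()
    }
    return results, restarts


results, restarts = run_parent({"A": [1, 2, 3], "B": [4, 5, 6]}, {"B": 1})
```

From the parent's point of view the outcome for B is the same as in a fault-free run; the only visible difference is the extra time spent on the restart.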
Hierarchical checkpoint/restart: Summary
• Each node in a hierarchy has all the characteristics of a whole program checkpoint.
• Checkpoint/fullstop/restart at nodes in the hierarchy enables the application as a whole to adapt and continue through faults.
Outline
• Abstract
• Actual: with resources and resource constraints
  − Semantic state
  − Checkpoint/restore
  − Hierarchical CnC
  − Hierarchical checkpoint/restart
• Beyond faults
Semantic state for execution (limited memory)
• Checkpointed information leaves the trailing edge of the execution frontier
  − Dead tags
  − Dead items
  − Dead steps
  This is the motivation for the term “execution frontier” as opposed to “execution state”. It’s only the relevant frontier of the state.
• Dead is a derived attribute. It doesn’t propagate up from the children. It is derived independently within each (sub)program.
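A rough sketch of deriving "dead" for items within one (sub)program follows. The rule used here is an assumption made only for this example, since the slides do not define it: an item is treated as dead once every step instance that gets it has executed. The names `dead_items`, `trim`, and the `consumers` map are likewise hypothetical.

```python
def dead_items(consumers, executed):
    """Items whose every consuming step instance has executed.

    consumers: item key -> set of step instances that get it (assumed known)
    executed:  set of step instances that have the executed attribute
    """
    return {item for item, steps in consumers.items() if steps <= executed}


def trim(frontier_items, consumers, executed):
    """Drop dead items so only the relevant frontier stays in limited memory."""
    dead = dead_items(consumers, executed)
    return {k: v for k, v in frontier_items.items() if k not in dead}


frontier_items = {("x", 1): 21, ("x", 2): 5, ("y", 1): 42}
consumers = {
    ("x", 1): {("double", 1)},
    ("x", 2): {("double", 2)},
    ("y", 1): {("report", 1)},
}
executed = {("double", 1)}

live = trim(frontier_items, consumers, executed)
```

Only information local to this (sub)program is consulted, which matches the point that dead is derived independently at each level rather than propagated up from the children.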
Hierarchical CnC map to actual platform
Platform: limited depth / limited extent at each level
[Diagram, shown on three slides: application hierarchy mapped onto platform hierarchy]
• Flatten the depth
• Fold the extent
Actual failure model
• Platform node fails and may not return
  − or we don’t want to wait until it returns
• Restart is at some other platform node
Remapping
[Diagram, built up over four slides: application nodes A and B; platform places X and Y]
Map:
• Original checkpoint of B is at X
• New checkpoint of B is at Y
• Follows the new platform location
This is why we don’t want to merge checkpoints of the application children at the platform parent. We may want to relocate each child independently.
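Keeping the children's checkpoints separate makes relocating one child a small bookkeeping step. The sketch below assumes, for illustration only, that the checkpoints of both A and B start at place X; the `remap` function and the dictionary layout are hypothetical.

```python
def remap(checkpoints, placement, child, new_place):
    """Move one child's checkpoint so it follows the child's new location.

    checkpoints: place -> {child: saved frontier}, one entry per child
    placement:   child -> place currently holding its checkpoint
    """
    old_place = placement[child]
    saved = checkpoints[old_place].pop(child)
    checkpoints.setdefault(new_place, {})[child] = saved
    placement[child] = new_place


# Assumed starting point: the checkpoints of A and B both reside at X.
checkpoints = {"X": {"A": {"items": {"a": 1}}, "B": {"items": {"b": 2}}}}
placement = {"A": "X", "B": "X"}

remap(checkpoints, placement, "B", "Y")
```

Had the two checkpoints been merged into a single record at X, moving B alone would first require separating its state from A's.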
What do we have?
• A way of maintaining the execution frontier of
  − A running application
  − A running subgraph of an application
• A mechanism for taking an execution frontier and moving it
  − To another place
  − To a later time
• Use of this to cope with faults
Outline
• Abstract
• Actual: with resources and resource constraints
• Beyond faults
Adaptive execution
• If we can checkpoint and continue elsewhere on a fault, we can checkpoint and continue elsewhere for our own reasons. Big relevant exascale issues:
  − Resilience
    > Actual/predicted failures
  − Power management
  − Self-aware computing
  − Changes in goals
• Mechanism not policy!
• Status:
  − No staffing or funding yet.
Other uses of execution frontiers
• Mechanism for connecting reusable components
• Low priority app
  − Execute/checkpoint/restart one step at a time
  − Stop mid-step when high priority work arrives
• Long-lived app with very slowly arriving input
  − e.g., phylogenetic tree for SARS virus
• Debugging
  − View state
  − Reverse time (undo)
• Soft errors
  − Compute more than once. Compare
• Something like out-of-core computation but not baked into the application
Potential: Forms & operations
Forms
• As executing
  − general, arrays, trees…
• Serialized
• Streaming
• Encrypted
• Compressed
• Database
• Excel
• Human readable
Operations
• Save/restore
• Partition/specialize
  − At fork into distinct large subgraphs
• Merge
  − At join of distinct large subgraphs
• Send
• Compare (e.g., for fault tolerance)
• Explicitly modify (e.g., debug)
• Rename collections (e.g., for composition)
Relook at motivation: Highly adaptive computing for exascale
Critical exascale issues require the ability to move currently executing parts of the app to another place in the platform or to a later time.
• Resilience
  − Fragile components
  − Lots of them
• Power management
  − Power components off
  − Power components down
• Self-aware computing
  − Modify mapping based on feedback
• Change of goals
  − Between power and time to solution, for example
Looking forward to:
• Lowering the design
• Implementation
• Experimenting
Looking for feedback and collaborators