A new transformation scheme based on active replication strategy that tolerates failures
Hamoudi Kalla, Alain Girault and Yves Sorel
Pop Art team and Aoste team
Paris, April 23, 2004
Outline
• Introduction
• Model and problem
• State of the art
• The proposed fault-tolerant method for tolerating:
  • Processor failures
  • Communication media failures
  • Both processor and communication media failures
• Example
• Conclusion and future work
Introduction
[Figure: design flow. A compiler turns the high-level program into a model of the algorithm. This model, the architecture specification, the distribution constraints, the execution times, the real-time constraints and the failure specification are the inputs of the fault-tolerant distribution and scheduling heuristic, which produces a fault-tolerant distributed static schedule. A code generator then produces the fault-tolerant distributed code.]
Models: Application algorithm
a. Algorithm graph
[Figure: algorithm graph with operations I1, I2, A, B, C and O]
« I1 and I2 » are input operations (sensors)
« O » is an output operation (actuator)
« A, B and C » are computation operations
« A → C » is a data-dependence
Models: Hardware architecture
b. Architecture graph
[Figure: two architecture graphs over processors P1, P2 and P3. Architecture with point-to-point links: L12, L13 and L23. Architecture with multipoint links: B1. Inset: a processor with its operator, its memory and its communicators com1 and com2.]
« P1, P2 and P3 » are processors
« L12, L13 and L23 » are point-to-point communication links
« B1 » is a multipoint communication link
« com1 and com2 » are communicators
Models: Component failures
1. Only processors and communication media (point-to-point and multipoint) can fail.
2. Failures can be characterized as transient or permanent.
3. At most a fixed number of processors can fail-stop.
4. At most a fixed number of communication media can fail-stop, partially or completely.
[Figure: three panels (processor failures, partial communication media failures, complete communication media failures) drawn on architectures with processors P1, P2 and P3, point-to-point links L12, L13 and L23, and a multipoint medium m1.]
Problem
Find a distributed schedule of the algorithm on the architecture that is fault-tolerant to processor and communication media failures.
[Figure: the algorithm graph (I1, I2, A, B, C, O) and the architecture graph (P1, P2, P3, L12, L13, L23) are given to SynDEx* for distribution/scheduling.]
*SynDEx is a system-level CAD software tool for optimizing the implementation of real-time embedded applications on multicomponent architectures.
State of the art
“A system is fault tolerant if it can mask the presence of faults in the system by using hardware and/or software redundancy”
(a) Approaches for tolerating processor failures
(b) Approaches for tolerating communication media failures
[Figure: the algorithm graph (I1, I2, A, B, C, O) and the architecture graph (P1, P2, P3, L12, L13, L23) are given to SynDEx for distribution/scheduling; a processor P4 is added to the architecture graph.]
State of the art
[Figure: the algorithm graph (I1, I2, A, B, C, O) and the architecture graph (P1, P2, P3, L12, L13, L23) are given to SynDEx for distribution/scheduling; the operations I1 and A are shown a second time.]
State of the art
1. Active software redundancy (Hashimoto et al., 2002 (a); Fragopoulou and Akl, 1995 (b)):
(a) Multiple redundant copies of an operation are scheduled on different processors.
(b) Multiple redundant copies of a message are sent along disjoint paths.
2. Passive software redundancy (Qin et al., 2002 (a); Sriram et al., 1999 (b)):
(a) Each operation is replicated as a primary copy and backup copies, but only the primary copy is executed.
(b) One copy of the message is sent and, if it fails, another copy is transmitted.
Here (a) denotes approaches for tolerating processor failures and (b) approaches for tolerating communication media failures.
Outline
• Introduction
• Model and problem
• State of the art
• The proposed fault-tolerant method for tolerating:
  • Processor failures
  • Communication media failures (point-to-point links)
  • Both processor and communication media failures
• Example
• Conclusion and future work
The proposed fault-tolerant method
Principle (1):
We use active software redundancy for both operations and communications.
Motivations:
Makes the recovery from failures bounded.
Easier to integrate into SynDEx.
Makes the system predictable.
The proposed fault-tolerant method
Principle (2):
[Figure: design flow. A graph transformation takes the algorithm graph (Alg), the number NPF of processor failures and the number NLF of link failures, and produces a new Alg with redundancy and exclusion relations. The new Alg, the architecture graph (Arc) and the real-time and embedding constraints are the inputs of the fault-tolerant distribution and scheduling heuristic (SynDEx), which produces a fault-tolerant distributed real-time executive.]
The proposed fault-tolerant method
Algorithm graph transformation (1): Tolerating NPF processor failures
a. Initial algorithm sub-graph: A → B, with data-dependence « data »
b1. Final algorithm sub-graph: NPF+1 replicas of A and NPF+1 replicas of B
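The replication of operations in transformation (1) can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `replicate_operations`, the replica naming scheme `X_k`, and the choice to connect every replica of the producer to every replica of the consumer are assumptions, since the slide only states that each operation gets NPF+1 replicas.

```python
from itertools import product


def replicate_operations(edges, npf):
    """Return the data-dependences of a graph where each operation has npf + 1 replicas.

    edges: iterable of (producer, consumer) pairs.
    Replica k of operation X is named "X_k" (hypothetical naming).
    Assumption: every producer replica feeds every consumer replica.
    """
    if npf < 0:
        raise ValueError("npf must be non-negative")
    replicas = range(1, npf + 2)  # replica indices 1 .. npf + 1
    new_edges = []
    for src, dst in edges:
        for i, j in product(replicas, replicas):
            new_edges.append((f"{src}_{i}", f"{dst}_{j}"))
    return new_edges


# Initial sub-graph A -> B, tolerating one processor failure (NPF = 1).
print(replicate_operations([("A", "B")], npf=1))
# [('A_1', 'B_1'), ('A_1', 'B_2'), ('A_2', 'B_1'), ('A_2', 'B_2')]
```

With NPF = 0 the function returns a single edge between the only replica of each operation, so the untransformed graph is recovered up to renaming.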
The proposed fault-tolerant method
Algorithm graph transformation (2): Tolerating NLF link failures
a. Initial algorithm sub-graph: A → B, with data-dependence « data »
b2. Final algorithm sub-graph: one replica of A, one replica of B, and NLF+1 replicas of « data »
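As a rough sketch of transformation (2), the following Python function duplicates each data-dependence NLF+1 times while leaving the operations untouched. The function name `replicate_dependences` and the label suffix `_k` are hypothetical, and the sketch says nothing about which links carry each copy; that choice is left to the distribution and scheduling heuristic.

```python
def replicate_dependences(edges, nlf):
    """Return nlf + 1 copies of every data-dependence.

    edges: iterable of (producer, consumer, label) triples.
    Copy k of a dependence labelled "data" is labelled "data_k" (hypothetical naming).
    """
    if nlf < 0:
        raise ValueError("nlf must be non-negative")
    copies = []
    for src, dst, label in edges:
        for k in range(1, nlf + 2):  # copy indices 1 .. nlf + 1
            copies.append((src, dst, f"{label}_{k}"))
    return copies


# Initial sub-graph A -data-> B, tolerating one link failure (NLF = 1).
print(replicate_dependences([("A", "B", "data")], nlf=1))
# [('A', 'B', 'data_1'), ('A', 'B', 'data_2')]
```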
The proposed fault-tolerant method
Algorithm graph transformation (3): Tolerating NPF processor and NLF link failures (NPF=1 and NLF=1)
a. Initial algorithm sub-graph: A → B, with data-dependence « data »
b. Operations redundancy: two replicas of A and two replicas of B
c. Data-dependence redundancy: the two replicas of A and a replica Bi of B, linked by « data »
d. Data-dependence distribution (1): the two replicas of A, a replica Bi of B, « data » and a routing operation R
e. Data-dependence distribution (2): a replica of A, the routing operation R and a replica Bi of B, with « data »
The proposed fault-tolerant method
Algorithm graph transformation (4): Tolerating NPF processor and NLF link failures (NPF=1 and NLF=1)
e. Data-dependence distribution (2): replica A1 of A, routing operation R and a replica Bi of B, with « data »
[Figure: Case 1 and Case 2, two sub-graphs over the replicas A1 and A2 of A, the replicas B1 and B2 of B, the routing operation R and « data »]
The proposed fault-tolerant method
Algorithm graph transformation (5): Tolerating NPF processor and NLF link failures (NPF>=1 and NLF>=1)
a. Initial algorithm sub-graph: A → B, with data-dependence « data »
b. Final algorithm sub-graph: NPF+1 replicas of A, NPF+1 replicas of B, and NLF routing operations R
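The general case can be illustrated with a small Python sketch that builds one possible transformed sub-graph and counts the inputs of a replica of B. Several points are assumptions, because the wiring cannot be recovered from the slide text: the routing operations are created per replica of B, each routing operation is fed by an arbitrarily chosen replica of A, and the names `transform_subgraph`, `inputs_received` and `R{k}_{consumer}` are hypothetical. The only property the sketch is meant to show is that each replica of B ends up with NPF+NLF+1 incoming copies of the data.

```python
def transform_subgraph(npf, nlf, producer="A", consumer="B"):
    """Build the edges of one possible transformed sub-graph (assumed wiring)."""
    producers = [f"{producer}{i}" for i in range(1, npf + 2)]
    consumers = [f"{consumer}{j}" for j in range(1, npf + 2)]
    edges = []
    for c in consumers:
        # Direct copies: every replica of the producer sends to this replica.
        for p in producers:
            edges.append((p, c))
        # Routed copies: nlf routing operations relay one more copy each.
        for k in range(1, nlf + 1):
            router = f"R{k}_{c}"
            source = producers[(k - 1) % len(producers)]  # arbitrary choice (assumption)
            edges.append((source, router))
            edges.append((router, c))
    return edges


def inputs_received(edges, node):
    """Number of incoming data-dependences of a node."""
    return sum(1 for _, dst in edges if dst == node)


edges = transform_subgraph(npf=1, nlf=1)
print(inputs_received(edges, "B1"))  # 3, that is NPF + NLF + 1
```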
The proposed fault-tolerant method
[Figure: design flow. A graph transformation takes the initial sub-graph A → B (data-dependence « data »), the number NPF of processor failures and the number NLF of link failures, and produces a new Alg with redundancy and exclusion relations: NPF+1 replicas of A, NPF+1 replicas of B and NLF routing operations R. The new Alg, the architecture graph Arc and the real-time and embedding constraints are the inputs of the fault-tolerant distribution and scheduling heuristic (SynDEx), which produces a fault-tolerant distributed real-time executive.]
The proposed fault-tolerant method
Implementation
B1 receives its input data NPF+NLF+1 times (here NPF=1 and NLF=1, so three times); as soon as it receives the first input, B1 is executed, and it ignores the later inputs.
start time (B1) = min ( end communication [A1, A2, R] )
[Figure: a transformed algorithm sub-graph (replicas A1 and A2 of A, a routing operation R, replica B1 of the two replicas of B, « data ») and an architecture graph (processors P1, P2, P3 and P4; links L12, L14, L23, L24 and L34) are given to SynDEx, which produces a temporary schedule showing A1, A2, R, the « data » communications and B1 along a time axis.]
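The rule "start time (B1) = min ( end communication [A1, A2, R] )" can be written as a short Python function. This is a minimal sketch: the function name `start_time`, the use of `None` for a copy that never arrives, and the numeric dates are illustrative assumptions, not values from the slides.

```python
def start_time(end_of_communication):
    """Earliest date at which the consumer can start.

    end_of_communication maps each sender (A1, A2, R, ...) to the date at which
    its copy of the data arrives, or to None if the copy is lost to a failure.
    The consumer starts on the first copy and ignores the later ones.
    """
    arrived = [t for t in end_of_communication.values() if t is not None]
    if not arrived:
        raise RuntimeError("no copy of the input data arrived")
    return min(arrived)


# Made-up arrival dates for the three copies received by B1.
arrivals = {"A1": 7.0, "A2": 5.0, "R": 9.0}
print(start_time(arrivals))  # 5.0

# If the copy sent by A2 is lost, B1 starts on the next copy to arrive.
arrivals["A2"] = None
print(start_time(arrivals))  # 7.0
```

Because no timeout or retransmission is involved, a failure only delays the start of B1 until the next redundant copy arrives, which is what keeps the recovery time bounded.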
Outline
• Introduction
• Model and problem
• State of the art
• The proposed fault-tolerant method for tolerating:
  • Processor failures
  • Communication media failures (multipoint links)
  • Both processor and communication media failures
• Example
• Conclusion and future work
The proposed fault-tolerant method
1. We use the active software redundancy of operations, where each operation is replicated on NPF+1 different processors to tolerate NPF processor failures.
[Figure: an algorithm sub-graph with replicas B1 and B2, an architecture graph with processors P1, P2, P3 and P4, and the resulting temporary schedule]
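The constraint that the NPF+1 replicas of an operation sit on different processors can be sketched as follows. The helper names `place_replicas` and `survives` are hypothetical, and the placement simply takes the first NPF+1 processors; an actual heuristic would choose processors so as to minimize the schedule length.

```python
def place_replicas(operation, processors, npf):
    """Assign the npf + 1 replicas of an operation to distinct processors."""
    if len(set(processors)) < npf + 1:
        raise ValueError("need at least npf + 1 distinct processors")
    distinct = list(dict.fromkeys(processors))  # drop duplicates, keep order
    return {f"{operation}{k + 1}": distinct[k] for k in range(npf + 1)}


def survives(placement, failed_processors):
    """True if at least one replica runs on a processor that has not failed."""
    return any(proc not in failed_processors for proc in placement.values())


placement = place_replicas("B", ["P1", "P2", "P3", "P4"], npf=1)
print(placement)                          # {'B1': 'P1', 'B2': 'P2'}
print(survives(placement, {"P1"}))        # True: B2 still runs on P2
print(survives(placement, {"P1", "P2"}))  # False: two failures exceed NPF = 1
```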
The proposed fault-tolerant method
2. Use the passive software redundancy of communications.
3. Split each data communication into NLF messages (data fragmentation).
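Data fragmentation can be pictured with a small Python helper that splits a payload into a given number of messages. The slides do not say how fragments are sized or ordered, so the near-equal contiguous split below, and the names `fragment` and `reassemble`, are assumptions made only for illustration.

```python
def fragment(payload, n_fragments):
    """Split a bytes payload into n_fragments contiguous, near-equal messages."""
    if n_fragments < 1:
        raise ValueError("n_fragments must be at least 1")
    size, extra = divmod(len(payload), n_fragments)
    fragments = []
    start = 0
    for k in range(n_fragments):
        # The first `extra` fragments carry one additional byte.
        end = start + size + (1 if k < extra else 0)
        fragments.append(payload[start:end])
        start = end
    return fragments


def reassemble(fragments):
    """Concatenate the fragments in order to rebuild the payload."""
    return b"".join(fragments)


parts = fragment(b"abcdefg", 3)
print(parts)              # [b'abc', b'de', b'fg']
print(reassemble(parts))  # b'abcdefg'
```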
The proposed fault-tolerant method
Why data fragmentation?
1. Distinction between complete and partial communication link failures
2. Enable rapid recovery from processor and communication link failures
The proposed fault-tolerant method
1. Recovery from processor failures
The proposed fault-tolerant method
2. Recovery from partial communication link failures
The proposed fault-tolerant method
3. Recovery from complete communication media failures
Example (1)
Example (2)
Conclusion and future work
Result
A new method to tolerate both communication link and processor failures in distributed real-time systems, which may reduce the overhead of the recovery from failures.
Future work
Benchmarks.
Using passive redundancy to tolerate communication link failures.
Taking into account sensor and actuator failures.
References
[Hashimoto et al., 2002] Hashimoto, K., Tsuchiya, T., and Kikuno, T. (2002). Effective scheduling of duplicated tasks for fault tolerance in multiprocessor systems. IEICE Transactions on Information and Systems.
[Fragopoulou and Akl, 1995] Fragopoulou, P. and Akl, S.G. (1995). Fault tolerant communication algorithms on the star network using disjoint paths. In Proceedings of the 28th Hawaii International Conference on System Sciences, HICSS’95, Kingston, Canada.
[Qin et al., 2002] Qin, X., Jiang, H., and Swanson, D.R. (2002). An efficient fault-tolerant scheduling algorithm for real-time tasks with precedence constraints in heterogeneous systems. In Proceedings of the 31st International Conference on Parallel Processing, Vancouver, Canada.
[Sriram et al., 1999] Sriram, R., Manimaran, G., and Murthy, C.S.R. (1999). An integrated scheme for establishing dependable real-time channels in multihop networks. In Proc. ICCCN, pages 528–533.
Questions?