Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Dependable Computer
Systems
Part 6b: System Aspects
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Contents
• Synchronous vs. Asynchronous Systems
• Consensus
• Fault-tolerance by self-stabilization
• Examples
• Time-Triggered Ethernet (FT Clock Synchronization)
• Self-Stabilization in the Time-Triggered Architecture|
• Traffic Policing
• System Architectures
2
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Synchronous vs. asynchronous
systems
3
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
• Synchronous processors:
A processor is said to be synchronous if it makes at least one
processing step during real-time steps (or if some other
computer in the system makes s processing steps).
• Bounded communication:
The communication delay is said to be bounded if any
message sent will arrive at its destination within real-time
steps (if no failure occurs).
• Synchronous system:
A system is said to be synchronous if its processors are
synchronous and the communication delay is bounded.
Real-time systems are per definition synchronous.
• Asynchronous systems:
A system is said to be asynchronous if either its processors
are asynchronous or the communication delay is unbounded
4
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Consensus
5
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Consensus
• Each processor starts a protocol with its local input value,
which is sent to all other processors in the group, fulfilling the
following properties:
• Consistency: All correct processors agree on the same
value and all decisions are final.
• Non-triviality: The agreed-upon input value must have
been some processors input (or is a function of the
individual input values).
• Termination: Each correct processor decides on a value
within a finite time interval.
6
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Consensus (cont.)
• The consensus problem under the assumption of byzantine
failures was first defined in 1980 in the context of the SIFT
project which was aimed at building a computer system with
ultra-high dependability. Other names are
• byzantine agreement or byzantine general problem
• interactive consistency
7
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Impossibility of deterministic
consensus in asynch. systems
• asynchronous systems cannot achieve consensus by a
deterministic algorithm in the presence of even one crash
failure of a processor
• it is impossible to differentiate between a late response and a
processor crash
• by using coin flips, probabilistic consensus protocols can
achieve consensus in a constant expected number of rounds
• failure detectors which suspect late processors to be crashed
can also be used to achieve consensus in asynchronous
systems
8
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Impossibility of deterministic
consensus in asynch. systems (cont.)
n 3t + 1 processors are necessary to tolerate t failures
9
0 1
Maj(0, 0, 1) = 0 Maj(0, 1, 1) = 1
Round 1:
0
0
1
1
F
0 1
Maj(0, 0, 1) = 0 Maj(0, 1, 1) = 1
Round 2:
0
0
1
1
F
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Fault-tolerance by self-stabilization
10
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Fault-tolerance by self-stabilization
Self-Stabilization: A distributed system S is self-stabilizing with
respect to some global predicate P if it satisfies the following
two properties:
1. Closure: P is closed under the execution of S. That is,
once P is established
in S, it cannot be falsified.
2. Convergence: Starting from an arbitrary global state, S is
guaranteed to
reach a global state satisfying P within a finite number of
state transitions
11
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Fault-tolerance by self-stabilization (cont.)
• self-stabilizing systems need not be initialized and they can
recover from transient failures (adaptive DSD is self-
stabilizing)
• self-stabilization is a different approach to fault-tolerance, it is
not based on countering the effects of failures but
concentrating on the ability to reach a consistent state
• problems are how to achieve a global property with local
actions and local knowledge, lack of theory on how to design
self-stabilizing algorithms and how to guarantee timeliness
12
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Examples - Time-Triggered Ethernet (FT Clock Synchronization)
- Self-Stabilization in the Time-Triggered Architecture|
- Traffic Policing
- System Architectures
13
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
TTE
1588
1588
Eth
TTE
TTE
TTE
Eth
TTE
TTE
TTE
TTE
TTE
TTE
TTE
Eth
Time Master
Time Master
Time Master
IN 1IN 1
IN 1IN 1
IN 1
IN 1
Fault-tolerant synchronization services
are needed for establishing a robust
global time base
14
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Failure Model for High-Integrity Components:
Inconsistent-Omission Faulty
COM MON
IN OUT
listen_out
intercept
listen_in
Core COM/MON Assumptions:
- COM and MON fail independently
- MON can intercept a faulty message produced
by the COM
- COM cannot produce a valid message such that
this message appears as two different messages
on listen_out and OUT; though it may be valid on
listen_out but detectable faulty on OUT or vice versa
- MON cannot itself generate a faulty message,
neither by inverting listen_out to an output, nor by
toggling the intercept signal
15
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Synchronization Services
Per
fect C
lock
Real Time
Co
mp
ute
r T
ime
Slow Clock
Fast Clock
R.int
Me
ssa
ge
Exch
an
ge
R.int
Me
ssa
ge
Exch
an
ge
Clock Synchronization Service
Startup/Restart Service
Clock Synchronization Service is executed
during normal operation mode to keep the
local clocks synchronized to each other.
Startup/Restart Service is executed to reach
an initial synchronization of the local clocks in
the system.
Integration/Reintegration Service is used for
components to join an already synchronized
system.
Clique Detection Services are used to detect
loss of synchronization and establishment of
disjoint sets of synchronized components.
16
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Clock Synchronization Algorithm
Algorithm Specification
17
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Step 1: SMs send messages to CMs
Compression
Master
Synchronization
Master 5
Synchronization
Master 4Synchronization
Master 3
Synchronization
Master 2
Synchronization
Master 1
IN 1
IN 2 IN 3IN 4
IN 5
SM1Dispatch
Permanence SM1 SM2
SM2
SM5
SM5
SM3
SM4
SM3
SM4
t_0 t_1,
t_2
t_4,
t_5
Acceptance Window
(of SM 2/5)
CM
CM
Reference Point
Precision
...
18
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Step 2: CMs send voted clock values back
Compression
Master
Synchronization
Master 5
Synchronization
Master 4Synchronization
Master 3
Synchronization
Master 2
Synchronization
Master 1
IN CIN CIN CIN CIN C
SM1Dispatch
Permanence SM1 SM2
SM2
SM5
SM5
SM3
SM4
SM3
SM4
t_0 t_1,
t_2
t_4,
t_5
Acceptance Window
(of SM 2/5)
CM
CM
Reference Point
Precision
...
19
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Step 2: Multiple Channels/CMs
Compression
Master 2
Synchronization
Master 3
Synchronization
Master 2
Compression
Master 1
... ...
Multiple Channels/CMs are required for fault-tolerance.
Synchronization Masters (SMs) receive synchronization messages from all non-faulty Compression Masters (CMs)
SMs use either the median or the arithmetic mean on the redundant messages from the CMs.
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Step 2: Faulty CMs send to some SMs
Compression
Master
Synchronization
Master 5
Synchronization
Master 4Synchronization
Master 3
Synchronization
Master 2
Synchronization
Master 1
IN CIN CIN C
SM1Dispatch
Permanence SM1 SM2
SM2
SM5
SM5
SM3
SM4
SM3
SM4
t_0 t_1,
t_2
t_4,
t_5
Acceptance Window
(of SM 2/5)
CM
CM
Reference Point
Precision
...
Only one Channel and one CM
are shown. Typically replicated.
21
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Clock Synchronization Two Failures Scenario
Even in the byzantine failure case, the non-faulty clocks
remain synchronized with known bounds. 22
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Examples - Time-Triggered Ethernet (FT Clock Synchronization)
- Self-Stabilization in the Time-Triggered Architecture
- Traffic Policing
- System Architectures
23
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved 24
Given a fault-tolerant system … … and some fault affecting
multiple components.
System state is inconsistent!
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Self-Stabilization (Dijkstra 1974)
Two properties:
• Closure
• Convergence
Closure: If a system is in a
legitimate state it will remain in
this state
Convergence: A system will
eventually reach a legitimate
state, from an arbitrary state
legitimate
illegitimate
25
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved Wilfried Steiner Real-Time Systems Group
Self-stabilization and fault-tolerance
Different failure detection
mechanism and failure
correction mechanisms for
different failures
Using a sequence of
algorithms to bring system
“nearer” to the safe state
26
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved 27
Time-Triggered Architecture (cont.)
Real-Time
- Time is split-up into (not necessarily equal) slots depending on message length
- Slots are grouped into TDMA rounds
- Each node has assigned exactly one slot in the TDMA round
- This assignment is equal for every TDMA round
- Actual Transmission Phase < Sending Slot
TMDA round
1 2 3 4 5 6 7 8
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved Wilfried Steiner Real-Time Systems Group
Phases of the TTP
1 A
A
A
A
A A
2 A
A
S
S
A A
3 S
S
S
S
A A
Coldstart sync. < 4 nodes sync. >= 4 nodes
loss of synchronous
system view
A ... asynchronous node
S ... synchronous node
3 Phases:
• coldstart,
• synchronous < 4 nodes, no guarantees for the system services, sync.
messages are broadcasted periodically
• synchronous >= 4 nodes (normal system operation), sync. messages are
broadcasted periodically
28
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Phases of the TTP/C (cont.)
Identifying a 4th phase of protocol execution:
• A sufficient number of nodes is synchronous
1 A
A
A
A
A A
2 A
A
S
S
A A
3 S
S
S
S
A A
cold-start sync. < 4 nodes sync. >= 4 nodes
loss of synchronous
system view
A ... asynchronous node
S ... synchronous node
4
sync. >=threshold
S
S
S
S
S S
29
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Cliques
Nodes that communicate with each other form a clique
In correct operation mode only one clique
Possibility of more cliques after multiple transient failures
Two types of multiple cliques operation:
• Benign: Synchronous operation
• Malign: Asynchronous operation
30
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved 31
Cliques (cont.)
Benign Cliques:
• Act still “slot-synchronously”
• accept and reject counters are
used to determine the amount
of nodes in the own clique
Malign Cliques:
• Multiple cliques act
“slot-asynchronously”
• Cliques do not “see” each other
Clique A sends in the IFGs of
Clique B and vice versa
IFG
Clique 1
1
3
5
76
4
2
8
TP
Clique 2
b)
1
2
3
45
6
7
8
TP
TP
IFG
Slot
a)
IFG
. . .
. . . ok!
malign!
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Self-Stabilization and the TTA (cont.)
Detect system misbehavior
using the Membership Vector:
• If >= (n/2)+1 nodes set in
the membership: ok!
• If < (n)/2+1 nodes set in
membership: restart!
Bring the system in a safe
state again by using the
startup algorithm and
reintegration of TTP
Am
ou
nt
of
no
de
s
Time
Threshold
Am
ou
nt
of
no
de
sTime
Threshold
32
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Examples - Time-Triggered Ethernet (FT Clock Synchronization)
- Self-Stabilization in the Time-Triggered Architecture
- Traffic Policing
- System Architectures
33
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Leaky-Bucket Traffic Policing
34
Switch/RouterReceiver
Sender
Rate-Constrained Traffic (RC)
min. duration min. duration min. duration
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Leaky-Bucket Traffic Policing (cont.)
• Token-Bucket / Leaky-Bucket algorithm are implemented to
control the behavior of rate-constrained traffic.
• In the case when a faulty end point attempts to send frames
too close back-to-back, then the token/leaky-bucket algorithm
will detect this behavior and drop the frame.
• Token/leaky-bucket algorithms may be expensive to
implement as they require to track the timing on a per
VL basis.
35
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Scheduled Traffic Policing
36
Transmission Sequence of a TT frame
1. Frame dispatch at t_dispatch
2. Frame start sending at t_send
3. Frame start receiving at t_receive
Temporal Correctness is checked via
Acceptance Window Test
• t_receive = t_send + l_link (l_link … link latency)
• t_acc = t_dispatch + l_link – Pi
(Pi … maximum distance of any two synchronized correct clocks in the system)
t_sendt_dispatch
t_receive
Time A
Acceptance Window
A
Pi Pi
max (d_shuffle)
t_sendt_dispatch
t_receive
Time D
t_acc
Time D
Time E
max (d_shuffle)
max (d_shuffle)
Link AD
Link DE
D EA
B C
H
G
F
D
E
Physical Topology Dataflow Path Virtual Link Maximum Clock
Difference
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Scheduled Traffic Policing (cont.)
• Scheduled traffic policing checks the correctness of a
received message with respect to a synchronized global
time.
• Scheduled traffic policing enforces minimum durations as
well as maximum durations in between two successive
messages of the same stream.
37
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Central Bus Guardian • Nodes and switches are synchronized to each other with a
precision Pi.
• In a cut-through setting the TDMA slot needs to account four
times the precision (4*Pi) as a safety margin.
• Only then it is guaranteed that traffic policing in the switch
does not truncate correct transmissions.
38
Transmission Window (Slot n)
Transmission Window (Slot n)
Physical Time
Fast Node
Guardian
Actual Transmission
Write Access Granted
Transmission Window (Slot n)
Slow Node
Maximum Clock
Difference
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Examples - Time-Triggered Ethernet (FT Clock Synchronization)
- Self-Stabilization in the Time-Triggered Architecture
- Traffic Policing
- System Architectures
39
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Distributed Integrated Modular
Avionics (Distributed IMA)
40
Non-Critical Subsystem
Synchronized Partitions
MODULE (N-1)MODULE 1
Partition 1
Task
1
Task
2
Task
3
Core Services VxWorks653 Time/Space Partitioning OS
Partition 2
Partition OS Partition OS
Task 1Task
2
Partition n
Partition OS
Task 1Task
2...
IMA OS Message Channel APITTE
Sync
Library
TTECOM Layer
ARINC 653
TTEthernet API Library
TTEthernet NIC Driver
TT
-tra
ffic
Ra
te-C
on
str
.
Be
st-
Effo
rt
TTEthernet NIC
Message Channels (Sampling/Queueing Ports)
Partition 1
Task
1
Task
2
Task
3
Core Services VxWorks653 Time/Space Partitioning OS
Partition 2
Partition OS Partition OS
Task 1Task
2
Partition n
Partition OS
Task 1Task
2...
IMA OS Message Channel APITTE
Sync
Library
TTECOM Layer
ARINC 653
TTEthernet API Library
TTEthernet NIC Driver
TT
-tra
ffic
Ra
te-C
on
str
.
Be
st-
Effo
rt
TTEthernet NIC
Message Channels (Sampling/Queueing Ports)
TTEthernet (SAE AS6802)
MODULE (N)
Partition 1
Task
1
Task
2
Task
3
Core Services VxWorks653 Time/Space Partitioning OS
Partition 2
Partition OS Partition OS
Task 1Task
2
Partition n
Partition OS
Task 1Task
2...
IMA OS Message Channel APITTE
COM Layer
ARINC 653
TTEthernet API Library
TTEthernet NIC Driver
TT
-tra
ffic
Ra
te-C
on
str
.
Be
st-
Effo
rt
TTEthernet NIC
Message Channels (Sampling/Queueing Ports)
Fault-Tolerant
Timebase
Module (N+1)
VxWorks 6.8
Task 1Task
2
Ethernet NIC
(ARINC664)
Module (N+2)
VxWorks 6.8
Task
1
Ethernet NIC
Module (N+3)
VxWorks 6.8
Task 1
Ethernet NIC
Module (N+4)
RT-Linux
Task 1Task
2
Ethernet NIC
Module (N+5)
Windows
Task 1Task
2
Ethernet NIC
Module (N+6)
Windows
Task 1Task
2
Ethernet NIC· Synchronous (TDM) & Asynchronous Comm.
· Distributed fault-tolerant timebase
· Robust TDMA partitioning
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Distributed Integrated Modular
Avionics (Distributed IMA) (cont.)
• The network is a distributed fault-tolerant embedded
computer and executes a set of partitions
• synchronous TDMA communication for Ethernet allows
integration of low-latency, low-jitter VLs in complex
networks
• „System-level partitioning“ closes the gap between
federated and integrated architectures
• The partitions can be mutually aligned and synchronized to
system time
41
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Automotive Integrated Safety
Platform
• zFAS, co-developed with TTTech, enabling Audi to integrate
a variety of innovative functions with multiple safety criticality
levels.
• zFAS uses numerous technology components from TTTech.
For example, the individual CPU cores are connected based
on Deterministic Ethernet communication
42
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Automotive Integrated Safety
Platform (cont.)
43
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Automotive Integrated Safety
Platform (cont.)
• Combines high-performance computing with functional safety
• Highly efficient deterministic SW integration
• Applications can be moved between embedded cores
• Supports various SoCs (System-on-a-Chip) and operating
systems
• Accelerated development due to PC-based co-simulation
44
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved 45
Fog
Fog
Fog Fog
Fog
Fog
Fog Computing for the Dependable
Internet of Things A New Infrastructure Layer
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Fog Computing for the Dependable
Internet of Things (cont.)
Fog computing is an architecture approach that provides non-
functional knowledge to enable dependability in the IoT.
46
Thing Fog Node
Determinism
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Fog Computing for the Dependable
Internet of Things (cont.)
47
+ + ?
Multiple Components Modularized & Cross-Industry
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
Fog Computing for the Dependable
Internet of Things (cont.)
48
Moore’s Law Alive and Well Heterogeneous Multi-Core
Core 0 Core 1
Hardware Hypervisor Support
RTOS
Control
Apps
virtual
machine
Rich OS
IT / Media
Apps
Secure OS
Security
Apps
virtual
machine
virtual
machine
Safe OS
Safety
Apps
virtual
machine
Thin Software Hypervisor
Hardware Supported Virtualization at Chip Level Possible
Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved
“Fog Node on Wheels”
49