Dependable Computer Systems - Institute of Computer ......deterministic algorithm in the presence of even one crash ... Failure Model for High-Integrity Components: Inconsistent-Omission

Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved

Dependable Computer

Systems

Part 6b: System Aspects


Contents

• Synchronous vs. Asynchronous Systems

• Consensus

• Fault-tolerance by self-stabilization

• Examples

• Time-Triggered Ethernet (FT Clock Synchronization)

• Self-Stabilization in the Time-Triggered Architecture|

• Traffic Policing

• System Architectures

2


Synchronous vs. asynchronous

systems

3


• Synchronous processors:

A processor is said to be synchronous if it makes at least one

processing step during real-time steps (or if some other

computer in the system makes s processing steps).

• Bounded communication:

The communication delay is said to be bounded if any

message sent will arrive at its destination within real-time

steps (if no failure occurs).

• Synchronous system:

A system is said to be synchronous if its processors are

synchronous and the communication delay is bounded.

Real-time systems are per definition synchronous.

• Asynchronous systems:

A system is said to be asynchronous if either its processors

are asynchronous or the communication delay is unbounded

4


Consensus

5


Consensus

• Each processor starts a protocol with its local input value,

which is sent to all other processors in the group, fulfilling the

following properties:

• Consistency: All correct processors agree on the same

value and all decisions are final.

• Non-triviality: The agreed-upon input value must have

been some processors input (or is a function of the

individual input values).

• Termination: Each correct processor decides on a value

within a finite time interval.

6


Consensus (cont.)

• The consensus problem under the assumption of byzantine

failures was first defined in 1980 in the context of the SIFT

project which was aimed at building a computer system with

ultra-high dependability. Other names are

• byzantine agreement or byzantine general problem

• interactive consistency

7


Impossibility of deterministic

consensus in asynch. systems

• asynchronous systems cannot achieve consensus by a

deterministic algorithm in the presence of even one crash

failure of a processor

• it is impossible to differentiate between a late response and a

processor crash

• by using coin flips, probabilistic consensus protocols can

achieve consensus in a constant expected number of rounds

• failure detectors which suspect late processors to be crashed

can also be used to achieve consensus in asynchronous

systems

8


Impossibility of deterministic

consensus in asynch. systems (cont.)

n 3t + 1 processors are necessary to tolerate t failures

9

0 1

Maj(0, 0, 1) = 0 Maj(0, 1, 1) = 1

Round 1:

0

0

1

1

F

0 1

Maj(0, 0, 1) = 0 Maj(0, 1, 1) = 1

Round 2:

0

0

1

1

F


Fault-tolerance by self-stabilization

10


Fault-tolerance by self-stabilization

Self-Stabilization: A distributed system S is self-stabilizing with

respect to some global predicate P if it satisfies the following

two properties:

1. Closure: P is closed under the execution of S. That is,

once P is established

in S, it cannot be falsified.

2. Convergence: Starting from an arbitrary global state, S is

guaranteed to

reach a global state satisfying P within a finite number of

state transitions

11


Fault-tolerance by self-stabilization (cont.)

• self-stabilizing systems need not be initialized and they can

recover from transient failures (adaptive DSD is self-

stabilizing)

• self-stabilization is a different approach to fault-tolerance, it is

not based on countering the effects of failures but

concentrating on the ability to reach a consistent state

• problems are how to achieve a global property with local

actions and local knowledge, lack of theory on how to design

self-stabilizing algorithms and how to guarantee timeliness

12


Examples - Time-Triggered Ethernet (FT Clock Synchronization)

- Self-Stabilization in the Time-Triggered Architecture|

- Traffic Policing

- System Architectures

13


TTE

1588

1588

Eth

TTE

TTE

TTE

Eth

TTE

TTE

TTE

TTE

TTE

TTE

TTE

Eth

Time Master

Time Master

Time Master

IN 1IN 1

IN 1IN 1

IN 1

IN 1

Fault-tolerant synchronization services

are needed for establishing a robust

global time base

14


Failure Model for High-Integrity Components:

Inconsistent-Omission Faulty

COM MON

IN OUT

listen_out

intercept

listen_in

Core COM/MON Assumptions:

- COM and MON fail independently

- MON can intercept a faulty message produced

by the COM

- COM cannot produce a valid message such that

this message appears as two different messages

on listen_out and OUT; though it may be valid on

listen_out but detectable faulty on OUT or vice versa

- MON cannot itself generate a faulty message,

neither by inverting listen_out to an output, nor by

toggling the intercept signal

15


Synchronization Services

Per

fect C

lock

Real Time

Co

mp

ute

r T

ime

Slow Clock

Fast Clock

R.int

Me

ssa

ge

Exch

an

ge

R.int

Me

ssa

ge

Exch

an

ge

Clock Synchronization Service

Startup/Restart Service

Clock Synchronization Service is executed

during normal operation mode to keep the

local clocks synchronized to each other.

Startup/Restart Service is executed to reach

an initial synchronization of the local clocks in

the system.

Integration/Reintegration Service is used for

components to join an already synchronized

system.

Clique Detection Services are used to detect

loss of synchronization and establishment of

disjoint sets of synchronized components.

16


Clock Synchronization Algorithm

Algorithm Specification

17


Step 1: SMs send messages to CMs

Compression

Master

Synchronization

Master 5

Synchronization

Master 4Synchronization

Master 3

Synchronization

Master 2

Synchronization

Master 1

IN 1

IN 2 IN 3IN 4

IN 5

SM1Dispatch

Permanence SM1 SM2

SM2

SM5

SM5

SM3

SM4

SM3

SM4

t_0 t_1,

t_2

t_4,

t_5

Acceptance Window

(of SM 2/5)

CM

CM

Reference Point

Precision

...

18


Step 2: CMs send voted clock values back

Compression

Master

Synchronization

Master 5

Synchronization


Master 3

Synchronization

Master 2

Synchronization

Master 1

IN CIN CIN CIN CIN C

SM1Dispatch

Permanence SM1 SM2

SM2

SM5

SM5

SM3

SM4

SM3

SM4

t_0 t_1,

t_2

t_4,

t_5

Acceptance Window

(of SM 2/5)

CM

CM

Reference Point

Precision

...

19


Step 2: Multiple Channels/CMs

Compression

Master 2

Synchronization

Master 3

Synchronization

Master 2

Compression

Master 1

... ...

Multiple Channels/CMs are required for fault-tolerance.

Synchronization Masters (SMs) receive synchronization messages from all non-faulty Compression Masters (CMs)

SMs use either the median or the arithmetic mean on the redundant messages from the CMs.


Step 2: Faulty CMs send to some SMs

Compression

Master

Synchronization

Master 5

Synchronization


Master 3

Synchronization

Master 2

Synchronization

Master 1

IN CIN CIN C

SM1Dispatch

Permanence SM1 SM2

SM2

SM5

SM5

SM3

SM4

SM3

SM4

t_0 t_1,

t_2

t_4,

t_5

Acceptance Window

(of SM 2/5)

CM

CM

Reference Point

Precision

...

Only one Channel and one CM

are shown. Typically replicated.

21


Clock Synchronization Two Failures Scenario

Even in the byzantine failure case, the non-faulty clocks

remain synchronized with known bounds. 22



- Self-Stabilization in the Time-Triggered Architecture

- Traffic Policing


23

Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved 24

Given a fault-tolerant system … … and some fault affecting

multiple components.

System state is inconsistent!


Self-Stabilization (Dijkstra 1974)

Two properties:

• Closure

• Convergence

Closure: If a system is in a

legitimate state it will remain in

this state

Convergence: A system will

eventually reach a legitimate

state, from an arbitrary state

legitimate

illegitimate

25

Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved Wilfried Steiner Real-Time Systems Group

Self-stabilization and fault-tolerance

Different failure detection

mechanism and failure

correction mechanisms for

different failures

Using a sequence of

algorithms to bring system

“nearer” to the safe state

26


Time-Triggered Architecture (cont.)

Real-Time

- Time is split-up into (not necessarily equal) slots depending on message length

- Slots are grouped into TDMA rounds

- Each node has assigned exactly one slot in the TDMA round

- This assignment is equal for every TDMA round

- Actual Transmission Phase < Sending Slot

TMDA round

1 2 3 4 5 6 7 8

Course: Dependable Computer Systems 2016, © Stefan Poledna, All rights reserved Wilfried Steiner Real-Time Systems Group

Phases of the TTP

1 A

A

A

A

A A

2 A

A

S

S

A A

3 S

S

S

S

A A

Coldstart sync. < 4 nodes sync. >= 4 nodes

loss of synchronous

system view

A ... asynchronous node

S ... synchronous node

3 Phases:

• coldstart,

• synchronous < 4 nodes, no guarantees for the system services, sync.

messages are broadcasted periodically

• synchronous >= 4 nodes (normal system operation), sync. messages are

broadcasted periodically

28


Phases of the TTP/C (cont.)

Identifying a 4th phase of protocol execution:

• A sufficient number of nodes is synchronous

1 A

A

A

A

A A

2 A

A

S

S

A A

3 S

S

S

S

A A

cold-start sync. < 4 nodes sync. >= 4 nodes

loss of synchronous

system view

A ... asynchronous node

S ... synchronous node

4

sync. >=threshold

S

S

S

S

S S

29


Cliques

Nodes that communicate with each other form a clique

In correct operation mode only one clique

Possibility of more cliques after multiple transient failures

Two types of multiple cliques operation:

• Benign: Synchronous operation

• Malign: Asynchronous operation

30


Cliques (cont.)

Benign Cliques:

• Act still “slot-synchronously”

• accept and reject counters are

used to determine the amount

of nodes in the own clique

Malign Cliques:

• Multiple cliques act

“slot-asynchronously”

• Cliques do not “see” each other

Clique A sends in the IFGs of

Clique B and vice versa

IFG

Clique 1

1

3

5

76

4

2

8

TP

Clique 2

b)

1

2

3

45

6

7

8

TP

TP

IFG

Slot

a)

IFG

. . .

. . . ok!

malign!


Self-Stabilization and the TTA (cont.)

Detect system misbehavior

using the Membership Vector:

• If >= (n/2)+1 nodes set in

the membership: ok!

• If < (n)/2+1 nodes set in

membership: restart!

Bring the system in a safe

state again by using the

startup algorithm and

reintegration of TTP

Am

ou

nt

of

no

de

s

Time

Threshold

Am

ou

nt

of

no

de

sTime

Threshold

32




- Traffic Policing


33


Leaky-Bucket Traffic Policing

34

Switch/RouterReceiver

Sender

Rate-Constrained Traffic (RC)

min. duration min. duration min. duration


Leaky-Bucket Traffic Policing (cont.)

• Token-Bucket / Leaky-Bucket algorithm are implemented to

control the behavior of rate-constrained traffic.

• In the case when a faulty end point attempts to send frames

too close back-to-back, then the token/leaky-bucket algorithm

will detect this behavior and drop the frame.

• Token/leaky-bucket algorithms may be expensive to

implement as they require to track the timing on a per

VL basis.

35


Scheduled Traffic Policing

36

Transmission Sequence of a TT frame

1. Frame dispatch at t_dispatch

2. Frame start sending at t_send

3. Frame start receiving at t_receive

Temporal Correctness is checked via

Acceptance Window Test

• t_receive = t_send + l_link (l_link … link latency)

• t_acc = t_dispatch + l_link – Pi

(Pi … maximum distance of any two synchronized correct clocks in the system)

t_sendt_dispatch

t_receive

Time A

Acceptance Window

A

Pi Pi

max (d_shuffle)

t_sendt_dispatch

t_receive

Time D

t_acc

Time D

Time E

max (d_shuffle)

max (d_shuffle)

Link AD

Link DE

D EA

B C

H

G

F

D

E

Physical Topology Dataflow Path Virtual Link Maximum Clock

Difference


Scheduled Traffic Policing (cont.)

• Scheduled traffic policing checks the correctness of a

received message with respect to a synchronized global

time.

• Scheduled traffic policing enforces minimum durations as

well as maximum durations in between two successive

messages of the same stream.

37


Central Bus Guardian • Nodes and switches are synchronized to each other with a

precision Pi.

• In a cut-through setting the TDMA slot needs to account four

times the precision (4*Pi) as a safety margin.

• Only then it is guaranteed that traffic policing in the switch

does not truncate correct transmissions.

38

Transmission Window (Slot n)


Physical Time

Fast Node

Guardian

Actual Transmission

Write Access Granted


Slow Node

Maximum Clock

Difference




- Traffic Policing


39


Distributed Integrated Modular

Avionics (Distributed IMA)

40

Non-Critical Subsystem

Synchronized Partitions

MODULE (N-1)MODULE 1

Partition 1

Task

1

Task

2

Task

3

Core Services VxWorks653 Time/Space Partitioning OS

Partition 2

Partition OS Partition OS

Task 1Task

2

Partition n

Partition OS

Task 1Task

2...

IMA OS Message Channel APITTE

Sync

Library

TTECOM Layer

ARINC 653

TTEthernet API Library

TTEthernet NIC Driver

TT

-tra

ffic

Ra

te-C

on

str

.

Be

st-

Effo

rt

TTEthernet NIC

Message Channels (Sampling/Queueing Ports)

Partition 1

Task

1

Task

2

Task

3


Partition 2


Task 1Task

2

Partition n

Partition OS

Task 1Task

2...


Sync

Library

TTECOM Layer

ARINC 653



TT

-tra

ffic

Ra

te-C

on

str

.

Be

st-

Effo

rt

TTEthernet NIC


TTEthernet (SAE AS6802)

MODULE (N)

Partition 1

Task

1

Task

2

Task

3


Partition 2


Task 1Task

2

Partition n

Partition OS

Task 1Task

2...


COM Layer

ARINC 653



TT

-tra

ffic

Ra

te-C

on

str

.

Be

st-

Effo

rt

TTEthernet NIC


Fault-Tolerant

Timebase

Module (N+1)

VxWorks 6.8

Task 1Task

2

Ethernet NIC

(ARINC664)

Module (N+2)

VxWorks 6.8

Task

1

Ethernet NIC

Module (N+3)

VxWorks 6.8

Task 1

Ethernet NIC

Module (N+4)

RT-Linux

Task 1Task

2

Ethernet NIC

Module (N+5)

Windows

Task 1Task

2

Ethernet NIC

Module (N+6)

Windows

Task 1Task

2

Ethernet NIC· Synchronous (TDM) & Asynchronous Comm.

· Distributed fault-tolerant timebase

· Robust TDMA partitioning


Distributed Integrated Modular

Avionics (Distributed IMA) (cont.)

• The network is a distributed fault-tolerant embedded

computer and executes a set of partitions

• synchronous TDMA communication for Ethernet allows

integration of low-latency, low-jitter VLs in complex

networks

• „System-level partitioning“ closes the gap between

federated and integrated architectures

• The partitions can be mutually aligned and synchronized to

system time

41


Automotive Integrated Safety

Platform

• zFAS, co-developed with TTTech, enabling Audi to integrate

a variety of innovative functions with multiple safety criticality

levels.

• zFAS uses numerous technology components from TTTech.

For example, the individual CPU cores are connected based

on Deterministic Ethernet communication

42



Platform (cont.)

43



Platform (cont.)

• Combines high-performance computing with functional safety

• Highly efficient deterministic SW integration

• Applications can be moved between embedded cores

• Supports various SoCs (System-on-a-Chip) and operating

systems

• Accelerated development due to PC-based co-simulation

44


Fog

Fog

Fog Fog

Fog

Fog

Fog Computing for the Dependable

Internet of Things A New Infrastructure Layer



Internet of Things (cont.)

Fog computing is an architecture approach that provides non-

functional knowledge to enable dependability in the IoT.

46

Thing Fog Node

Determinism




47

+ + ?

Multiple Components Modularized & Cross-Industry




48

Moore’s Law Alive and Well Heterogeneous Multi-Core

Core 0 Core 1

Hardware Hypervisor Support

RTOS

Control

Apps

virtual

machine

Rich OS

IT / Media

Apps

Secure OS

Security

Apps

virtual

machine

virtual

machine

Safe OS

Safety

Apps

virtual

machine

Thin Software Hypervisor

Hardware Supported Virtualization at Chip Level Possible


“Fog Node on Wheels”

49

Documents

Dependable Computer Systems - Institute of Computer ......deterministic algorithm in the presence of even one crash ... Failure Model for High-Integrity Components: Inconsistent-Omission