ORTEGA: An Efficient and Flexible Online Fault Tolerance Architecture for Real-Time Control Systems

Xue Liu, Qixin Wang, Sathish Gopalakrishnan, Wenbo He, Lui Sha, Hui Ding, Kihwal Lee




Outline

• Motivation and related work
• ORTEGA goals
• ORTEGA architecture
• Details of ORTEGA designs
• Implementation and evaluation
• Demo


Motivations

Cyber-Physical Systems
• Real-world systems involve not only computer science but also knowledge from various other disciplines.
• Not only does the computer system become more complex; the complexity of the integrated system (i.e., the cyber-physical system) grows even faster.
• Major challenge: how can engineers of drastically different backgrounds collaborate with each other?


Motivations

Control Systems
• Conventional analog control systems
• Digital control systems

Computer Systems
• Real-time scheduling
• Fault tolerance
• Reliable/online software upgrade

We need to design a framework so that computer engineers and control engineers can easily collaborate and integrate their knowledge.

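Analog (continuous-time) control:

$$\dot{x} = Ax + Bu, \qquad u = Kx$$

Digital control with sampling period $h$:

$$x(kh + h) = e^{Ah} x(kh) + \int_0^h e^{As}\,ds\,B\,u(kh), \qquad u(kh) = Kx(kh)$$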


$$\dot{x} = Ax + Bu, \qquad u = Kx$$

$$x(k+1) = Fx(k) + Gu(k), \qquad u(k) = Kx(k)$$


Related work: Simplex architecture

Demand:
• Low-cost development of upgraded control systems for mission-critical control applications
• Instead of multi-versioning, develop just one version
• Focus on the control theories
• Runtime upgrade/testing of the single-version, possibly buggy new system

Applications:
• Aircraft control (F-16, Seto et al., 2000)
• Submarine control (NSSN, the new attack submarine program of the US Navy)


Simplex for real-time control

[Figure: Simplex architecture, with blocks for the simple high-assurance control subsystem (HAC), the complex high-performance control subsystem (HPC), the decision module, and the plant.]


Simplex for real-time control

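Given the LTI control system:

$$\dot{x} = Ax + Bu, \qquad \dot{x} = \bar{A}x = (A + BK)x$$

The above LTI control system is stable iff there exists a $P > 0$ such that the Lyapunov function $V(x) = x^T P x$ decreases along trajectories, i.e. for all $x \neq 0$:

$$x^T(\bar{A}^T P + P\bar{A})x < 0$$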


Simplex for real-time control

[Figure: state constraints enclosing the stability regions given by Lyapunov functions; the maximum stability region is the recovery region.]

We can choose a smaller solution ellipsoid (i.e., $x^T P x < x^T P_{\max} x$) to leave margins that guard against model, actuator, and measurement errors.


Drawbacks of Simplex

P1: Lack of efficiency
• The analytically redundant high-assurance controller (HAC) runs in parallel with the complex controller (HPC)
• Lowers system performance and increases operating costs
• Limits the application of Simplex to safety-critical domains only

P2: Lack of flexibility
• Enforces the same execution period on the HAC and the HPC
• In practice, different controllers may use different periods for different performance considerations
• For example: fast HAC recovery


Design goals of ORTEGA

On-demand Real-TimE GuArd (ORTEGA)
• A new, efficient fault tolerance software architecture designed for real-time control systems

More efficient resource usage (P1)
• Through on-demand real-time recovery

Flexible design (P2)
• Allows the HAC and the HPC to run at different rates
• Through new design and schedulability analysis

Applicable to a wider range of real-time control systems


ORTEGA Architecture


On-demand execution of HAC

• At any time, only one of the HAC or the HPC is running to control the plant.
• The decision module (DM) uses a mutex semaphore to control which of the HAC and the HPC is running:
    • When the HPC is running well, the HAC blocks on the semaphore.
    • Only when a fault is detected in the HPC does the DM release the semaphore to allow the HAC to take over.
• Decision logic is based on stability regions:
    • Determined through Linear Matrix Inequality theory
    • Details later
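The gating idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, a counting semaphore initialised to 0 stands in for the mutex semaphore mentioned on the slide, and the fault check is reduced to a boolean passed in by the caller.

```python
import threading


class DecisionModule:
    """Illustrative decision module; names are hypothetical, not ORTEGA's API."""

    def __init__(self):
        # The gate starts at 0, so the HAC thread blocks until the DM releases it.
        self.hac_gate = threading.Semaphore(0)
        self.active = "HPC"

    def report(self, hpc_ok: bool) -> None:
        """Called once per control period with the outcome of the fault check."""
        if not hpc_ok and self.active == "HPC":
            self.active = "HAC"
            self.hac_gate.release()  # wake the HAC exactly once


def hac_task(dm: DecisionModule, log: list) -> None:
    dm.hac_gate.acquire()  # blocked (no CPU use) while the HPC is healthy
    log.append("HAC running")


def demo() -> list:
    dm, log = DecisionModule(), []
    hac = threading.Thread(target=hac_task, args=(dm, log))
    hac.start()
    for hpc_ok in (True, True, False):  # the third check detects a fault
        log.append(f"check hpc_ok={hpc_ok} active={dm.active}")
        dm.report(hpc_ok)
    hac.join(timeout=1.0)
    return log
```

Because the HAC thread sits in `acquire()` until the release, it consumes no processor time during normal operation, which is the source of the savings discussed next.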


CPU savings of ORTEGA

HPC’s timing parameters: {Cp, Tp}; HAC’s timing parameters: {Ca, Ta};

Pr: the percentage of time for recovery (HAC) during a total time of T

• Total CPU resource usage under Simplex

• Total CPU resource usage under ORTEGA

• CPU resource usage savings:
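The usage expressions themselves do not appear in the text above. As a rough sketch, assuming that Simplex runs both controllers every period for the whole interval T while ORTEGA runs the HAC only during the recovery fraction Pr and the HPC otherwise, the accounting could look like the following. The function names and the sample numbers in the usage note are illustrative and are not taken from the slides.

```python
def simplex_usage(cp, tp, ca, ta, total_time):
    """CPU time when the HPC and the HAC both run every period (assumed model)."""
    return (cp / tp + ca / ta) * total_time


def ortega_usage(cp, tp, ca, ta, total_time, pr):
    """CPU time when only one controller runs at a time (assumed model).

    pr is the fraction of total_time spent in recovery, i.e. with the HAC running.
    """
    return ((1.0 - pr) * cp / tp + pr * ca / ta) * total_time


def saving_fraction(cp, tp, ca, ta, pr):
    """Relative CPU saving of ORTEGA over Simplex under the two models above."""
    simplex = simplex_usage(cp, tp, ca, ta, 1.0)
    ortega = ortega_usage(cp, tp, ca, ta, 1.0, pr)
    return (simplex - ortega) / simplex
```

For instance, with hypothetical parameters Cp = 2, Tp = 20, Ca = 1, Ta = 20 and no recovery time (Pr = 0), `saving_fraction(2, 20, 1, 20, 0.0)` evaluates to one third, because the HAC's share of the Simplex load disappears entirely.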


No Free Lunch: An extra period of delay

An extra delay of up to Ta is incurred due to the on-demand execution of the HAC.

[Figure: execution timelines under Simplex and under ORTEGA around the instant a fault is detected, showing HPC (p) and HAC (a) jobs and the first time the HAC's control output takes action (marked t1 and t2).]


Handle the extra delay by state projections

[Figure: stability region with the current state x(t) at decision time t and the projected state x(t+Ta).]

Resource usage reduction vs. extra delay:

(1) The extra delay causes disturbances only when a fault occurs (infrequent).

(2) But the gain in resource usage is large.
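One plausible reading of the projection step is sketched below. It assumes, beyond what the slide states, that the decision module propagates the current state over one HAC period Ta with a discrete model `x(t+Ta) = F x(t) + G u` (the current command `u` held constant) and switches as soon as the projected state would leave the ellipsoid `x^T P x <= 1`. All names and numbers are hypothetical.

```python
def mat_vec(m, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def quad_form(p, x):
    """Evaluate x^T P x."""
    return sum(xi * yi for xi, yi in zip(x, mat_vec(p, x)))


def project_state(f_ta, g_ta, x, u):
    """Assumed projection: x(t + Ta) = F(Ta) x(t) + G(Ta) u, with u held over Ta."""
    fx = mat_vec(f_ta, x)
    return [a + b * u for a, b in zip(fx, g_ta)]


def must_switch(f_ta, g_ta, p, x, u):
    """True if the projected state falls outside the region x^T P x <= 1."""
    return quad_form(p, project_state(f_ta, g_ta, x, u)) > 1.0
```

With a double-integrator model over Ta = 0.1 (`F = [[1, 0.1], [0, 1]]`, `G = [0.005, 0.1]`) and the unit disk as an illustrative region (`P` the identity), the state `[0.8, 0.5]` under `u = 1` is still inside the region now but is projected outside it, so the check asks for the switch one period early.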


Recovery region design

[Figure: state constraints enclosing the stability regions given by Lyapunov functions; the maximum stability region is the recovery region.]

• The decision module uses the recovery region to determine when to switch to the HAC.

• The recovery region is defined as the maximum region in which the HAC can make the plant stable.


Determine recovery region (1)

Digital controllers:

$$u(k) = Kx(k), \qquad x(k+1) = \bar{F}x(k) \quad (\bar{F} = F + GK) \qquad (*)$$

State constraints:

$$\alpha_m^T x \le 1, \quad m = 1, \ldots, q \qquad (1)$$

Stability region: The discrete LTI control system is stable iff there exists a $P > 0$ such that $\bar{F}^T P \bar{F} - P < 0$.


Stability region: The stability region of the system with respect to $P$ is defined as $\{x \mid x^T P x \le 1\}$.


Determine recovery region (2)

Theorem: Determining the maximum stability region of the digitally implemented closed-loop system with constraints (1) can be transformed into the following MAXDET (LMI) problem:

$$\begin{aligned}
\text{Maximize} \quad & \log\det P^{-1} && \text{(area of recovery region)} \\
\text{s.t.} \quad & P > 0 \\
& \bar{F}^T P \bar{F} - P < 0 && \text{(stability)} \\
& \alpha_m^T P^{-1} \alpha_m \le 1, \quad m = 1, \ldots, q && \text{(state constraints)}
\end{aligned}$$

[Figure: state constraints enclosing the stability regions given by Lyapunov functions; the maximum stability region is the recovery region.]
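Solving the MAXDET problem needs a semidefinite-programming solver, which is outside the scope of a short example. What can be shown compactly is how a candidate $P$ is checked against the three constraints and scored by the objective. The sketch below is restricted to 2x2 matrices so that positive definiteness can be tested with leading minors; the function names and the sample matrices in the usage note are illustrative only.

```python
import math


def mat_mul(a, b):
    """Product of two 2x2 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def is_pos_def(m):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return m[0][0] > 0 and det2(m) > 0


def check_candidate(f_bar, p, alphas):
    """Return log det P^{-1} if P is feasible, otherwise None."""
    if not is_pos_def(p):
        return None
    f_t = [list(col) for col in zip(*f_bar)]
    ftpf = mat_mul(mat_mul(f_t, p), f_bar)
    # F^T P F - P < 0 is equivalent to P - F^T P F being positive definite.
    gap = [[p[i][j] - ftpf[i][j] for j in range(2)] for i in range(2)]
    if not is_pos_def(gap):
        return None
    d = det2(p)
    p_inv = [[p[1][1] / d, -p[0][1] / d], [-p[1][0] / d, p[0][0] / d]]
    for a in alphas:
        pa = [p_inv[i][0] * a[0] + p_inv[i][1] * a[1] for i in range(2)]
        if a[0] * pa[0] + a[1] * pa[1] > 1.0:  # alpha^T P^{-1} alpha <= 1 violated
            return None
    return -math.log(d)
```

For a hypothetical closed loop `F_bar = 0.5 * I` and a single constraint `alpha = [1, 0]`, the candidate `P = 4 * I` is feasible, while `P = 0.5 * I` describes an ellipsoid that sticks out past the constraint and is rejected.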


Recovery region vs. control loop period

Stability Index A(T): Area of the maximum stability region

• It is a function of the control loop period T. The smaller the controller loop period, the larger the maximum stability region.

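Example: an inverted pendulum

System model (entry magnitudes as extracted; signs are not preserved in this copy):

$$\dot{x} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 2.7528 & 10.9526 & 0.0043 \\ 0 & 28.5812 & 24.9179 & 0.0441 \end{bmatrix} x + \begin{bmatrix} 0 \\ 0 \\ 1.9432 \\ 4.4385 \end{bmatrix} u$$

Controller:

$$u(k) = [5.7807,\ 42.2087,\ 14.0953,\ 8.6016]\, x(k)$$

[Figure: stability index (scale ×10⁻⁷, from 0.5 to 4) plotted against the control loop period (from 0 to 0.035).]

• The smaller the period, the larger the recovery region.
• ORTEGA allows a larger recovery region (more flexible).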


Implementation and evaluation

• Inverted pendulum from Quanser

• CPU: Pentium II 350MHz

• OS: Linux kernel 2.4.18-3 with RMS

• HAC: field tested state feedback controller

Evaluation of CPU savings

• If HAC and HPC both run at 50Hz, ORTEGA’s CPU saving is 29.29%

• If HAC runs at 50Hz, HPC runs at 20Hz, ORTEGA’s CPU saving is 50.87%


Evaluation of fault tolerance

• Infinite loop bug
• Non-performing bug
• Maximum control output bug
• Divide-by-zero bug
• Bang-bang type bug
• Positive feedback bug
• Tricky design bug
• …



Thank You

Q&A


Backup Slides


Simplex: software engineering economics: the more effort, the more reliable

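$$R(t) = \exp(-\lambda t) = \exp(-kCt/E)$$

$$R(E) = \exp(-kCt/E)$$

where $R(t)$ is the reliability (0 failures happened during $[0, t]$), $\lambda$ is the failure rate, $C$ is the complexity, and $E$ is the effort.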


Simplex: comparison with N-version

$$R(E) = R(E/3)^3 + 3R(E/3)^2\,(1 - R(E/3))$$


Simplex: roots in recovery block: only one version must be correct

$$R(E) = 1 - (1 - R(E/3))^3$$


Simplex: recovery block: dividing into more alternatives doesn't always gain.


Simplex: recovery block: reducing complexity gains

One alternative has complexity 1, 1/2, and 1/10


Simplex: a two-alternative recovery block with reduced complexity wins if a reliable acceptance test is possible.


Simplex: recovery block: reducing complexity gains

RB2: 2 alternatives, same complexity C=1, perfect acceptance test

*RB2L5: 2 alternatives, C1=1, C2=1/5, imperfect acceptance test whose reliability equals that of alternative 2


Schedulability analysis of ORTEGA


Mode-Change Problem Incurred by Recovery

Example: Suppose one plant is controlled by an HPC task $\tau_1^p$: $(C_1^p, T_1^p) = (3, 5)$ and an HAC task $\tau_1^a$: $(C_1^a, T_1^a) = (4, 10)$, alongside another real-time task $\tau_2$: $(C_2, T_2) = (6, 15)$.

Tasks become unschedulable due to the recovery:

• Before the recovery at t=10, $\{\tau_1^p, \tau_2\}$ = {(3,5), (6,15)} is schedulable;

• After the recovery transition, $\{\tau_1^a, \tau_2\}$ = {(4,10), (6,15)} is also schedulable;

• However, during the recovery transition, $\tau_2$ misses its deadline at t=15!

Mode change in fixed-priority scheduling is well recognized by the real-time community as a difficult problem.

[Figure: timelines over 0 to 20 for $\tau_1^p \rightarrow \tau_1^a$, i.e. (3,5) -> (4,10), and for $\tau_2$ (6,15), showing the mode change incurred by recovery at t=10 and the missed deadline of $\tau_2$ at t=15.]
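The deadline miss in this example can be reproduced with a small unit-step simulation of preemptive fixed-priority scheduling. The sketch assumes that the controller task always has priority over $\tau_2$ (consistent with rate-monotonic ordering, since both 5 and 10 are shorter than 15) and that the first HAC job is released at the switch instant; the function name and output format are made up for illustration.

```python
def simulate(horizon=15, switch_time=10):
    """Simulate tau_1 (HPC then HAC) and tau_2 in unit time steps.

    Returns the schedule as a string ('1' = controller task, '2' = tau_2,
    '-' = idle) and the work tau_2 still has left at the end of the horizon.
    """
    remaining1 = 0   # unfinished work of the current controller job
    remaining2 = 6   # tau_2 = (6, 15), released at t = 0, deadline t = 15
    timeline = []
    for t in range(horizon):
        if t < switch_time and t % 5 == 0:
            remaining1 = 3                      # HPC job: (C, T) = (3, 5)
        elif t >= switch_time and (t - switch_time) % 10 == 0:
            remaining1 = 4                      # HAC job: (C, T) = (4, 10)
        if remaining1 > 0:                      # controller has higher priority
            remaining1 -= 1
            timeline.append("1")
        elif remaining2 > 0:
            remaining2 -= 1
            timeline.append("2")
        else:
            timeline.append("-")
    return "".join(timeline), remaining2
```

With the recovery at t=10, `simulate()` leaves one unit of $\tau_2$ unfinished at t=15, which is the missed deadline. Calling `simulate(switch_time=100)` keeps the HPC running throughout, and $\tau_2$ completes exactly at its deadline.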


Schedulability Analysis

Schedulability analysis: We adopt the work by Real and Crespo (2004).

Idea: Analyze the transitional scheduling overhead incurred by the recovery.

(I) Schedulability analysis of the steady-state task set

(II) Schedulability analysis of old-mode tasks with transitional scheduling overhead (due to the mode change)

(III) Schedulability analysis of new-mode tasks with transitional scheduling overhead (due to the mode change)

[Figure: mode-change timeline for task $k$ switching from $\tau_k^p$ to $\tau_k^a$, annotated with the window $x$ and the workload $w_i(x)$.]


Fault Tolerance and Scheduling Co-design: one FT-enabled task case

Maximize the recovery region subject to the schedulability constraint.

Find the smallest (optimal) control loop period $T_k^{a*}$ such that the task set is schedulable under random recoveries.

Given the schedulability test, we can use a binary search algorithm to find $T_k^{a*}$.
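A minimal sketch of the binary search follows. It assumes integer periods and a schedulability predicate that is monotone in the period (unschedulable for short periods, schedulable from some point on). The predicate used here is the classic steady-state response-time iteration, chosen only as a stand-in; it is not the mode-change analysis of Real and Crespo that the slides rely on, and all names and task parameters are illustrative.

```python
def response_time(c, higher, deadline):
    """Response-time iteration for a task with cost c.

    `higher` lists the (C, T) pairs of higher-priority tasks. Returns the
    converged response time, or None once it exceeds the deadline.
    """
    r = c
    while r <= deadline:
        nxt = c + sum(-(-r // t) * ci for ci, t in higher)  # ceil(r / t) * C
        if nxt == r:
            return r
        r = nxt
    return None


def schedulable_with_hac_period(ta, ca=4, other=(6, 15)):
    """Stand-in test: the HAC (ca, ta) has priority over one other task."""
    c2, t2 = other
    if ca > ta:
        return False
    return response_time(c2, [(ca, ta)], t2) is not None


def smallest_period(lo, hi, ok):
    """Smallest integer T in [lo, hi] with ok(T) true, assuming ok is monotone."""
    if not ok(hi):
        return None
    while lo < hi:
        mid = (lo + hi) // 2
        if ok(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For the hypothetical task set above, `smallest_period(4, 15, schedulable_with_hac_period)` returns 7. A shorter HAC period would enlarge the recovery region but would push the other task past its deadline.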


P2: Recovery Region for Digital Controllers

Sampling time $h$, zero-order hold:

$$F(h) = e^{Ah}, \qquad G(h) = \int_0^h e^{As}\,ds\,B$$

Controller:

$$u(k) = Kx(k), \qquad x(k+1) = \bar{F}x(k) \quad (\bar{F} = F + GK)$$

Theorem (Lyapunov): A discrete-time LTI system as shown above is stable iff there exists a matrix $P > 0$ such that

$$\bar{F}^T P \bar{F} - P < 0.$$
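The zero-order-hold matrices can be approximated directly from the power series of the matrix exponential. The sketch below is a plain-Python illustration for a single-input system and small matrices; the function names and the number of series terms are arbitrary choices, and a numerical library routine would normally be preferred in practice.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def zoh(a, b, h, terms=20):
    """Return (F, G) with F = exp(A h) and G = (integral_0^h exp(A s) ds) B.

    Both are accumulated from the truncated power series of exp(A h);
    b is the input vector of a single-input system.
    """
    n = len(a)
    f = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    s = [[h if i == j else 0.0 for j in range(n)] for i in range(n)]
    power = [row[:] for row in f]   # A^k, starting from A^0 = I
    coef = 1.0                      # h^k / k!
    for k in range(1, terms):
        power = mat_mul(power, a)
        coef *= h / k
        for i in range(n):
            for j in range(n):
                f[i][j] += coef * power[i][j]
                s[i][j] += coef * h / (k + 1) * power[i][j]  # h^(k+1) / (k+1)!
    g = [sum(s[i][j] * b[j] for j in range(n)) for i in range(n)]
    return f, g
```

For a double integrator (`A = [[0, 1], [0, 0]]`, `B = [0, 1]`) sampled at `h = 0.02`, the series terminates after two terms and gives `F = [[1, h], [0, 1]]` and `G = [h^2 / 2, h]`, which makes the dependence of the closed-loop matrix on the period explicit.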


Stability Region (Continued)

The stability region of the system $x(k+1) = \bar{F}x(k)$ with respect to $P$ is defined as:

$$\{x \mid x^T P x \le 1\}$$

Stability Region with Constraints

State constraints: $a_i^T x \le 1, \quad i = 1, \ldots, l$

Control input constraints: $b_j^T u \le 1, \quad j = 1, \ldots, r$

These can be combined in the closed-loop system as

$$\alpha_m^T x \le 1, \quad m = 1, \ldots, q \qquad (1)$$

Lemma: The stability region defined above satisfies constraints (1) iff $\alpha_m^T P^{-1} \alpha_m \le 1$, $m = 1, \ldots, q$.