Resilience in Cyber-Physical Systems:
Challenges and Opportunities
Gabor Karsai
Institute for Software-Integrated Systems
Vanderbilt University
SERENE 2014 – Autumn School
Acknowledgements
People: Janos Sztipanovits, Daniel Balasubramanian,
Abhishek Dubey, Tihamer Levendovszky, Nag Mahadevan,
and many others at the Institute for Software-Integrated
Systems @ Vanderbilt University
Sponsors: AFRL, DARPA, NASA, NSF through various
programs
Outline
Introduction
Cyber-physical Systems
Resilience
Building resilient CPS
System-level fault diagnostics
Software health management
Resilient architectures and autonomy
Conclusions
Cyber-Physical Systems
What is a Cyber-Physical System?
An engineered system that integrates physical and cyber
components where relevant functions are realized
through the interactions between the physical and cyber
parts.
Physical = some tangible, physical device + environment
Cyber = computational + communicational
Courtesy of Kuka Robotics Corp.
Cyber-Physical Systems (CPS): Integrating networked computational
resources with physical systems
Courtesy of Doug Schmidt
Power generation and distribution
Courtesy of General Electric
Military systems:
E-Corner, Siemens
Transportation (Air traffic control at SFO) Avionics
Telecommunications
Factory automation
Instrumentation (Soleil Synchrotron)
Daimler-Chrysler
Automotive
Building Systems
Courtesy of Ed Lee, UCB
CPS Examples
CPS Challenge Problem: Prevent This
A Typical Cyber-Physical System
Printing Press
• Application aspects
• local (control)
• distributed (coordination)
• global (modes)
• Ethernet network
• Synchronous, Time-Triggered
• IEEE 1588 time-sync protocol
• High-speed, high precision
• Speed: 1 inch/ms (~100km/hr)
• Precision: 0.01 inch
-> Time accuracy: 10 µs (Bosch-Rexroth)
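The 10 µs figure follows directly from the stated speed and precision; a quick sketch of the arithmetic (variable names are ours):

```python
# Web speed: 1 inch/ms; required print precision: 0.01 inch.
# The time the web takes to travel one precision unit bounds the
# allowable clock disagreement between the distributed controllers.
speed_inch_per_ms = 1.0            # 1 inch/ms (~91 km/h, i.e. ~100 km/hr)
precision_inch = 0.01

accuracy_ms = precision_inch / speed_inch_per_ms   # 0.01 ms
accuracy_us = accuracy_ms * 1000                   # 10 us
print(accuracy_us)
```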
Courtesy of Ed Lee, UCB
Source: http://offsetpressman.blogspot.com/2011/03/how-flying-paster-works.html
Example – Flying Paster
Courtesy of Ed Lee, UCB
[Figure: flying paster mechanism. Labels: sensor (top dead center), active paper feed, paper cutter, idle rollers, flying paster, drive rollers, dancer.]
Source: http://offsetpressman.blogspot.com/2011/03/how-flying-paster-works.html
Flying Paster
Courtesy of Ed Lee, UCB
Example: Medical Devices Emerging direction: Cell phone based medical devices for affordable healthcare e.g. “Telemicroscopy” project at Berkeley
e.g. Cell-phone based blood testing device developed at UCLA
Courtesy of Ed Lee, UCB
Example: Toyota autonomous vehicle technology roadmap, c. 2007
Source: Toyota Web site
DARPA Robotics Challenge
http://www.theroboticschallenge.org/
The Good News…
Rich time models
Precise interactions across highly extended spatial/temporal dimension
Flexible, dynamic communication mechanisms
Precise time-variant, nonlinear behavior
Introspection, learning, reasoning
Elaborate coordination of physical processes
Hugely increased system size with controllable, stable behavior
Dynamic, adaptive architectures
Adaptive, autonomic systems
Self monitoring, self-healing system architectures and better safety/security guarantees.
Computing/Communication Integrated CPS
Networking and computing delivers unique precision and flexibility in
interaction and coordination
…and the Challenges
Cyber vulnerability
New types of interactions across highly extended spatial/temporal dimensions
Flexible, dynamic communication mechanisms
Precise time-variant, nonlinear behavior
Introspection, learning, reasoning
Physical behavior of systems can be manipulated
Lack of composition theories for heterogeneous systems: many unsolved problems
Vastly increased complexity and emergent behaviors
Lack of theoretical foundations for CPS dynamics
Verification, certification, and predictability face fundamentally new challenges.
Computing/Communication Integrated CPS
Fusing networking and computing with physical processes brings new
problems
Abstraction layers allow the verification of different properties.
Key Idea: Manage design complexity by creating abstraction
layers in the design flow.
Abstraction layers define
platforms.
Physical Platform
Software Platform
Computation/Communication Platform
Abstractions are linked
through mapping.
Claire Tomlin, UC Berkeley
Example for a CPS Approach
Software models
Real-time system models
implementation correctness:
timing analysis (P)
Sifakis et al: “Building Models of Real-Time
Systems from Application Software,”
Proceedings of the IEEE Vol. 91, No. 1. pp.
100-111, January 2003
E(f(in), f_R(out))
f : T_In → 2^T_Out
f_R : T_In × R → 2^(T_Out × R)
E(f, f_R, P)
• f : reactive program. Program execution creates a mapping between logical-time inputs and outputs.
• f_R : real-time system. Programs are packaged into interacting components. A scheduler controls access to computational and communication resources according to the time constraints P.
In CPS, essential system properties
such as stability, safety,
performance are expressed in
terms of physical behavior
Abstraction layers and models:
Real-time Software
Physical models
p : T_In × R → 2^(T_Out × R)
f_R : T_In × R → 2^(T_Out × R)
Software models
Real-time system models
implementation correctness:
timing analysis (P)
E(f(in), f_R(out))
f : T_In → 2^T_Out
f_R : T_In × R → 2^(T_Out × R)
E(f, f_R, P)
Re-defined Goals:
• Compositional verification of
essential dynamic properties
− stability
− safety
• Derive dynamics offering
robustness against
implementation changes,
uncertainties caused by faults
and cyber attacks
− fault/intrusion induced reconfiguration of SW/HW
− network uncertainties (packet drops, delays)
• Decreased verification
complexity
Abstraction layers and models:
Cyber-Physical Systems
Why is CPS Hard?
package org.apache.tomcat.session;
import org.apache.tomcat.core.*;
import org.apache.tomcat.util.StringManager;
import java.io.*;
import java.net.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;
/**
* Core implementation of a server session
*
* @author James Duncan Davidson [[email protected]]
* @author James Todd [[email protected]]
*/
public class ServerSession {
private StringManager sm =
StringManager.getManager("org.apache.tomcat.session");
private Hashtable values = new Hashtable();
private Hashtable appSessions = new Hashtable();
private String id;
private long creationTime = System.currentTimeMillis();
private long thisAccessTime = creationTime;
private long lastAccessed = creationTime;
private int inactiveInterval = -1;
ServerSession(String id) {
this.id = id;
}
public String getId() {
return id;
}
public long getCreationTime() {
return creationTime;
}
public long getLastAccessedTime() {
return lastAccessed;
}
public ApplicationSession getApplicationSession(Context context,
boolean create) {
ApplicationSession appSession =
(ApplicationSession)appSessions.get(context);
if (appSession == null && create) {
// XXX
// sync to ensure valid?
appSession = new ApplicationSession(id, this, context);
appSessions.put(context, appSession);
}
// XXX
// make sure that we haven't gone over the end of our
// inactive interval -- if so, invalidate and create
// a new appSession
return appSession;
}
void removeApplicationSession(Context context) {
appSessions.remove(context);
}
/**
* Called by context when request comes in so that accesses and
* inactivities can be dealt with accordingly.
*/
void accessed() {
// set last accessed to thisAccessTime as it will be left over
// from the previous access
lastAccessed = thisAccessTime;
thisAccessTime = System.currentTimeMillis();
}
void validate()
Software Control Systems
Crosses Interdisciplinary Boundaries
• Disciplinary boundaries need to be realigned
• New fundamentals need to be created
• New technologies and tools need to be developed
• Education needs to be restructured
Resilience
Cyber-Physical Systems:
Software Intensive Systems
Embedded software ….
is a crucial ingredient in modern
systems
is the ‘universal system integrator’
could exhibit faults that lead to
system failures
complexity has progressed to the
point that zero-defect systems
(containing both hardware and
software) are very difficult to build
need to evolve while in operation
The challenge is to build software intensive systems that
anticipate change: uncertain environments, faults, updates, and
exhibit resilience: they survive and adapt to changes, while
being dependably functional.
Resilience
Webster:
Capable of withstanding shock without permanent deformation or
rupture
Tending to recover from or adjust easily to misfortune or change.
Technical:
The persistence of the avoidance of failures that are unacceptably
frequent or severe, when facing changes. [Laprie, ‘04]
A resilient system is trusted and effective out of the box in a wide
range of contexts, and easily adapted to many others through
reconfiguration or replacement. [R. Neches, OSD]
A resilient system detects anomalies in itself, diagnoses
their causes, and is able to recover lost functionality.
Research issues
•Model-driven engineering of Resilient Software Systems
•Design-time + Run-time aspects
•Resilience to: (1) faults, (2) environmental changes, (3) updates
•Target system category: distributed, real-time, embedded systems
Objective: Model-based
engineering approach
and tools to build
verifiably resilient
systems
The definition…
Building Resilient Cyber-Physical
Systems
Cyber-Physical Systems
Layers and Interactions
[Figure: the physical layer contains physical and cyber-physical objects with physical interactions; computational objects interact on computational and communication platforms in the platform layer; layers are linked by refinement/compilation downward and abstraction upward, and implementations map software onto platforms. Faults can arise in any layer.]
Cyber-Physical Systems
Faults and resilience
In CPS faults can appear in (and cascade to) any place
Physical system
Hardware (computing and communication) system
Software (application and platform) system
In CPS physical and cyber elements are integrated
Many interaction pathways: P2P, P2C, C2C, P2C2P, C2P2P2C
Many modeling paradigms for physical systems
Consider engineering or physics!
Heterogeneous models need to be integrated for detailed analysis
In CPS recovery can take many forms
Physical action
Cyber restart
Software adaptation
CPS and Model-based Design
Design of CPS layers via MDE
Software models
Platform models
Physical models
Challenge: How to integrate the models so that cross-domain
interactions can be understood and managed?
A Strategy for Resilient CPS
Overall scheme:
Faults can originate in any layer of a hierarchy, in any component
Anomalies caused by the fault can be detected in the same or a higher layer
Based on anomalies a fault source isolation (diagnosis) is performed. The diagnosis result may be reported to a higher layer, depending on the nature of the fault.
The fault is locally mitigated first, but when that mitigation fails the higher layer is informed about the anomaly, the diagnosed fault, and the mitigation action taken.
High-level view: Fault management is a control problem. Faults are disturbances in the system whose effects prevent the system from providing the required service/s
Anomalies are the sensory inputs, mitigation actions are the actuators of the fault management system
Fault mitigation must happen by considering (1) the current functional goals and (2) the actual state of the system, on the right level of abstraction
A Strategy for Resilient CPS
Layered fault management
Concepts:
1. Faults propagate to neighboring layers via
guaranteed behaviors
2. Each layer includes pro-active and reactive fault
management mechanisms
Each layer provides a ‘fault reporting’ and
‘fault management’ interface
Fault management services are built into the
‘middleware’:
Temporal/spatial Isolation
Fault Tolerant Clock Sync
Time-triggered Communications
Group Communication and Transactions
Fault-tolerant Resource Sharing
Component/Service Migration
Primary/Backup
Replication
Autonomous Failure Management
Safe Dynamic Composition of Components
System-level Fault Diagnostics
The need for resilience
In complex systems even simple
failures lead to complex cascades of
events that are difficult to understand
and manage.
How to
•detect and isolate faults?
•react to faults to mitigate their effect?
FACT: A model-driven toolsuite for system-level diagnostics
Run-time Platform (RTOS)
Visual modeling tool for creating:
•System architecture models
•Timed failure propagation graph models
Modular run-time environment contains:
•Monitors detect anomalies in sensor data and track mode changes
•TFPG Diagnostics Engine performs diagnosis and isolates the source(s) of observed anomalies
•Reports are generated for operators and maintainers
Modules can be used standalone on an embedded target processor with an RTOS
[Figure: monitors feed the TFPG diagnostics engine, which reports to the operator and maintainer.]
Modeling Language
Temporal Failure Propagation Graphs
•Failure modes
•Discrepancies
•Monitors/Alarms
•Propagation links with:
•Time delay
•Mode
Fault model:
Known physical failure modes whose functional effects (discrepancies) are
monitored.
Diagnostic problem:
Given a set of active monitors and their temporal activation sequence,
which failure mode(s) explain the observations?
A causal network-like model describing how component failure effects propagate across the system, activating monitors.
Failure propagation links
and monitors could be
mode-dependent.
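The graph structure described above can be sketched as a small data model; class and field names are illustrative only, not the FACT toolsuite's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                       # "FM" (failure mode) or "D" (discrepancy)
    monitored: bool = False         # monitored discrepancies raise alarms
    and_node: bool = False          # AND discrepancy: all parents must propagate

@dataclass
class Edge:
    src: str
    dst: str
    tmin: float = 0.0               # earliest propagation delay
    tmax: float = float("inf")      # latest propagation delay
    modes: frozenset = frozenset()  # empty = active in every mode

@dataclass
class TFPG:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, n: Node):
        self.nodes[n.name] = n

    def link(self, src, dst, tmin=0.0, tmax=float("inf"), modes=()):
        self.edges.append(Edge(src, dst, tmin, tmax, frozenset(modes)))

    def active_edges(self, mode):
        """Edges whose propagation is enabled in the given system mode."""
        return [e for e in self.edges if not e.modes or mode in e.modes]

# Tiny example: a pump failure mode propagates to a monitored
# low-pressure discrepancy within 1..5 s, but only in mode "RUN".
g = TFPG()
g.add_node(Node("FM_pump", "FM"))
g.add_node(Node("D_low_pressure", "D", monitored=True))
g.link("FM_pump", "D_low_pressure", tmin=1.0, tmax=5.0, modes=("RUN",))
```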
Modeling Language
Temporal Failure Propagation Graphs
Modeling variants
•Untimed, causal network (no modes, propagation = [0..inf])
•Modal networks: edges are mode dependent
•Timed models
•Hierarchical component models
Nodes:
•Failure modes
•Discrepancies
• AND/OR
• Monitored (option)
Edges:
•Propagation delay: [min, max]
•Discrete Modes (activation)
Example models (#components,#failuremodes,#alarms)
•Trivial examples
•Simplified fuel system (~30,~80,~100)
•Realistic fuel system (~200,~400,~600)
•Aircraft avionics (~2000,~8000,~25000) – generated
TFPG Example
Example
Not shown:
- Timing on propagation links
- Components/hierarchy
- Modal propagation
TFPG captures cause-effect
relationships that can be modal
and temporal. Effects may be
cumulative and/or monitored.
Timed Failure Propagation Graphs
Causal models that describe the system behavior in presence of faults.
Model is a labeled directed graph where Nodes represent either failure modes or discrepancies
Edges between nodes in the graph represent causality
Edges are attributed with timing and mode constraints on failure propagation.
A discrepancy can be either monitored or unmonitored.
The monitor detects a sensory manifestation of an anomaly and generates alarms.
[Figure: failure cascade example. Legend: allocation, failure modes, discrepancies, propagation links, alarms.]
TFPG Example
Rocket Engine
TFPG Example
Propellant Tank
TFPG Example Turbo Pump
TFPG Example
Combustion Chamber
TFPG Reasoning
On-line diagnostics:
Input: Sequence of alarms and mode changes
Output: Sequence of sorted and ranked hypotheses containing failure mode(s)
that explain the observations (alarms, mode changes)
TFPG Hypothesis
TFPG Hypothesis: estimation of the current system state.
Directly, points to failure modes that “best” explain the
current set of observed alarms.
Indirectly, points to failed monitored discrepancies; those
with a state that is inconsistent with the (hypothesized) state
of the failure modes
Structure
List of possible Failure Modes
List of alarms in each set (Consistent (C) / Inconsistent (IC) / Missing (M) / Expected (E))
Metrics : Plausibility/ Robustness/ Failure Rate
Hypotheses Evaluation Metrics
Hypotheses are evaluated based on the following
metrics:
Plausibility: reflects the support of a hypothesis based on the
current observed alarm state. It answers the question: Which
hypothesis to consider?
Robustness: reflects the potential of a hypothesis (evidence)
to change based on remaining alarms. It answers the question:
When to take an action?
Failure Rate: is a measure of how often a particular failure
mode has occurred in the past.
Run-time System
Diagnostics Engine
Algorithm outline:
Check if new evidence is explained by
current hypotheses.
If not, create a new hypothesis that
assumes a hypothetical state of the
system consistent with observations
Rank hypotheses for plausibility and
robustness
Discard low-rank hypotheses, keep
plausible ones
Fault state: ‘total state vector’ of the
system, i.e. all failure modes and
discrepancies
Alarms could be
Missing: should have fired but did not
Inconsistent: fired, but it is not consistent
with the hypothesis
Robust diagnostics: tolerates missing and
inconsistent alarms
Metrics:
Plausibility: how plausible is the
hypothesis w.r.t alarm consistency
Robustness: how likely is that the
hypothesis will change in the future
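The hypothesis-evaluation idea can be sketched for the untimed, single-mode case; the plausibility formula below is a simplified stand-in for the actual metric, and all names are illustrative:

```python
def reachable(edges, sources):
    """All nodes reachable from the hypothesized failure modes."""
    seen, stack = set(sources), list(sources)
    while stack:
        n = stack.pop()
        for s, d in edges:
            if s == n and d not in seen:
                seen.add(d)
                stack.append(d)
    return seen

def evaluate(edges, monitored, hypothesis, fired):
    """Classify each monitored discrepancy w.r.t. a failure-mode set."""
    explained = reachable(edges, hypothesis)
    consistent = {a for a in fired if a in explained}
    inconsistent = fired - consistent             # fired, but not explained
    missing = (monitored & explained) - fired     # expected, did not fire
    # Simplified plausibility: fraction of relevant alarms consistent.
    relevant = len(consistent) + len(inconsistent) + len(missing)
    plausibility = len(consistent) / relevant if relevant else 1.0
    return consistent, inconsistent, missing, plausibility

edges = [("FM1", "D1"), ("D1", "D2"), ("FM2", "D3")]
monitored = {"D1", "D2", "D3"}
# Alarms D1 and D2 fired: hypothesis {FM1} explains both, {FM2} none.
_, _, _, p1 = evaluate(edges, monitored, {"FM1"}, {"D1", "D2"})
_, _, _, p2 = evaluate(edges, monitored, {"FM2"}, {"D1", "D2"})
```

Ranking the hypotheses by this score and discarding the low-rank ones mirrors the algorithm outline above.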
Run-time System
Diagnostics Engine
Novel properties:
Multi-fault hypothesis is the default
Fault state == State of all failure
modes/discrepancies
Reasoner works with sets of failure modes
(instead of individual failure modes)
Robust algorithm: can tolerate
missing/inconsistent alarms
Parsimony principle: Use simplest
explanation
Time-dependent diagnosis
Reasoner can be asked to recompute
diagnosis upon the advance of time
Extensions:
Modal edges: Propagation happens only if
edge is enabled (controlled by system
mode)
Diagnosis takes into consideration the last
propagation effect
Non-monotonic alarms:
Alarm retraction triggers a re-computation
of the diagnosis
Run-time System
Discrete (TFPG) Diagnostics
Additional capabilities:
Intermittent failure modes
Consequence: alarm/s change to ‘Off’
Assumption: low frequency intermittents
Upon alarm changing to ‘Off’, backtrack to
last change to ‘On’ and re-evaluate
Maintain alternate branches (for alarms ‘On’
and ‘Off’)
Test alarms: can be considered only
after activation
If inactive, it is an un-monitored
discrepancy.
If activated, it is used but timing may be
inconsistent (re: parent’s timing)
Metrics summary:
Plausibility and robustness (formulas given in the figure)
For n failure modes and m discrepancies, the maximum number of hypotheses is n^m, but in practice it is more likely O(n).
Updating a hypothesis is polynomial in the number of nodes and exponential w.r.t. sensor faults.
Model  #C   #FM  #D    #A   #M  #P    #R  Avg. Time (sec)
#1     15   36   48    21   0   120   1   0.000311
#2     11   36   120   174  27  3     1   0.000445
#3     153  481  1973  270  9   3409  1   0.013589
#4     24   64   116   116  0   695   4   0.016
#5     21   100  282   69   0   431   18  0.00288
• Key: #C – Components / #FM – Failure Modes / #D – Discrepancies / #A – Alarms / #M – Modes / #P – Propagation links / #R – Regions
• Avg. Time = average computational time taken by the reasoner (in seconds) after every event, on a 2.67 GHz Intel Xeon CPU with 8 GB RAM.
Performance Evaluation
Tool Operations
1. Modeling
2. Desktop experimentation, validation
3. Feedback
4. Deployment on embedded platform (via model interpretation)
Software Health-Management
Motivation: Software as Failure Source?
Qantas 72 - Oct 7, 2008 – A330 (Australia) – ATSB Report
At 1240:28, while the aircraft was cruising at 37,000 ft, the autopilot disconnected. From about
the same time there were various aircraft system failure indications. At 1242:27, while the crew was evaluating the situation, the aircraft abruptly pitched nose-down. The aircraft reached a maximum pitch angle of about 8.4 degrees nose-down, and descended 650 ft during the event. After returning the aircraft to 37,000 ft, the crew commenced actions to deal with multiple failure messages. At 1245:08, the aircraft commenced a second uncommanded pitch-down event. The aircraft reached a maximum pitch angle of about 3.5 degrees nose-down, and descended about 400 ft during this second event. At 1249, the crew made a PAN urgency broadcast to air traffic control, and requested a clearance to divert to and track direct to Learmonth. At 1254, after receiving advice from the cabin of several serious injuries, the crew declared a MAYDAY. The aircraft subsequently landed at Learmonth at 1350.
The investigation to date has identified two significant safety factors related to the pitch-down movements. Firstly, immediately prior to the autopilot disconnect, one of the air data inertial reference units (ADIRUs) started providing erroneous data (spikes) on many parameters to other aircraft systems. The other two ADIRUs continued to function correctly. Secondly, some of the spikes in angle of attack data were not filtered by the flight control computers, and the computers subsequently commanded the pitch-down movements.
http://www.atsb.gov.au/publications/investigation_reports/2008/AAIR/pdf/AO2008070_interim.pdf
Understanding the Problem
Embedded software is a complex engineering artifact that can have latent
faults, uncaught by testing and verification. Such faults become apparent
during operation when unforeseen modes and/or (system) faults appear.
The problem:
General: How to construct a Software Health Management system that
detects such faults, isolates their source/s, prognosticates their progression,
and takes mitigation actions in the system context?
Specific: How to specify, design, and implement such a system using a model-
based framework?
The larger picture:
General: Software Health Management must be integrated with System
Health Management – ‘Software Health Effects’ must be understood on the
System (Vehicle) Level.
What is ‘Systems Health Management’ ?
The ‘on-line’ view:
1. Detection of anomalies in system or component behavior
2. Identification and isolation of the fault source/s
3. Prognostication of impending faults that could lead to system failures
4. Mitigation of current or impending fault effects while preserving mission objective/s
Detection
Isolation
Prognostics
Mitigation
Observations Corrections
Reports
Examples:
- Automotive OBD (detection)
- Boeing 777 CMC (detection + isolation)
- Spacecraft fault protection (detection + isolation + mitigation)
- Aircraft fleet (detection + isolation + prognostics)
Software Health Management
Software is a complex
engineering artifact.
Software can have latent faults.
Faults appear during operation
when unforeseen modes or
interactions happen.
Techniques like Voting and Self-
Checking pairs have
shortcomings
Common mode faults
Fault cascades
• SHM is the extension of FDIR
techniques used in Physical systems to
Software.
[Figure: SHM as an FDIR loop. Fault detection compares observed inputs and observed behavior against expected stimuli/responses under environmental and domain assumptions; detection feeds fault isolation and fault mitigation.]
Why ‘Software Health Management’?
Complexity of systems necessitates an additional layer ‘above’ SFT that manages ‘Software Health’
Embedded software …. is a crucial ingredient in aerospace systems
is a method for implementing functionality
is the ‘universal system integrator’
could exhibit faults that lead to system failures
complexity has progressed to the point that zero-defect systems (containing both hardware and software) are very difficult to build
Systems Health Management is an emerging field that addresses precisely this problem: How to manage systems’ health in case of faults ?
‘Software Health Management’ is not…
A replacement for existing and robust engineering processes and standards
A substitute for hardware- and software fault tolerance
An ‘ultimate’ solution for fault tolerance
Software Health Management
Key ideas
Use software components as units of fault management: detection, diagnosis,
and mitigation
Components must be observable, provide fault isolation, and be capable of mitigation
Use a two-level architecture:
Component level: detect anomalies and mitigate locally
System level: receive anomaly reports, isolate the faulty component(s), and mitigate on the component
Use models to represent
anomalous conditions
fault cascades
mitigation actions (when / what)
Use model-based generators to synthesize code artifacts
Developer can use higher-level abstractions to design and implement the
software health management functions of a system
Software Component Framework
The Component Model should enable:
Monitoring
Interfaces (synchronous/asynchronous calls)
Component state
Scheduling and timing (WCET)
Resource usage
Anomaly Detection via:
Pre/post conditions over call parameters, rates, and component state
Conditions over (1) timing properties, (2) resource usage (e.g. memory footprint), and (3)
usage patterns
Combinations of the above
Mitigation:
Given detected anomaly and state of the component take action
Can be time- or event-triggered
Actions: restart, initialize, block call, inject value, inject call, release resource, modify state;
checkpoint/restore, combination of the above
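The pre/post-condition and deadline monitoring above can be sketched as a wrapper around a component method; the class, condition, and anomaly names are hypothetical, not the actual framework API:

```python
import time

class MonitoredPort:
    """Wraps a component method with pre/post conditions and a deadline."""
    def __init__(self, func, pre=None, post=None, deadline_s=None):
        self.func, self.pre, self.post = func, pre, post
        self.deadline_s = deadline_s
        self.anomalies = []          # reported to the local health manager

    def __call__(self, *args):
        if self.pre and not self.pre(*args):
            self.anomalies.append("PRE_VIOLATION")
        start = time.monotonic()
        result = self.func(*args)
        elapsed = time.monotonic() - start
        if self.deadline_s is not None and elapsed > self.deadline_s:
            self.anomalies.append("DEADLINE_VIOLATION")
        if self.post and not self.post(result):
            self.anomalies.append("POST_VIOLATION")
        return result

# Usage: guard a scaling method with range conditions and a deadline.
port = MonitoredPort(lambda x: x * 0.5,
                     pre=lambda x: 0.0 <= x <= 100.0,
                     post=lambda r: r <= 50.0,
                     deadline_s=1.0)
port(10.0)     # nominal call
port(-5.0)     # out-of-range argument: raises a PRE_VIOLATION anomaly
```

In the actual framework the recorded anomalies would trigger the component-level mitigation actions listed above rather than just accumulate in a list.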
Notional Component Model
A component is a unit (containing potentially many objects). The component is parameterized, has state, consumes resources, publishes and subscribes to events, provides interfaces to other components, and requires interfaces from them.
Publish/Subscribe: Event-driven, asynchronous communication (publisher does not wait)
Required/Provided: Synchronous communication using call/return semantics.
Triggering can be periodic or sporadic.
Extension of a Component Model defined by OMG (CCM) : state, resource, trigger interfaces.
[Figure: notional component with publish/subscribe event ports, provided/required interfaces, state, parameters, and resource and trigger interfaces.]
Example: Component Interactions
Components can interact via asynchronous/event-triggered and synchronous/call-driven connections.
Example: The Sampler component is triggered periodically and it publishes an event upon each
activation. The GPS component subscribes to this event and is triggered sporadically to obtain
GPS data from the receiver, and when ready it publishes its own output event. The Display
component is triggered sporadically via this event and it uses a required interface to retrieve the
position data from the GPS component.
[Figure: Sampler component publishes (P) an event; the GPS component subscribes (S) to it and publishes its own event; the Display component subscribes (S) and calls back into the GPS component.]
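The Sampler/GPS/Display interaction pattern can be sketched with a toy in-process broker; all names are illustrative, and the synchronous dispatch is a simplification (the real framework delivers events asynchronously and sporadically/periodically per port):

```python
class Broker:
    """Minimal publish/subscribe dispatcher."""
    def __init__(self):
        self.subs = {}

    def subscribe(self, topic, callback):
        self.subs.setdefault(topic, []).append(callback)

    def publish(self, topic, data):
        for cb in self.subs.get(topic, []):
            cb(data)                 # asynchronous in the real framework

class GPS:
    def __init__(self, broker):
        self.position = None
        self.broker = broker
        broker.subscribe("tick", self.on_tick)

    def on_tick(self, _):
        self.position = (36.14, -86.80)   # read the receiver (stubbed)
        self.broker.publish("gps_ready", None)

    def get_position(self):               # provided interface
        return self.position

class Display:
    def __init__(self, broker, gps):
        self.shown = None
        self.gps = gps
        broker.subscribe("gps_ready", self.on_gps)

    def on_gps(self, _):
        # Required interface: synchronous call back into the GPS component.
        self.shown = self.gps.get_position()

broker = Broker()
gps = GPS(broker)
display = Display(broker, gps)
broker.publish("tick", None)    # stands in for the Sampler's periodic event
```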
Component Monitoring
Component
Monitor arriving events
Monitor incoming calls
Monitor published events
Monitor outgoing calls
Observe state
Monitor resource usage
Monitor control flow / triggering
ACM:
The ARINC Component Model
Provide a CCM-like layer on top of ARINC-653 abstractions
Notional model:
Terminology:
Synchronous: call/return
Asynchronous: publish-return / trigger-process
Periodic: time-triggered
Aperiodic: event-triggered
Note:
All component interactions are realized via the framework
Process (method) execution time has deadline, which is monitored
ACM:
The ARINC Component Model
Each ‘input interface’ has its own process
Process must obtain read-write/lock on component
Asynchronous publisher (subscriber) interface:
Listener (publisher) process
Pushes (receives) one event (a struct), with a validity flag
Can be event-triggered or time-triggered (i.e. 4 variations)
Synchronous provided (required) interface:
Handles incoming synchronous RMI call
Forwards outgoing synchronous RMI call
Other interfaces:
State: to observe component state variables
Resource: to monitor resource usage
Trigger: to monitor execution timing
ACM:
A Prototype Implementation
ARINC-653 Emulator
Emulates APEX services using Linux API-s
Partition → Process, Process → Thread
Module manager: schedules partition set
Partition level scheduler: schedules threads within partition
CORBA foundation
CCM Implementation
No modification
ACM component interactions
Mainly implemented via APEX
RMI interactions use threads
Implementation: Mapping ACM to APEX
APEX Abstraction   Platform (Linux)
Module             Host/Processor
Partition          Process
Process            Thread
ACM concept → APEX realization (APEX concepts used):
Component method:
• Periodic → periodic process (process start/stop, semaphores)
• Sporadic → aperiodic process
Invocation, synchronous call-return:
• Periodic target, co-located or non-co-located: N/A
• Sporadic target, co-located: caller method signals callee to release, then waits for callee until completion (event, blackboard)
• Sporadic target, non-co-located: caller method sends an RMI (via CM) to release the callee, then waits for the RMI to complete (TCP/IP, semaphore, event)
Invocation, asynchronous publish-subscribe:
• Periodic target, co-located: callee is periodically triggered and polls the event buffer; a validity flag indicates whether the data is stale or fresh (blackboard)
• Periodic target, non-co-located: as above (sampling port, channel)
• Sporadic target, co-located: callee is released when an event is available (blackboard, semaphore, event)
• Sporadic target, non-co-located: caller notifies via TCP/IP; callee is released upon receipt (queuing port, semaphore, event)
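The sporadic co-located call-return row can be sketched with release/completion events; Python threading primitives stand in for the APEX event and blackboard services, and all names are ours:

```python
import threading

class SporadicCallee:
    """Aperiodic 'provider' process released by a caller's signal."""
    def __init__(self):
        self.release = threading.Event()   # caller -> callee: go
        self.done = threading.Event()      # callee -> caller: finished
        self.result = None
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        self.release.wait()          # aperiodic process waits for release
        self.result = 42             # the provided method's work (stubbed)
        self.done.set()              # signal completion back to the caller

    def call(self, timeout=1.0):
        """Caller side: release the callee, then wait for completion."""
        self.release.set()
        if not self.done.wait(timeout):
            raise TimeoutError("callee missed its deadline")
        return self.result

callee = SporadicCallee()
value = callee.call()
```

The timeout on the wait corresponds to the deadline monitoring on process execution noted earlier.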
ACM:
Modeling Language
Modeling elements:
Data types: primitive, structs, vectors
Interfaces: methods with arguments
Components:
Publish/Subscribe ports (with data type)
Provided/Required interfaces (with i/f type)
Health Manager
Assemblies
Deployment
Modules, Partitions
Component Partition
Example: Sensor/GPS/Display
component Sensor {
  publishes SensorOutput data_out ;
};

component GPS {
  publishes SensorOutput data_out ; //APERIODIC
  consumes SensorOutput data_in ; //PERIODIC
  provides GPSDataSource gps_data_src ;
};

component NavDisplay {
  consumes SensorOutput data_in ; //APERIODIC
  uses GPSDataSource gps_data_source ;
};

struct SensorOutput {
  Timespec time ;
  SensorData data ;
};

struct SensorData {
  FLOATINGPOINT alpha ;
  FLOATINGPOINT beta ;
  FLOATINGPOINT gamma ;
};

struct Timespec {
  LONGLONG tv_sec ;
  LONGLONG tv_nsec ;
};

interface GPSDataSource {
  void getGPSData (out GPSData d);
};

[Figure: Sensor publishes data_out; GPS consumes it on data_in, reads/updates its state, and publishes its own data_out; NavDisplay consumes that event and invokes get on gps_data_src to read the GPSValue.]
Component   Port           Period     Time Capacity  Deadline
Sensor      data_out       4 sec      4 sec          Hard
GPS         data_out       aperiodic  4 sec          Hard
GPS         data_in        4 sec      4 sec          Hard
GPS         gps_data_src   aperiodic  4 sec          Hard
Navdisplay  data_in        aperiodic  4 sec          Hard
Navdisplay  gps_data_src   aperiodic  4 sec          Hard
Anomaly Detection
Model-based specification of monitoring expressions:
•Pre/post condition violations: threshold, rate, custom filter (moving average)
•Resource violations: deadline
•Validity violation: stale data on a consumer
•Concurrency violations: lock timeouts
•User code violations: reported error conditions from application code
Code generators synthesize code for implementing the monitors
Port monitors: arriving events, incoming calls, published events, outgoing calls
Non-port monitors: component state, resource usage, control flow / triggering
• Based on these local detections, each component developer can implement a local health manager
• It is a reactive timed state machine with pre-specified actions
• All alarms and actions are reported to the system health manager
ACM:
Modeling Language: Monitoring
Monitoring on component interfaces:
Subscriber port → ‘Subscriber process’ and Publisher port → ‘Publisher process’
Monitor: pre-conditions and post-conditions
On subscriber: data validity (‘age’ of data)
Deadline (hard / soft)
Provided interface → ‘Provider methods’ and Required interface → ‘Required methods’
Monitor: pre-conditions and post-conditions
Deadline (hard / soft)
Can be specified on a per-component basis
Monitoring language:
Simple, named expressions over input (output)
parameters, component state, delta(var), and
rate(var,dt). The expression yields a Boolean condition.
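A possible reading of the delta(var) and rate(var,dt) primitives, sketched over a sampled variable history; the evaluation scheme and names are our assumption, not the actual monitoring-language semantics:

```python
class VarHistory:
    """Timestamped samples of one monitored variable."""
    def __init__(self):
        self.samples = []            # (time, value) pairs

    def record(self, t, v):
        self.samples.append((t, v))

    def delta(self):
        """Change of the variable since the previous sample."""
        if len(self.samples) < 2:
            return 0.0
        return self.samples[-1][1] - self.samples[-2][1]

    def rate(self, dt):
        """Average change per dt over the last two samples."""
        if len(self.samples) < 2:
            return 0.0
        (t0, v0), (t1, v1) = self.samples[-2], self.samples[-1]
        return (v1 - v0) / (t1 - t0) * dt

h = VarHistory()
h.record(0.0, 10.0)
h.record(2.0, 14.0)
# A named Boolean condition, as in the modeling language:
rate_too_high = h.rate(1.0) > 1.0    # average rate is 2.0 per unit time
```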
Component-level Health Management
Manager’s behavioral model:
Finite-state machine
Triggers: monitored events, time
Actions: mitigation activities
Manager is local to component
container (for efficiency) but shall be
protected from the faults of functional
components
Notional behavior:
Track component state changes via
detected events and progression of
time
Take mitigation actions as needed
Design issues:
Co-location with component (fault containment)
Local detection may implicate another component
[Figure: the component framework hosts a monitor and a health manager; monitored events from the component drive the manager, which issues mitigation actions back on the component.]
Example state machine: states Idle and Exec; the start event moves Idle to Exec and finish moves back; a WCET timeout triggers the ‘init’ action, and an invA_violation (a violation on invocation port invA) triggers ‘reset’.
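The example machine above can be sketched as a table from (event, state) to (next state, action), following the reactive timed state machine idea; the transition set is our reading of the figure, and names are illustrative:

```python
# (event, state) -> (next_state, mitigation_action or None)
TRANSITIONS = {
    ("start", "Idle"): ("Exec", None),
    ("finish", "Exec"): ("Idle", None),
    ("timeout", "Exec"): ("Idle", "init"),          # WCET exceeded
    ("invA_violation", "Exec"): ("Idle", "reset"),  # bad invocation on invA
}

class ComponentHealthManager:
    def __init__(self):
        self.state = "Idle"
        self.actions = []            # mitigation actions taken (reported up)

    def dispatch(self, event):
        key = (event, self.state)
        if key not in TRANSITIONS:
            return                   # event is ignored in this state
        self.state, action = TRANSITIONS[key]
        if action:
            self.actions.append(action)

hm = ComponentHealthManager()
hm.dispatch("start")       # Idle -> Exec
hm.dispatch("timeout")     # WCET violation: back to Idle, run 'init'
```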
ACM - Modeling Language:
Component Health Manager
Reactive Timed State Machine
Event trigger:
Predefined conditions (e.g. deadline violation, data validity violation)
User-defined conditions (e.g. pre-condition violation)
Reaction: mitigation action (start, reset, refuse, ignore, etc.)
State: current state of the machine
(Event X State) Action
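The reactive (Event × State) → Action machine can be sketched as a transition table. The states and events below mirror the Idle/Exec example from the slides; the API itself is a hypothetical stand-in for the generated health-manager code.

```python
# Transition table: (state, event) -> (next_state, mitigation_action)
# State, event, and action names follow the slide's example; None = no action.
TRANSITIONS = {
    ("Idle", "start"):          ("Exec", "init"),
    ("Exec", "finish"):         ("Idle", None),
    ("Exec", "timeout"):        ("WCET_Violation", "abort"),   # deadline violation
    ("Exec", "invA_violation"): ("Idle", "reset"),             # pre-condition violation
}

class HealthManager:
    def __init__(self):
        self.state = "Idle"
        self.log = []  # alarms/actions reported to the system health manager

    def handle(self, event):
        nxt, action = TRANSITIONS.get((self.state, event), (self.state, None))
        self.state = nxt
        if action:
            self.log.append((event, action))
        return action

hm = HealthManager()
hm.handle("start")
assert hm.handle("invA_violation") == "reset"
assert hm.state == "Idle"
```

Keeping the behavior in a declarative table is what makes the generated manager amenable to model checking, as noted later in the deck.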
Component Health Management: Available Actions
[Diagram: component states NOMINAL, ERROR, CHECKRESULT, FAILURE; an error message triggers an HM action (ERROR → CHECKRESULT); a successful action returns to NOMINAL, while a timeout or failed action leads to FAILURE. The Component Health Manager runs as a high-priority ARINC-653 process; incoming events are queued in a buffer (blocking read), and HM responses are delivered to the component ports (ARINC-653 processes) via blackboards]
Architecture
Assembly Definition
Model-Based Software Health Management
The Sensor component is triggered periodically and it publishes an event upon each activation.
The GPS component subscribes to this event and is triggered periodically to obtain GPS data from the receiver. It publishes its own output event.
The Nav Display component is triggered sporadically via this event and it uses a required interface to retrieve the position data from the GPS component.
Specified monitoring conditions:
Validity(GPS.data_in) < 4 ms
Delta(Nav.data_in.time) > 0
Rate(gps_data_src.data) > 1
System-level Health Management
Focus issue: Cascading faults
Hypothesis: Fault effects cascade via component interactions
Anomalies detected at the component level are not 'diagnosed'; they may be caused by faults in other components
Problem:
How to model fault cascades?
How to diagnose and isolate fault cascade root causes?
How to mitigate fault cascades?
Recap: Fault diagnosis
Model: Timed Failure Propagation Graphs
Modeling variants
•Untimed, causal network (no modes, propagation = [0..inf])
•Modal networks: edges are mode dependent
•Timed models
•Hierarchical component models
Nodes:
•Failure modes
•Discrepancies
• AND/OR (combination)
• Monitored (option)
Edges:
•Propagation delay: [min, max]
•Discrete Modes (activation)
Example models (#components, #failure modes, #alarms)
•Trivial examples
•Simplified fuel system (~30,~80,~100)
•Realistic fuel system (~200,~400,~600)
•Aircraft avionics (~2000,~8000,~25000) – generated
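A minimal TFPG representation along these lines, with [min, max] propagation delays and mode-dependent edge activation, might look like the sketch below. Node and mode names are invented for illustration; a real diagnoser reasons over alarm timing, not just reachability.

```python
class TFPG:
    """Timed failure propagation graph sketch: nodes are failure modes or
    discrepancies; edges carry [tmin, tmax] delays and optional active modes."""

    def __init__(self):
        self.edges = {}  # src -> list of (dst, tmin, tmax, modes)

    def add_edge(self, src, dst, tmin, tmax, modes=None):
        self.edges.setdefault(src, []).append((dst, tmin, tmax, modes))

    def reachable(self, fm, mode, horizon):
        """Discrepancies reachable from failure mode fm within `horizon`
        time units, via edges active in the given mode."""
        seen, frontier = {fm: 0.0}, [fm]
        while frontier:
            n = frontier.pop()
            for dst, tmin, tmax, modes in self.edges.get(n, []):
                if modes is not None and mode not in modes:
                    continue  # edge not active in this mode
                t = seen[n] + tmin  # earliest possible arrival
                if t <= horizon and (dst not in seen or t < seen[dst]):
                    seen[dst] = t
                    frontier.append(dst)
        return {n for n in seen if n != fm}

g = TFPG()
g.add_edge("FM_pump", "D_low_pressure", 1, 5)
g.add_edge("D_low_pressure", "D_engine_stall", 10, 20, modes={"cruise"})
assert g.reachable("FM_pump", "cruise", horizon=30) == {"D_low_pressure", "D_engine_stall"}
assert g.reachable("FM_pump", "taxi", horizon=30) == {"D_low_pressure"}
```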
Recap: Fault diagnosis
• Outline:
– Check if new evidence is explained
by current hypotheses.
– If not, create a new hypothesis that
assumes a hypothetical state of the
system consistent with observations
– Rank hypotheses for plausibility and
robustness metrics
– Discard low-rank hypotheses, keep plausible ones
Fault state: the 'total state vector' of the system, i.e., all failure modes and discrepancies
Alarms could be
Missing: should have fired but did not
Inconsistent: fired, but it is not consistent
with the hypothesis
Robust diagnostics: tolerates missing and
inconsistent alarms
Metrics:
Plausibility: how plausible is the
hypothesis w.r.t. alarm consistency
Robustness: how likely it is that the hypothesis will change in the future
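One simple way to realize such a plausibility metric is a Jaccard-style score over predicted versus observed alarms, where both missing alarms (predicted but not fired) and inconsistent alarms (fired but not predicted) lower the score. This is an illustrative stand-in, not the actual diagnoser's metric.

```python
def plausibility(predicted, observed):
    """Score in [0, 1]: missing alarms (predicted - observed) and
    inconsistent alarms (observed - predicted) both enlarge the union
    and thus reduce the score."""
    predicted, observed = set(predicted), set(observed)
    union = predicted | observed
    return len(predicted & observed) / len(union) if union else 1.0

# Hypotheses map an assumed failure mode to the alarms it would explain
# (names are invented for illustration):
hyps = {
    "FM_pump":   {"D_low_pressure", "D_engine_stall"},
    "FM_sensor": {"D_low_pressure"},
}
observed = {"D_low_pressure", "D_engine_stall"}
ranked = sorted(hyps, key=lambda h: plausibility(hyps[h], observed), reverse=True)
assert ranked[0] == "FM_pump"
```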
Modeling Cascading Faults
Not needed - the cascades can be computed from the
component assemblies, if the anomaly types and their
interactions are known.
Component 'elements': every component method belongs to one of seven element types
Fault cascades within a component follow recurring patterns (a few of the 38 patterns are shown)
Modeling Cascading Faults
Inter-component propagation is regular – always follows the same pattern
Intra-component propagation depends on the component! Need to model internal dataflow and control flow of the component.
Note: Could be determined via source code analysis.
Modeling Cascading Faults
Fault Propagation Graph for GPS Example
Here: hand-crafted, but it is generated automatically in the
system
System-level Fault Mitigation
Model-based system-level mitigation engine:
Model-based diagnoser is automatically generated
Designer specifies fault mitigation
strategies using a reactive state machine
[Diagram: component platform hosting managed components, each with a component health manager (CHM) and a component fault model of failure modes (FM) and discrepancies (D); alarms feed the Diagnoser Engine, which drives the Mitigation Engine]
Advantages:
• Models are higher-level
programs to specify
(potentially complex)
behavior – more readable and
comprehensible
•Models lend themselves to
formal analysis – e.g. model
checking
System-level Fault
Mitigation
Model-based mitigation specification at
two levels
Component level: quick action
System level: Reactive action taking the
system state into consideration
System designer specifies them as a
parallel timed state machine.
Fixed set of mitigation actions are
available
Runtime code is generated from
models
HM Action Semantics
CLHM IGNORE: continue as if nothing has happened
CLHM ABORT: discontinue the current operation; the operation can run again
CLHM USE PAST DATA: use the most recent data (only for operations that expect fresh data)
CLHM STOP: discontinue the current operation. Aperiodic processes (ports): the operation can run again. Periodic processes (ports): the operation must be enabled by a future START HM action
CLHM START: re-enable a STOP-ped periodic operation
CLHM RESTART: a macro for STOP followed by START for the current operation
SLHM RESET: stop all operations, initialize the component state, clear all queues, start all periodic operations
SLHM STOP: stop all operations
List of predefined Mitigation Actions
System-level Health Management
Functional components
1. Aggregator:
Integrates (collates) health information coming
from components (typically in one hyperperiod)
2. Diagnoser:
Performs fault diagnosis, based on the fault
propagation graph model
Ranks hypotheses
The component that appears in all of the highest-ranked hypotheses is chosen for mitigation
3. Response Engine:
Issues mitigation actions to components based
on diagnosis results
Based on a state machine model that maps
diagnostic results to mitigation actions
These components are generated
automatically from the models
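The selection rule above, picking the component implicated by all highest-ranked hypotheses, can be sketched as follows; the helper function and the hypothesis data are illustrative assumptions.

```python
def mitigation_targets(ranked_hypotheses):
    """ranked_hypotheses: list of (rank, set_of_suspect_components).
    Returns the components common to all hypotheses of the highest rank."""
    top = max(rank for rank, _ in ranked_hypotheses)
    top_sets = [suspects for rank, suspects in ranked_hypotheses if rank == top]
    return set.intersection(*top_sets)

# Two equally plausible hypotheses implicate GPS; a weaker one implicates Sensor:
hyps = [
    (0.9, {"GPS", "NavDisplay"}),
    (0.9, {"GPS"}),
    (0.4, {"Sensor"}),
]
assert mitigation_targets(hyps) == {"GPS"}
```

The Response Engine would then feed this target set into the mitigation state machine.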
The Health Management Approach:
1. Locally detected anomalies are mitigated
locally first. – Quick reactive response.
2. Anomalies and local mitigation actions are
reported to the system level.
3. Aggregated reports are subjected to
diagnosis, potentially followed by a system-
level mitigation action.
4. System-level response commands are
propagated down to components.
Example: 2005 Malaysia Airlines Boeing 777 in-flight upset
Low airspeed advisory.
Airplane’s autopilot experienced excessive acceleration values.
Vertical acceleration decreased to -2.3g within ½ second
Lateral acceleration decreased to -1.01g (left) within ½ second
Longitudinal acceleration increased to +1.2 g within ½ second
Autopilot pitched the nose up to 17.6 degrees and the aircraft climbed at a vertical speed
of 10,650 fpm.
Airspeed reduced to 241 knots.
Stick shaker activated at top of the climb.
Aircraft descended 4,000 ft.
Re-engagement of autopilot followed by another climb of 2,000 ft.
Maximum rate of climb = 4440 fpm.
B777 ADIRU Architecture
• Designed to be serviceable with
one fault in each FCA
• Can fly but maintenance
required upon landing with two
faults in each FCA
• Each ARINC 629 end unit voted
on the processor data bit-by-
bit.
• Processors monitor the ARINC 629 modules by full data wrap-around
• Processors also monitor the
power supplies, any one of
which can power the entire unit
• Accelerometer and gyro in
skewed redundant configuration
• A SAARU (Secondary Attitude Air data Reference Unit) also provided inertial data
Based on Air Data Inertial Reference Unit (ADIRU)
Architecture (ATSB, 2007, p.5)
Cause of Inflight Upset
June 2001: accelerometer 5 fails with a high output value; the ADIRU disregards it.
A power cycle of the ADIRU occurs. Due to a latent software bug, the recorded faulty status
of accelerometer 5 is disregarded.
The status of the failed unit was recorded in on-board maintenance memory, but that memory was
not checked by the software.
An inflight fault was recorded in accelerometer 6 and it was disregarded.
FDI software allowed use of accelerometer 5.
High acceleration value was passed to all computers.
Due to common-mode nature of fault, voters allowed high accelerometer data to
go on all channels.
This high value was used by primary flight computer.
The mid-value select function used by the flight computer lessened the effect of the pitch
motion.
Problem: the system relied on redundancy to mask a fault, but due to the latent software bug and the common-mode fault, the effect cascaded into a system failure.
Reading material: 'The dangers of failure masking in fault-tolerant software: aspects of a recent in-flight upset event', C.W. Johnson and C.M. Holloway, IET Conf. Pub. 2007, 60 (2007), DOI:10.1049/cp:20070442
Case Study
• Modeled the architecture as a software component assembly
• Created the fault scenario
• Only modeled part of the system
to illustrate the point of SHM
• Accelerometers are arranged on six faces of a dodecahedron; the redundant, skewed measurements are combined via regression equations
ADIRU Assembly (Accelerometers)
Runs at 20 Hz
ADIRU Assembly (Processors)
Observer tracks the age
of accelerometer data.
Specified as timed state
machine (with timeout)
Runs at 20 Hz
ADIRU Assembly (Voters)
Runs at 20 Hz
ADIRU Assembly (Display- Mimics PFC)
Runs aperiodically
Deployment Model
Each Module is a processor running the
ARINC Component Runtime Environment
Execution
Accelerometers: machine durip02
ADIRU computers: machine durip03
Voters + display computer: machine durip06
SHM: machine durip09
System Health Manager
These components are auto-generated.
The hypothesis generated by the diagnoser is translated into the component(s) most likely to be faulty. This list is fed to the Response Engine, which triggers the mitigation state machine.
The other machines have similar specifications.
Demonstration
Fault Scenario
Accelerometer 5 has an initial fault
When it is started, it causes an alarm
Then accelerometer 6 develops a fault
Successful mitigation:
Identifying the faulty components
Stopping the faulty components
Processors can still function with four accelerometers.
Demonstration: Faulty Scenario
Resilient architectures and
autonomy
Resilience and autonomy
Model-based Software Health Management
Requires explicit specification of component-level and system-level health management (recovery) actions
Complex and error-prone… too many options!
Resilient systems should recover autonomously
Concepts:
Model the system architecture + functions.
Express what is needed from the system to implement functions.
Embed models into the run-time system
Use a reasoner to figure out how to recover function upon failures
Modeling
Functional Requirements for IMU
Inertial Position
• Determine inertial position.
• Functional Requirement (AND)
GPS Position
Position Tracking
GPS Position
• Sense GPS position for computing
Inertial Position
Position Tracking
• Continuously track position to compute
Inertial position
• Functional Requirement
Body Acceleration Measurement
Body Acceleration Measurement
• Sense body acceleration for Position
Tracking.
[Function decomposition diagram: Inertial Position = AND(GPS Position, Position Tracking); Position Tracking requires Body Acceleration Measurement]
Modeling
Complete Redundant Architecture
Modeling the Architecture
Function Allocation
Body Acceleration Measurement: EXACTLY ONE of (Primary / Secondary ADIRU Subsystem)
ADIRU Subsystem has
• Accelerometers (6)
• ADIRU Computers (4)
• Voters (3)
Functional / Operational ADIRU Subsystem
requires
• ATLEAST 4 of 6 Accelerometers
• ATLEAST 2 of 4 Filters or ADIRU
computers
• ATLEAST 1 of 3 Voters
Inside one ADIRU:
Modeling the Architecture
Function Allocation
GPS Position: EXACTLY ONE of (Primary / Secondary GPS Subsystem)
GPS Subsystem includes
GPS Receiver (1)
GPS Processor (1)
Functional / Operational GPS
subsystem requires
EXACTLY ONE of GPS Receiver
EXACTLY ONE of GPS Processor
Inside one GPS Subsystem:
Modeling the Architecture
Function Allocation
Position Tracking: ATLEAST ONE of (Left / Center / Right PFC NavFilter Subsystem)
PFC NavFilter Subsystem includes
PFC Nav Filter (1)
PFC Processor (1)
Functional/ Operational Requirement
for PFC Subsystem
EXACTLY ONE PFC NavFilter
EXACTLY ONE PFC Processor
Inside one PFC Subsystem:
Component Operational Requirement
EXPLICIT – Local dependency
Display Subsystem
ATLEAST 1 of 3 Consumers (Left, Center, Right)
EXPLICIT – Local dependency
ADIRU Computer inside ADIRU Subsystem
ATLEAST 4 of 6 Consumer Port
Implies
ATLEAST 4 of 6 Accelerometer Components
Component Operational Requirement
IMPLICIT – Inferred dependency
PFC NavFilter in PFC Subsystem
EXACTLY 1 of 1 Consumer Port AND
ATLEAST 1 of 1 Requires Port
Implies
EXACTLY 1 of 2 ADIRU Subsystems AND
ATLEAST 1 of 2 GPS Subsystems
Component Operational Requirement
IMPLICIT – Inferred dependency
PFC Processor inside PFC Subsystem
EXACTLY 1 of 1 Consumer Port
Implies
EXACTLY 1 of 1 PFC NavFilter
GPS Processor inside GPS Subsystem
EXACTLY 1 of 1 Consumer Port
Implies
EXACTLY 1 of 1 GPS Receiver
Modeling the problem:
Boolean SAT
Functional Requirements + Function allocation +
Component operational requirements + Component states
Encoded as Boolean (CNF) Expression for SATisfiability
problem
Solution: valid component architecture
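As an illustration of the CNF encoding, an "ATLEAST k of n" operational requirement can be expressed naively by requiring every subset of n-k+1 components to contain at least one operational one: if fewer than k are operational, some such subset is entirely failed and its clause is violated. Real tools use more compact cardinality encodings; the function below is a sketch.

```python
from itertools import combinations

def at_least_k(variables, k):
    """CNF clauses (lists of positive literals) encoding 'at least k of
    variables are true': one clause per subset of size n - k + 1."""
    n = len(variables)
    return [list(c) for c in combinations(variables, n - k + 1)]

accels = [f"Accel{i}" for i in range(1, 7)]
clauses = at_least_k(accels, 4)      # ATLEAST 4 of 6 accelerometers
assert len(clauses) == 20            # C(6, 3) subsets of size 3

# An assignment with only 3 operational accelerometers violates some clause:
working = {"Accel1", "Accel2", "Accel3"}
assert not all(any(v in working for v in clause) for clause in clauses)
```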
Size: #Variables: 493, #Clauses: 1776

Fault / Scenario: verifying initial state
SAT solver reconfiguration compute time: 0.004228 s
Reconfiguration commands: no commands; the initial state is accepted as meeting the functional requirements
Initial Configuration
Fault: ADIRU Accelerometer
Fault introduced, anomaly detected, fault source
component diagnosed, then:
Compute the new component architecture that satisfies
the functional requirements AND minimizes the number
of reconfiguration changes
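The reconfiguration step can be pictured as constrained search: among on/off assignments that satisfy the functional requirements using only healthy components, pick the one closest to the current configuration. The brute-force sketch below stands in for the SAT solver used in the actual system; the component names and the failure scenario are illustrative.

```python
from itertools import product

components = ["PrimaryADIRU", "SecondaryADIRU", "PrimaryGPS", "SecondaryGPS"]
healthy = {"SecondaryADIRU", "PrimaryGPS", "SecondaryGPS"}  # PrimaryADIRU failed
current = {"PrimaryADIRU": 1, "SecondaryADIRU": 0, "PrimaryGPS": 1, "SecondaryGPS": 0}

def satisfies(cfg):
    # EXACTLY ONE ADIRU subsystem and EXACTLY ONE GPS subsystem,
    # using only healthy components
    adiru = cfg["PrimaryADIRU"] + cfg["SecondaryADIRU"]
    gps = cfg["PrimaryGPS"] + cfg["SecondaryGPS"]
    only_healthy = all(cfg[c] == 0 or c in healthy for c in components)
    return only_healthy and adiru == 1 and gps == 1

# Minimize the number of on/off changes relative to the current configuration
best = min(
    (dict(zip(components, bits)) for bits in product([0, 1], repeat=len(components))
     if satisfies(dict(zip(components, bits)))),
    key=lambda cfg: sum(cfg[c] != current[c] for c in components),
)
# Result: stop the primary ADIRU, start the secondary, keep the primary GPS
assert best == {"PrimaryADIRU": 0, "SecondaryADIRU": 1,
                "PrimaryGPS": 1, "SecondaryGPS": 0}
```

The difference between `best` and `current` directly yields the STOP/START command list, as in the tables that follow.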
Fault / Scenario: Primary_ADIRU_Subsystem_Accelerometer6
SAT solver reconfiguration compute time: 0.002989 s
Reconfiguration commands: STOP Primary_ADIRU_Subsystem_Accelerometer6

Fault / Scenario: Primary_ADIRU_Subsystem_Accelerometer5
SAT solver reconfiguration compute time: 0.003151 s
Reconfiguration commands: STOP Primary_ADIRU_Subsystem_Accelerometer5
Primary ADIRU Subsystem
Partial fault – Primary still functional
Fault / Scenario: Primary_ADIRU_Subsystem_Accelerometer4
SAT solver reconfiguration compute time: 0.020825 s
Reconfiguration commands:
STOP Primary_ADIRU_Subsystem_Accelerometer4
STOP Primary_ADIRU_Subsystem (stop all accelerometers, ADIRU computers, and voters in the primary ADIRU subsystem)
START Secondary_ADIRU_Subsystem (start all accelerometers, ADIRU computers, and voters in the secondary ADIRU subsystem)
ADIRU Accelerometer Fault
(contd.)
Third fault: failover to the secondary ADIRU
Primary ADIRU Subsystem
Complete fault
Primary ADIRU Subsystem Faulty
Failover to secondary ADIRU
GPS Fault
Fault / Scenario: Primary_GPS_Subsystem_GPSProcessor
SAT solver reconfiguration compute time: 0.004720 s
Reconfiguration commands:
STOP Primary_GPS_Subsystem (stop GPS receiver, GPS processor)
START Secondary_GPS_Subsystem (start GPS receiver, GPS processor)
Primary GPS Subsystem Faulty
Reconfiguration after
Primary GPS Subsystem becomes faulty
PFC NavFilter Faults
Fault / Scenario: Left_PFC_Subsystem_PFCNavFilter
SAT solver reconfiguration compute time: 0.003107 s
Reconfiguration commands: STOP Left_PFC_Subsystem (stop PFCNavFilter, PFC processor)

Fault / Scenario: Right_PFC_Subsystem_PFCNavFilter
SAT solver reconfiguration compute time: 0.003089 s
Reconfiguration commands: STOP Right_PFC_Subsystem (stop PFCNavFilter, PFC processor)
Left PFC NavFilter Faulty
Right PFC NavFilter Faulty
Research challenges
Modeling and engineering of r-CPS
Modeling paradigm / verification paradigm / synthesis
Verify recoverability under all scenarios
Efficient recovery
Analytics:
Comparing architectures and solutions
Resilience against…
Cascading, cross-domain faults
Cyber attacks, possibly combined with physical faults
Engineering process
‘Simian army’ or systematic design?
Principles of multi-layer resilience