Resilience in Cyber-Physical Systems:
Challenges and Opportunities
Gabor Karsai
Institute for Software-Integrated Systems
Vanderbilt University
SERENE 2014 – Autumn School
Acknowledgements
People: Janos Sztipanovits, Daniel Balasubramanian,
Abhishek Dubey, Tihamer Levendovszky, Nag Mahadevan,
and many others at the Institute for Software-Integrated
Systems @ Vanderbilt University
Sponsors: AFRL, DARPA, NASA, NSF through various
programs
Outline
Introduction
Cyber-physical Systems
Resilience
Building resilient CPS
System-level fault diagnostics
Software health management
Resilient architectures and autonomy
Conclusions
Cyber-Physical Systems
What is a Cyber-Physical System?
An engineered system that integrates physical and cyber
components where relevant functions are realized
through the interactions between the physical and cyber
parts.
Physical = some tangible, physical device + environment
Cyber = computational + communicational
Courtesy of Kuka Robotics Corp.
Cyber-Physical Systems (CPS): Integrating networked computational
resources with physical systems
Courtesy of Doug Schmidt
Power generation and distribution
Courtesy of General Electric
Military systems:
E-Corner, Siemens
Transportation (Air traffic control at SFO) Avionics
Telecommunications
Factory automation
Instrumentation (Soleil Synchrotron)
Daimler-Chrysler
Automotive
Building Systems
Courtesy of Ed Lee, UCB
CPS Examples
CPS Challenge Problem: Prevent This
A Typical Cyber-Physical System
Printing Press
• Application aspects
• local (control)
• distributed (coordination)
• global (modes)
• Ethernet network
• Synchronous, Time-Triggered
• IEEE 1588 time-sync protocol
• High-speed, high precision
• Speed: 1 inch/ms (~100km/hr)
• Precision: 0.01 inch
-> Time accuracy: 10 µs (Bosch-Rexroth)
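The 10 µs figure follows directly from the stated speed and precision; a quick sketch of the arithmetic (variable names are ours):

```python
# Web speed: 1 inch/ms; required print precision: 0.01 inch.
# The time the web takes to travel one precision unit bounds the
# allowable clock disagreement between the distributed controllers.
speed_inch_per_ms = 1.0            # 1 inch/ms (~91 km/h, i.e. ~100 km/hr)
precision_inch = 0.01

accuracy_ms = precision_inch / speed_inch_per_ms   # 0.01 ms
accuracy_us = accuracy_ms * 1000                   # 10 us
print(accuracy_us)
```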
Courtesy of Ed Lee, UCB
Source: http://offsetpressman.blogspot.com/2011/03/how-flying-paster-works.html
Example – Flying Paster
Courtesy of Ed Lee, UCB
[Figure: flying paster mechanism. Labels: sensor (top dead center), active paper feed, paper cutter, idle rollers, flying paster, drive rollers, dancer.]
Source: http://offsetpressman.blogspot.com/2011/03/how-flying-paster-works.html
Flying Paster
Courtesy of Ed Lee, UCB
Example: Medical Devices Emerging direction: Cell phone based medical devices for affordable healthcare e.g. “Telemicroscopy” project at Berkeley
e.g. Cell-phone based blood testing device developed at UCLA
Courtesy of Ed Lee, UCB
Example: Toyota autonomous vehicle technology roadmap, c. 2007
Source: Toyota Web site
DARPA Robotics Challenge
http://www.theroboticschallenge.org/
The Good News…
Rich time models
Precise interactions across highly extended spatial/temporal dimension
Flexible, dynamic communication mechanisms
Precise time-variant, nonlinear behavior
Introspection, learning, reasoning
Elaborate coordination of physical processes
Hugely increased system size with controllable, stable behavior
Dynamic, adaptive architectures
Adaptive, autonomic systems
Self monitoring, self-healing system architectures and better safety/security guarantees.
Computing/Communication Integrated CPS
Networking and computing delivers unique precision and flexibility in
interaction and coordination
…and the Challenges
Cyber vulnerability
New types of interactions across highly extended spatial/temporal dimensions
Flexible, dynamic communication mechanisms
Precise time-variant, nonlinear behavior
Introspection, learning, reasoning
Physical behavior of systems can be manipulated
Lack of composition theories for heterogeneous systems: many unsolved problems
Vastly increased complexity and emergent behaviors
Lack of theoretical foundations for CPS dynamics
Verification, certification, and predictability face fundamentally new challenges.
Computing/Communication Integrated CPS
Fusing networking and computing with physical processes brings new
problems
Abstraction layers allow the verification of different properties.
Key Idea: Manage design complexity by creating abstraction
layers in the design flow.
Abstraction layers define
platforms.
Physical Platform
Software Platform
Computation/Communication Platform
Abstractions are linked
through mapping.
Claire Tomlin, UC Berkeley
Example for a CPS Approach
Software models
Real-time system models
implementation correctness:
timing analysis (P)
Sifakis et al: “Building Models of Real-Time
Systems from Application Software,”
Proceedings of the IEEE Vol. 91, No. 1. pp.
100-111, January 2003
E(f(in), f_R(out))
f : T_In → 2^T_Out
f_R : T_In × R → 2^(T_Out × R)
E(f, f_R, P)
• f : reactive program. Program execution creates a mapping between logical-time inputs and outputs.
• f_R : real-time system. Programs are packaged into interacting components. A scheduler controls access to computational and communication resources according to the time constraints P.
In CPS, essential system properties
such as stability, safety,
performance are expressed in
terms of physical behavior
Abstraction layers and models:
Real-time Software
Physical models
p : T_In × R → 2^(T_Out × R)
f_R : T_In × R → 2^(T_Out × R)
Software models
Real-time system models
implementation correctness:
timing analysis (P)
E(f(in), f_R(out))
f : T_In → 2^T_Out
f_R : T_In × R → 2^(T_Out × R)
E(f, f_R, P)
Re-defined Goals:
• Compositional verification of
essential dynamic properties
− stability
− safety
• Derive dynamics offering
robustness against
implementation changes,
uncertainties caused by faults
and cyber attacks
− fault/intrusion induced reconfiguration of SW/HW
− network uncertainties (packet drops, delays)
• Decreased verification
complexity
Abstraction layers and models:
Cyber-Physical Systems
Why is CPS Hard?
package org.apache.tomcat.session;
import org.apache.tomcat.core.*;
import org.apache.tomcat.util.StringManager;
import java.io.*;
import java.net.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;
/**
* Core implementation of a server session
*
* @author James Duncan Davidson [[email protected]]
* @author James Todd [[email protected]]
*/
public class ServerSession {
private StringManager sm =
StringManager.getManager("org.apache.tomcat.session");
private Hashtable values = new Hashtable();
private Hashtable appSessions = new Hashtable();
private String id;
private long creationTime = System.currentTimeMillis();
private long thisAccessTime = creationTime;
private long lastAccessed = creationTime;
private int inactiveInterval = -1;
ServerSession(String id) {
this.id = id;
}
public String getId() {
return id;
}
public long getCreationTime() {
return creationTime;
}
public long getLastAccessedTime() {
return lastAccessed;
}
public ApplicationSession getApplicationSession(Context context,
boolean create) {
ApplicationSession appSession =
(ApplicationSession)appSessions.get(context);
if (appSession == null && create) {
// XXX
// sync to ensure valid?
appSession = new ApplicationSession(id, this, context);
appSessions.put(context, appSession);
}
// XXX
// make sure that we haven't gone over the end of our
// inactive interval -- if so, invalidate and create
// a new appSession
return appSession;
}
void removeApplicationSession(Context context) {
appSessions.remove(context);
}
/**
* Called by context when request comes in so that accesses and
* inactivities can be dealt with accordingly.
*/
void accessed() {
// set last accessed to thisAccessTime as it will be left over
// from the previous access
lastAccessed = thisAccessTime;
thisAccessTime = System.currentTimeMillis();
}
void validate()
Software Control Systems
Crosses Interdisciplinary Boundaries
• Disciplinary boundaries need to be realigned
• New fundamentals need to be created
• New technologies and tools need to be developed
• Education needs to be restructured
Resilience
Cyber-Physical Systems:
Software Intensive Systems
Embedded software ….
is a crucial ingredient in modern
systems
is the ‘universal system integrator’
could exhibit faults that lead to
system failures
complexity has progressed to the
point that zero-defect systems
(containing both hardware and
software) are very difficult to build
need to evolve while in operation
The challenge is to build software intensive systems that
anticipate change: uncertain environments, faults, updates, and
exhibit resilience: they survive and adapt to changes, while
being dependably functional.
Resilience
Webster:
Capable of withstanding shock without permanent deformation or
rupture
Tending to recover from or adjust easily to misfortune or change.
Technical:
The persistence of the avoidance of failures that are unacceptably
frequent or severe, when facing changes. [Laprie, ‘04]
A resilient system is trusted and effective out of the box in a wide
range of contexts, and easily adapted to many others through
reconfiguration or replacement. [R. Neches, OSD]
A resilient system detects anomalies in itself, diagnoses
their causes, and is able to recover lost functionality.
Research issues
•Model-driven engineering of Resilient Software Systems
•Design-time + Run-time aspects
•Resilience to: (1) faults, (2) environmental changes, (3) updates
•Target system category: distributed, real-time, embedded systems
Objective: Model-based
engineering approach
and tools to build
verifiably resilient
systems
The definition…
Building Resilient Cyber-Physical
Systems
Cyber-Physical Systems
Layers and Interactions
[Figure: the physical layer contains physical and cyber-physical objects with physical interactions; computational objects interact on computational and communication platforms in the platform layer; layers are linked by refinement/compilation downward and abstraction upward, and implementations map software onto platforms. Faults can arise in any layer.]
Cyber-Physical Systems
Faults and resilience
In CPS faults can appear in (and cascade to) any place
Physical system
Hardware (computing and communication) system
Software (application and platform) system
In CPS physical and cyber elements are integrated
Many interaction pathways: P2P, P2C, C2C, P2C2P, C2P2P2C
Many modeling paradigms for physical systems
Consider engineering or physics!
Heterogeneous models need to be integrated for detailed analysis
In CPS recovery can take many forms
Physical action
Cyber restart
Software adaptation
CPS and Model-based Design
Design of CPS layers via MDE
Software models
Platform models
Physical models
Challenge: How to integrate the models so that cross-domain
interactions can be understood and managed?
A Strategy for Resilient CPS
Overall scheme:
Faults can originate in any layer of a hierarchy, in any component
Anomalies caused by the fault can be detected in the same or a higher layer
Based on anomalies a fault source isolation (diagnosis) is performed. The diagnosis result may be reported to a higher layer, depending on the nature of the fault.
The fault is locally mitigated first, but when that mitigation fails the higher layer is informed about the anomaly, the diagnosed fault, and the mitigation action taken.
High-level view: Fault management is a control problem. Faults are disturbances in the system whose effects prevent the system from providing the required service/s
Anomalies are the sensory inputs, mitigation actions are the actuators of the fault management system
Fault mitigation must happen by considering (1) the current functional goals and (2) the actual state of the system, on the right level of abstraction
A Strategy for Resilient CPS
Layered fault management
Concepts:
1. Faults propagate to neighboring layers via
guaranteed behaviors
2. Each layer includes pro-active and reactive fault
management mechanisms
Each layer provides a ‘fault reporting’ and
‘fault management’ interface
Fault management services are built into the
‘middleware’:
Temporal/spatial Isolation
Fault Tolerant Clock Sync
Time-triggered Communications
Group Communication and Transactions
Fault-tolerant Resource Sharing
Component/Service Migration
Primary/Backup
Replication
Autonomous Failure Management
Safe Dynamic Composition of Components
System-level Fault Diagnostics
The need for resilience
In complex systems even simple
failures lead to complex cascades of
events that are difficult to understand
and manage.
How to
•detect and isolate faults?
•react to faults to mitigate their effect?
FACT: A model-driven toolsuite for system-level diagnostics
Run-time Platform (RTOS)
Visual modeling tool for creating:
•System architecture models
•Timed failure propagation graph models
Modular run-time environment contains:
•Monitors detect anomalies in sensor data and track mode changes
•TFPG Diagnostics Engine performs diagnosis and isolates the source(s) of observed anomalies
•Reports are generated for operators and maintainers
Modules can be used standalone on an embedded target processor with an RTOS
[Figure: monitors feed the TFPG diagnostics engine, which reports to the operator and maintainer.]
Modeling Language
Temporal Failure Propagation Graphs
•Failure modes
•Discrepancies
•Monitors/Alarms
•Propagation links with:
•Time delay
•Mode
Fault model:
Known physical failure modes whose functional effects (discrepancies) are
monitored.
Diagnostic problem:
Given a set of active monitors and their temporal activation sequence,
which failure mode(s) explain the observations?
A causal network-like model describing how component failure effects propagate across the system, activating monitors.
Failure propagation links
and monitors could be
mode-dependent.
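The graph structure described above can be sketched as a small data model; class and field names are illustrative only, not the FACT toolsuite's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                       # "FM" (failure mode) or "D" (discrepancy)
    monitored: bool = False         # monitored discrepancies raise alarms
    and_node: bool = False          # AND discrepancy: all parents must propagate

@dataclass
class Edge:
    src: str
    dst: str
    tmin: float = 0.0               # earliest propagation delay
    tmax: float = float("inf")      # latest propagation delay
    modes: frozenset = frozenset()  # empty = active in every mode

@dataclass
class TFPG:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, n: Node):
        self.nodes[n.name] = n

    def link(self, src, dst, tmin=0.0, tmax=float("inf"), modes=()):
        self.edges.append(Edge(src, dst, tmin, tmax, frozenset(modes)))

    def active_edges(self, mode):
        """Edges whose propagation is enabled in the given system mode."""
        return [e for e in self.edges if not e.modes or mode in e.modes]

# Tiny example: a pump failure mode propagates to a monitored
# low-pressure discrepancy within 1..5 s, but only in mode "RUN".
g = TFPG()
g.add_node(Node("FM_pump", "FM"))
g.add_node(Node("D_low_pressure", "D", monitored=True))
g.link("FM_pump", "D_low_pressure", tmin=1.0, tmax=5.0, modes=("RUN",))
```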
Modeling Language
Temporal Failure Propagation Graphs
Modeling variants
•Untimed, causal network (no modes, propagation = [0..inf])
•Modal networks: edges are mode dependent
•Timed models
•Hierarchical component models
Nodes:
•Failure modes
•Discrepancies
• AND/OR
• Monitored (option)
Edges:
•Propagation delay: [min, max]
•Discrete Modes (activation)
Example models (#components,#failuremodes,#alarms)
•Trivial examples
•Simplified fuel system (~30,~80,~100)
•Realistic fuel system (~200,~400,~600)
•Aircraft avionics (~2000,~8000,~25000) – generated
TFPG Example
Example
Not shown:
- Timing on propagation links
- Components/hierarchy
- Modal propagation
TFPG captures cause-effect
relationships that can be modal
and temporal. Effects may be
cumulative and/or monitored.
Timed Failure Propagation Graphs
Causal models that describe the system behavior in presence of faults.
Model is a labeled directed graph where Nodes represent either failure modes or discrepancies
Edges between nodes in the graph represent causality
Edges are attributed with timing and mode constraints on failure propagation.
A discrepancy can be either monitored or unmonitored.
The monitor detects a sensory manifestation of an anomaly and generates alarms.
[Figure: failure cascade example. Legend: allocation, failure modes, discrepancies, propagation links, alarms.]
TFPG Example
Rocket Engine
TFPG Example
Propellant Tank
TFPG Example Turbo Pump
TFPG Example
Combustion Chamber
TFPG Reasoning
On-line diagnostics:
Input: Sequence of alarms and mode changes
Output: Sequence of sorted and ranked hypotheses containing failure mode(s)
that explain the observations (alarms, mode changes)
TFPG Hypothesis
TFPG Hypothesis: estimation of the current system state.
Directly, points to failure modes that “best” explain the
current set of observed alarms.
Indirectly, points to failed monitored discrepancies; those
with a state that is inconsistent with the (hypothesized) state
of the failure modes
Structure
List of possible Failure Modes
List of alarms in each set (Consistent (C) / Inconsistent (IC) / Missing (M) / Expected (E))
Metrics : Plausibility/ Robustness/ Failure Rate
Hypotheses Evaluation Metrics
Hypotheses are evaluated based on the following
metrics:
Plausibility: reflects the support of a hypothesis based on the
current observed alarm state. It answers the question: Which
hypothesis to consider?
Robustness: reflects the potential of a hypothesis (evidence)
to change based on remaining alarms. It answers the question:
When to take an action?
Failure Rate: is a measure of how often a particular failure
mode has occurred in the past.
Run-time System
Diagnostics Engine
Algorithm outline:
Check if new evidence is explained by
current hypotheses.
If not, create a new hypothesis that
assumes a hypothetical state of the
system consistent with observations
Rank hypotheses for plausibility and
robustness
Discard low-rank hypotheses, keep
plausible ones
Fault state: ‘total state vector’ of the
system, i.e. all failure modes and
discrepancies
Alarms could be
Missing: should have fired but did not
Inconsistent: fired, but it is not consistent
with the hypothesis
Robust diagnostics: tolerates missing and
inconsistent alarms
Metrics:
Plausibility: how plausible is the
hypothesis w.r.t alarm consistency
Robustness: how likely is that the
hypothesis will change in the future
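The hypothesis-evaluation idea can be sketched for the untimed, single-mode case; the plausibility formula below is a simplified stand-in for the actual metric, and all names are illustrative:

```python
def reachable(edges, sources):
    """All nodes reachable from the hypothesized failure modes."""
    seen, stack = set(sources), list(sources)
    while stack:
        n = stack.pop()
        for s, d in edges:
            if s == n and d not in seen:
                seen.add(d)
                stack.append(d)
    return seen

def evaluate(edges, monitored, hypothesis, fired):
    """Classify each monitored discrepancy w.r.t. a failure-mode set."""
    explained = reachable(edges, hypothesis)
    consistent = {a for a in fired if a in explained}
    inconsistent = fired - consistent             # fired, but not explained
    missing = (monitored & explained) - fired     # expected, did not fire
    # Simplified plausibility: fraction of relevant alarms consistent.
    relevant = len(consistent) + len(inconsistent) + len(missing)
    plausibility = len(consistent) / relevant if relevant else 1.0
    return consistent, inconsistent, missing, plausibility

edges = [("FM1", "D1"), ("D1", "D2"), ("FM2", "D3")]
monitored = {"D1", "D2", "D3"}
# Alarms D1 and D2 fired: hypothesis {FM1} explains both, {FM2} none.
_, _, _, p1 = evaluate(edges, monitored, {"FM1"}, {"D1", "D2"})
_, _, _, p2 = evaluate(edges, monitored, {"FM2"}, {"D1", "D2"})
```

Ranking the hypotheses by this score and discarding the low-rank ones mirrors the algorithm outline above.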
Run-time System
Diagnostics Engine
Novel properties:
Multi-fault hypothesis is the default
Fault state == State of all failure
modes/discrepancies
Reasoner works with sets of failure modes
(instead of individual failure modes)
Robust algorithm: can tolerate
missing/inconsistent alarms
Parsimony principle: Use simplest
explanation
Time-dependent diagnosis
Reasoner can be asked to recompute
diagnosis upon the advance of time
Extensions:
Modal edges: Propagation happens only if
edge is enabled (controlled by system
mode)
Diagnosis takes into consideration the last
propagation effect
Non-monotonic alarms:
Alarm retraction triggers a re-computation
of the diagnosis
Run-time System
Discrete (TFPG) Diagnostics
Additional capabilities:
Intermittent failure modes
Consequence: alarm/s change to ‘Off’
Assumption: low frequency intermittents
Upon alarm changing to ‘Off’, backtrack to
last change to ‘On’ and re-evaluate
Maintain alternate branches (for alarms ‘On’
and ‘Off’)
Test alarms: can be considered only
after activation
If inactive, it is an un-monitored
discrepancy.
If activated, it is used but timing may be
inconsistent (re: parent’s timing)
Metrics summary:
Plausibility and robustness (formulas given in the figure)
For n failure modes and m discrepancies, the maximum number of hypotheses is n^m, but in practice it is more likely O(n).
Updating a hypothesis is polynomial in the number of nodes and exponential w.r.t. sensor faults.
Model  #C   #FM  #D    #A   #M  #P    #R  Avg. Time (sec)
#1     15   36   48    21   0   120   1   0.000311
#2     11   36   120   174  27  3     1   0.000445
#3     153  481  1973  270  9   3409  1   0.013589
#4     24   64   116   116  0   695   4   0.016
#5     21   100  282   69   0   431   18  0.00288
• Key: #C – Components / #FM – Failure Modes / #D – Discrepancies / #A – Alarms / #M – Modes / #P – Propagation links / #R – Regions
• Avg. Time = average computational time taken by the reasoner (in seconds) after every event, on a 2.67 GHz Intel Xeon CPU with 8 GB RAM.
Performance Evaluation
Tool Operations
1. Modeling
2. Desktop experimentation, validation
3. Feedback
4. Deployment on embedded platform (via model interpretation)
Software Health-Management
Motivation: Software as Failure Source?
Qantas 72 - Oct 7, 2008 – A330 (Australia) – ATSB Report
At 1240:28, while the aircraft was cruising at 37,000 ft, the autopilot disconnected. From about
the same time there were various aircraft system failure indications. At 1242:27, while the crew was evaluating the situation, the aircraft abruptly pitched nose-down. The aircraft reached a maximum pitch angle of about 8.4 degrees nose-down, and descended 650 ft during the event. After returning the aircraft to 37,000 ft, the crew commenced actions to deal with multiple failure messages. At 1245:08, the aircraft commenced a second uncommanded pitch-down event. The aircraft reached a maximum pitch angle of about 3.5 degrees nose-down, and descended about 400 ft during this second event. At 1249, the crew made a PAN urgency broadcast to air traffic control, and requested a clearance to divert to and track direct to Learmonth. At 1254, after receiving advice from the cabin of several serious injuries, the crew declared a MAYDAY. The aircraft subsequently landed at Learmonth at 1350.
The investigation to date has identified two significant safety factors related to the pitch-down movements. Firstly, immediately prior to the autopilot disconnect, one of the air data inertial reference units (ADIRUs) started providing erroneous data (spikes) on many parameters to other aircraft systems. The other two ADIRUs continued to function correctly. Secondly, some of the spikes in angle of attack data were not filtered by the flight control computers, and the computers subsequently commanded the pitch-down movements.
http://www.atsb.gov.au/publications/investigation_reports/2008/AAIR/pdf/AO2008070_interim.pdf
Understanding the Problem
Embedded software is a complex engineering artifact that can have latent
faults, uncaught by testing and verification. Such faults become apparent
during operation when unforeseen modes and/or (system) faults appear.
The problem:
General: How to construct a Software Health Management system that
detects such faults, isolates their source/s, prognosticates their progression,
and takes mitigation actions in the system context?
Specific: How to specify, design, and implement such a system using a model-
based framework?
The larger picture:
General: Software Health Management must be integrated with System
Health Management – ‘Software Health Effects’ must be understood on the
System (Vehicle) Level.
What is ‘Systems Health Management’ ?
The ‘on-line’ view:
1. Detection of anomalies in system or component behavior
2. Identification and isolation of the fault source/s
3. Prognostication of impending faults that could lead to system failures
4. Mitigation of current or impending fault effects while preserving mission objective/s
Detection
Isolation
Prognostics
Mitigation
Observations Corrections
Reports
Examples:
- Automotive OBD (detection)
- Boeing 777 CMC (detection + isolation)
- Spacecraft fault protection (detection + isolation + mitigation)
- Aircraft fleet (detection + isolation + prognostics)
Software Health Management
Software is a complex
engineering artifact.
Software can have latent faults.
Faults appear during operation
when unforeseen modes or
interactions happen.
Techniques like Voting and Self-
Checking pairs have
shortcomings
Common mode faults
Fault cascades
• SHM is the extension of FDIR
techniques used in Physical systems to
Software.
[Figure: SHM as an FDIR loop. Fault detection compares observed inputs and observed behavior against expected stimuli/responses under environmental and domain assumptions; detection feeds fault isolation and fault mitigation.]
Why ‘Software Health Management’?
Complexity of systems necessitates an additional layer ‘above’ SFT that manages ‘Software Health’
Embedded software …. is a crucial ingredient in aerospace systems
is a method for implementing functionality
is the ‘universal system integrator’
could exhibit faults that lead to system failures
complexity has progressed to the point that zero-defect systems (containing both hardware and software) are very difficult to build
Systems Health Management is an emerging field that addresses precisely this problem: How to manage systems’ health in case of faults ?
‘Software Health Management’ is not…
A replacement for existing and robust engineering processes and standards
A substitute for hardware- and software fault tolerance
An ‘ultimate’ solution for fault tolerance
Software Health Management
Key ideas
Use software components as units of fault management: detection, diagnosis,
and mitigation
Components must be observable, provide fault isolation, and be capable of mitigation
Use a two-level architecture:
Component level: detect anomalies and mitigate locally
System level: receive anomaly reports, isolate the faulty component(s), and mitigate on the component
Use models to represent
anomalous conditions
fault cascades
mitigation actions (when / what)
Use model-based generators to synthesize code artifacts
Developer can use higher-level abstractions to design and implement the
software health management functions of a system
Software Component Framework
The Component Model should enable:
Monitoring
Interfaces (synchronous/asynchronous calls)
Component state
Scheduling and timing (WCET)
Resource usage
Anomaly Detection via:
Pre/post conditions over call parameters, rates, and component state
Conditions over (1) timing properties, (2) resource usage (e.g. memory footprint), and (3)
usage patterns
Combinations of the above
Mitigation:
Given detected anomaly and state of the component take action
Can be time- or event-triggered
Actions: restart, initialize, block call, inject value, inject call, release resource, modify state;
checkpoint/restore, combination of the above
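The pre/post-condition and deadline monitoring above can be sketched as a wrapper around a component method; the class, condition, and anomaly names are hypothetical, not the actual framework API:

```python
import time

class MonitoredPort:
    """Wraps a component method with pre/post conditions and a deadline."""
    def __init__(self, func, pre=None, post=None, deadline_s=None):
        self.func, self.pre, self.post = func, pre, post
        self.deadline_s = deadline_s
        self.anomalies = []          # reported to the local health manager

    def __call__(self, *args):
        if self.pre and not self.pre(*args):
            self.anomalies.append("PRE_VIOLATION")
        start = time.monotonic()
        result = self.func(*args)
        elapsed = time.monotonic() - start
        if self.deadline_s is not None and elapsed > self.deadline_s:
            self.anomalies.append("DEADLINE_VIOLATION")
        if self.post and not self.post(result):
            self.anomalies.append("POST_VIOLATION")
        return result

# Usage: guard a scaling method with range conditions and a deadline.
port = MonitoredPort(lambda x: x * 0.5,
                     pre=lambda x: 0.0 <= x <= 100.0,
                     post=lambda r: r <= 50.0,
                     deadline_s=1.0)
port(10.0)     # nominal call
port(-5.0)     # out-of-range argument: raises a PRE_VIOLATION anomaly
```

In the actual framework the recorded anomalies would trigger the component-level mitigation actions listed above rather than just accumulate in a list.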
Notional Component Model
A component is a unit (containing potentially many objects). The component is parameterized, has state, consumes resources, publishes and subscribes to events, provides interfaces to other components, and requires interfaces from them.
Publish/Subscribe: Event-driven, asynchronous communication (publisher does not wait)
Required/Provided: Synchronous communication using call/return semantics.
Triggering can be periodic or sporadic.
Extension of a Component Model defined by OMG (CCM) : state, resource, trigger interfaces.
[Figure: notional component with publish/subscribe event ports, provided/required interfaces, state, parameters, and resource and trigger interfaces.]
Example: Component Interactions
Components can interact via asynchronous/event-triggered and synchronous/call-driven connections.
Example: The Sampler component is triggered periodically and it publishes an event upon each
activation. The GPS component subscribes to this event and is triggered sporadically to obtain
GPS data from the receiver, and when ready it publishes its own output event. The Display
component is triggered sporadically via this event and it uses a required interface to retrieve the
position data from the GPS component.
[Figure: Sampler component publishes (P) an event; the GPS component subscribes (S) to it and publishes its own event; the Display component subscribes (S) and calls back into the GPS component.]
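The Sampler/GPS/Display interaction pattern can be sketched with a toy in-process broker; all names are illustrative, and the synchronous dispatch is a simplification (the real framework delivers events asynchronously and sporadically/periodically per port):

```python
class Broker:
    """Minimal publish/subscribe dispatcher."""
    def __init__(self):
        self.subs = {}

    def subscribe(self, topic, callback):
        self.subs.setdefault(topic, []).append(callback)

    def publish(self, topic, data):
        for cb in self.subs.get(topic, []):
            cb(data)                 # asynchronous in the real framework

class GPS:
    def __init__(self, broker):
        self.position = None
        self.broker = broker
        broker.subscribe("tick", self.on_tick)

    def on_tick(self, _):
        self.position = (36.14, -86.80)   # read the receiver (stubbed)
        self.broker.publish("gps_ready", None)

    def get_position(self):               # provided interface
        return self.position

class Display:
    def __init__(self, broker, gps):
        self.shown = None
        self.gps = gps
        broker.subscribe("gps_ready", self.on_gps)

    def on_gps(self, _):
        # Required interface: synchronous call back into the GPS component.
        self.shown = self.gps.get_position()

broker = Broker()
gps = GPS(broker)
display = Display(broker, gps)
broker.publish("tick", None)    # stands in for the Sampler's periodic event
```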
Component Monitoring
Component
Monitor arriving events
Monitor incoming calls
Monitor published events
Monitor outgoing calls
Observe state
Monitor resource usage
Monitor control flow / triggering
ACM:
The ARINC Component Model
Provide a CCM-like layer on top of ARINC-653 abstractions
Notional model:
Terminology:
Synchronous: call/return
Asynchronous: publish-return / trigger-process
Periodic: time-triggered
Aperiodic: event-triggered
Note:
All component interactions are realized via the framework
Process (method) execution time has deadline, which is monitored
ACM:
The ARINC Component Model
Each ‘input interface’ has its own process
Process must obtain read-write/lock on component
Asynchronous publisher (subscriber) interface:
Listener (publisher) process
Pushes (receives) one event (a struct), with a validity flag
Can be event-triggered or time-triggered (i.e. 4 variations)
Synchronous provided (required) interface:
Handles incoming synchronous RMI call
Forwards outgoing synchronous RMI call
Other interfaces:
State: to observe component state variables
Resource: to monitor resource usage
Trigger: to monitor execution timing
ACM:
A Prototype Implementation
ARINC-653 Emulator
Emulates APEX services using Linux API-s
Partition → Process, Process → Thread
Module manager: schedules partition set
Partition level scheduler: schedules threads within partition
CORBA foundation
CCM Implementation
No modification
ACM component interactions
Mainly implemented via APEX
RMI interactions use threads
Implementation: Mapping ACM to APEX
APEX Abstraction   Platform (Linux)
Module             Host/Processor
Partition          Process
Process            Thread
ACM concept → APEX realization (APEX concepts used):
Component method:
• Periodic → periodic process (process start/stop, semaphores)
• Sporadic → aperiodic process
Invocation, synchronous call-return:
• Periodic target, co-located or non-co-located: N/A
• Sporadic target, co-located: caller method signals callee to release, then waits for callee until completion (event, blackboard)
• Sporadic target, non-co-located: caller method sends an RMI (via CM) to release the callee, then waits for the RMI to complete (TCP/IP, semaphore, event)
Invocation, asynchronous publish-subscribe:
• Periodic target, co-located: callee is periodically triggered and polls the event buffer; a validity flag indicates whether the data is stale or fresh (blackboard)
• Periodic target, non-co-located: as above (sampling port, channel)
• Sporadic target, co-located: callee is released when an event is available (blackboard, semaphore, event)
• Sporadic target, non-co-located: caller notifies via TCP/IP; callee is released upon receipt (queuing port, semaphore, event)
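The sporadic co-located call-return row can be sketched with release/completion events; Python threading primitives stand in for the APEX event and blackboard services, and all names are ours:

```python
import threading

class SporadicCallee:
    """Aperiodic 'provider' process released by a caller's signal."""
    def __init__(self):
        self.release = threading.Event()   # caller -> callee: go
        self.done = threading.Event()      # callee -> caller: finished
        self.result = None
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        self.release.wait()          # aperiodic process waits for release
        self.result = 42             # the provided method's work (stubbed)
        self.done.set()              # signal completion back to the caller

    def call(self, timeout=1.0):
        """Caller side: release the callee, then wait for completion."""
        self.release.set()
        if not self.done.wait(timeout):
            raise TimeoutError("callee missed its deadline")
        return self.result

callee = SporadicCallee()
value = callee.call()
```

The timeout on the wait corresponds to the deadline monitoring on process execution noted earlier.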
ACM:
Modeling Language
Modeling elements:
Data types: primitive, structs, vectors
Interfaces: methods with arguments
Components:
Publish/Subscribe ports (with data type)
Provided/Required interfaces (with i/f type)
Health Manager
Assemblies
Deployment
Modules, Partitions
Component Partition
Example: Sensor/GPS/Display
component Sensor {
  publishes SensorOutput data_out ;
};

component GPS {
  publishes SensorOutput data_out ; //APERIODIC
  consumes SensorOutput data_in ; //PERIODIC
  provides GPSDataSource gps_data_src ;
};

component NavDisplay {
  consumes SensorOutput data_in ; //APERIODIC
  uses GPSDataSource gps_data_source ;
};

struct SensorOutput {
  Timespec time ;
  SensorData data ;
};

struct SensorData {
  FLOATINGPOINT alpha ;
  FLOATINGPOINT beta ;
  FLOATINGPOINT gamma ;
};

struct Timespec {
  LONGLONG tv_sec ;
  LONGLONG tv_nsec ;
};

interface GPSDataSource {
  void getGPSData (out GPSData d);
};

[Figure: Sensor publishes data_out; GPS consumes it on data_in, reads/updates its state, and publishes its own data_out; NavDisplay consumes that event and invokes get on gps_data_src to read the GPSValue.]
Component   Port           Period     Time Capacity  Deadline
Sensor      data_out       4 sec      4 sec          Hard
GPS         data_out       aperiodic  4 sec          Hard
GPS         data_in        4 sec      4 sec          Hard
GPS         gps_data_src   aperiodic  4 sec          Hard
Navdisplay  data_in        aperiodic  4 sec          Hard
Navdisplay  gps_data_src   aperiodic  4 sec          Hard
Anomaly Detection
Model-based specification of monitoring expressions:
•Pre/post condition violations: threshold, rate, custom filter (moving average)
•Resource violations: deadline
•Validity violation: stale data on a consumer
•Concurrency violations: lock timeouts
•User code violations: reported error conditions from application code
Code generators synthesize code for implementing the monitors
Port monitors: arriving events, incoming calls, published events, outgoing calls
Non-port monitors: component state, resource usage, control flow / triggering
• Based on these local detections, each component developer can implement a local health manager
• It is a reactive timed state machine with pre-specified actions
• All alarms and actions are reported to the system health manager
ACM:
Modeling Language: Monitoring
Monitoring on component interfaces:
Subscriber port → ‘Subscriber process’ and Publisher port → ‘Publisher process’
Monitor: pre-conditions and post-conditions
On subscriber: data validity (‘age’ of data)
Deadline (hard / soft)
Provided interface → ‘Provider methods’ and Required interface → ‘Required methods’
Monitor: pre-conditions and post-conditions
Deadline (hard / soft)
Can be specified on a per-component basis
Monitoring language:
Simple, named expressions over input (output)
parameters, component state, delta(var), and
rate(var,dt). The expression yields a Boolean condition.
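A possible reading of the delta(var) and rate(var,dt) primitives, sketched over a sampled variable history; the evaluation scheme and names are our assumption, not the actual monitoring-language semantics:

```python
class VarHistory:
    """Timestamped samples of one monitored variable."""
    def __init__(self):
        self.samples = []            # (time, value) pairs

    def record(self, t, v):
        self.samples.append((t, v))

    def delta(self):
        """Change of the variable since the previous sample."""
        if len(self.samples) < 2:
            return 0.0
        return self.samples[-1][1] - self.samples[-2][1]

    def rate(self, dt):
        """Average change per dt over the last two samples."""
        if len(self.samples) < 2:
            return 0.0
        (t0, v0), (t1, v1) = self.samples[-2], self.samples[-1]
        return (v1 - v0) / (t1 - t0) * dt

h = VarHistory()
h.record(0.0, 10.0)
h.record(2.0, 14.0)
# A named Boolean condition, as in the modeling language:
rate_too_high = h.rate(1.0) > 1.0    # average rate is 2.0 per unit time
```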
Component-level Health Management
Manager’s behavioral model:
Finite-state machine
Triggers: monitored events, time
Actions: mitigation activities
Manager is local to component
container (for efficiency) but shall be
protected from the faults of functional
components
Notional behavior:
Track component state changes via
detected events and progression of
time
Take mitigation actions as needed
Design issues:
Co-location with component (fault containment)
Local detection may implicate another component
[Figure: the component framework hosts a monitor and a health manager; monitored events from the component drive the manager, which issues mitigation actions back on the component.]
Example state machine: states Idle and Exec; the start event moves Idle to Exec and finish moves back; a WCET timeout triggers the ‘init’ action, and an invA_violation (a violation on invocation port invA) triggers ‘reset’.
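The example machine above can be sketched as a table from (event, state) to (next state, action), following the reactive timed state machine idea; the transition set is our reading of the figure, and names are illustrative:

```python
# (event, state) -> (next_state, mitigation_action or None)
TRANSITIONS = {
    ("start", "Idle"): ("Exec", None),
    ("finish", "Exec"): ("Idle", None),
    ("timeout", "Exec"): ("Idle", "init"),          # WCET exceeded
    ("invA_violation", "Exec"): ("Idle", "reset"),  # bad invocation on invA
}

class ComponentHealthManager:
    def __init__(self):
        self.state = "Idle"
        self.actions = []            # mitigation actions taken (reported up)

    def dispatch(self, event):
        key = (event, self.state)
        if key not in TRANSITIONS:
            return                   # event is ignored in this state
        self.state, action = TRANSITIONS[key]
        if action:
            self.actions.append(action)

hm = ComponentHealthManager()
hm.dispatch("start")       # Idle -> Exec
hm.dispatch("timeout")     # WCET violation: back to Idle, run 'init'
```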
ACM - Modeling Language:
Component Health Manager
Reactive Timed State Machine
Event trigger:
Predefined conditions (e.g. deadline violation, data validity violation)
User-defined conditions (e.g. pre-condition violation)
Reaction: mitigation action (start, reset, refuse, ignore, etc.)
State: current state of the machine
(Event X State) Action
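The reactive (Event × State) → Action machine can be sketched as a transition table. The states and events below mirror the Idle/Exec example from the slides; the API itself is a hypothetical stand-in for the generated health-manager code.

```python
# Transition table: (state, event) -> (next_state, mitigation_action)
# State, event, and action names follow the slide's example; None = no action.
TRANSITIONS = {
    ("Idle", "start"):          ("Exec", "init"),
    ("Exec", "finish"):         ("Idle", None),
    ("Exec", "timeout"):        ("WCET_Violation", "abort"),   # deadline violation
    ("Exec", "invA_violation"): ("Idle", "reset"),             # pre-condition violation
}

class HealthManager:
    def __init__(self):
        self.state = "Idle"
        self.log = []  # alarms/actions reported to the system health manager

    def handle(self, event):
        nxt, action = TRANSITIONS.get((self.state, event), (self.state, None))
        self.state = nxt
        if action:
            self.log.append((event, action))
        return action

hm = HealthManager()
hm.handle("start")
assert hm.handle("invA_violation") == "reset"
assert hm.state == "Idle"
```

Keeping the behavior in a declarative table is what makes the generated manager amenable to model checking, as noted later in the deck.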
Component Health Management: Available Actions
[Diagram: component states NOMINAL, ERROR, CHECKRESULT, FAILURE; an error message triggers an HM action (ERROR → CHECKRESULT); a successful action returns to NOMINAL, while a timeout or failed action leads to FAILURE. The Component Health Manager runs as a high-priority ARINC-653 process; incoming events are queued in a buffer (blocking read), and HM responses are delivered to the component ports (ARINC-653 processes) via blackboards]
Architecture
Assembly Definition
Model-Based Software Health Management
The Sensor component is triggered periodically and it publishes an event upon each activation.
The GPS component subscribes to this event and is triggered periodically to obtain GPS data from the receiver. It publishes its own output event.
The Nav Display component is triggered sporadically via this event and it uses a required interface to retrieve the position data from the GPS component.
Specified monitoring conditions:
Validity(GPS.data_in) < 4 ms
Delta(Nav.data_in.time) > 0
Rate(gps_data_src.data) > 1
System-level Health Management
Focus issue: Cascading faults
Hypothesis: Fault effects cascade via component interactions
Anomalies detected at the component level are not 'diagnosed'; they may be caused by faults in other components
Problem:
How to model fault cascades?
How to diagnose and isolate fault cascade root causes?
How to mitigate fault cascades?
Recap: Fault diagnosis
Model: Timed Failure Propagation Graphs
Modeling variants
•Untimed, causal network (no modes, propagation = [0..inf])
•Modal networks: edges are mode dependent
•Timed models
•Hierarchical component models
Nodes:
•Failure modes
•Discrepancies
• AND/OR (combination)
• Monitored (option)
Edges:
•Propagation delay: [min, max]
•Discrete Modes (activation)
Example models (#components, #failure modes, #alarms)
•Trivial examples
•Simplified fuel system (~30,~80,~100)
•Realistic fuel system (~200,~400,~600)
•Aircraft avionics (~2000,~8000,~25000) – generated
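A minimal TFPG representation along these lines, with [min, max] propagation delays and mode-dependent edge activation, might look like the sketch below. Node and mode names are invented for illustration; a real diagnoser reasons over alarm timing, not just reachability.

```python
class TFPG:
    """Timed failure propagation graph sketch: nodes are failure modes or
    discrepancies; edges carry [tmin, tmax] delays and optional active modes."""

    def __init__(self):
        self.edges = {}  # src -> list of (dst, tmin, tmax, modes)

    def add_edge(self, src, dst, tmin, tmax, modes=None):
        self.edges.setdefault(src, []).append((dst, tmin, tmax, modes))

    def reachable(self, fm, mode, horizon):
        """Discrepancies reachable from failure mode fm within `horizon`
        time units, via edges active in the given mode."""
        seen, frontier = {fm: 0.0}, [fm]
        while frontier:
            n = frontier.pop()
            for dst, tmin, tmax, modes in self.edges.get(n, []):
                if modes is not None and mode not in modes:
                    continue  # edge not active in this mode
                t = seen[n] + tmin  # earliest possible arrival
                if t <= horizon and (dst not in seen or t < seen[dst]):
                    seen[dst] = t
                    frontier.append(dst)
        return {n for n in seen if n != fm}

g = TFPG()
g.add_edge("FM_pump", "D_low_pressure", 1, 5)
g.add_edge("D_low_pressure", "D_engine_stall", 10, 20, modes={"cruise"})
assert g.reachable("FM_pump", "cruise", horizon=30) == {"D_low_pressure", "D_engine_stall"}
assert g.reachable("FM_pump", "taxi", horizon=30) == {"D_low_pressure"}
```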
Recap: Fault diagnosis
• Outline:
– Check if new evidence is explained
by current hypotheses.
– If not, create a new hypothesis that
assumes a hypothetical state of the
system consistent with observations
– Rank hypotheses for plausibility and
robustness metrics
– Discard low-rank hypotheses, keep plausible ones
Fault state: the 'total state vector' of the system, i.e., all failure modes and discrepancies
Alarms could be
Missing: should have fired but did not
Inconsistent: fired, but it is not consistent
with the hypothesis
Robust diagnostics: tolerates missing and
inconsistent alarms
Metrics:
Plausibility: how plausible is the
hypothesis w.r.t. alarm consistency
Robustness: how likely it is that the hypothesis will change in the future
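One simple way to realize such a plausibility metric is a Jaccard-style score over predicted versus observed alarms, where both missing alarms (predicted but not fired) and inconsistent alarms (fired but not predicted) lower the score. This is an illustrative stand-in, not the actual diagnoser's metric.

```python
def plausibility(predicted, observed):
    """Score in [0, 1]: missing alarms (predicted - observed) and
    inconsistent alarms (observed - predicted) both enlarge the union
    and thus reduce the score."""
    predicted, observed = set(predicted), set(observed)
    union = predicted | observed
    return len(predicted & observed) / len(union) if union else 1.0

# Hypotheses map an assumed failure mode to the alarms it would explain
# (names are invented for illustration):
hyps = {
    "FM_pump":   {"D_low_pressure", "D_engine_stall"},
    "FM_sensor": {"D_low_pressure"},
}
observed = {"D_low_pressure", "D_engine_stall"}
ranked = sorted(hyps, key=lambda h: plausibility(hyps[h], observed), reverse=True)
assert ranked[0] == "FM_pump"
```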
Modeling Cascading Faults
Not needed - the cascades can be computed from the
component assemblies, if the anomaly types and their
interactions are known.
Component 'elements': every component method belongs to one of seven element types
Fault cascades within a component follow recurring patterns (a few of the 38 patterns are shown)
Modeling Cascading Faults
Inter-component propagation is regular – always follows the same pattern
Intra-component propagation depends on the component! Need to model internal dataflow and control flow of the component.
Note: Could be determined via source code analysis.
Modeling Cascading Faults
Fault Propagation Graph for GPS Example
Here: hand-crafted, but it is generated automatically in the
system
System-level Fault Mitigation
Model-based system-level mitigation engine:
Model-based diagnoser is automatically generated
Designer specifies fault mitigation
strategies using a reactive state machine
[Diagram: component platform hosting managed components, each with a component health manager (CHM) and a component fault model of failure modes (FM) and discrepancies (D); alarms feed the Diagnoser Engine, which drives the Mitigation Engine]
Advantages:
• Models are higher-level
programs to specify
(potentially complex)
behavior – more readable and
comprehensible
•Models lend themselves to
formal analysis – e.g. model
checking
System-level Fault
Mitigation
Model-based mitigation specification at
two levels
Component level: quick action
System level: Reactive action taking the
system state into consideration
System designer specifies them as a
parallel timed state machine.
Fixed set of mitigation actions are
available
Runtime code is generated from
models
HM Action Semantics
CLHM IGNORE: continue as if nothing has happened
CLHM ABORT: discontinue the current operation; the operation can run again
CLHM USE PAST DATA: use the most recent data (only for operations that expect fresh data)
CLHM STOP: discontinue the current operation. Aperiodic processes (ports): the operation can run again. Periodic processes (ports): the operation must be enabled by a future START HM action
CLHM START: re-enable a STOP-ped periodic operation
CLHM RESTART: a macro for STOP followed by START for the current operation
SLHM RESET: stop all operations, initialize the component state, clear all queues, start all periodic operations
SLHM STOP: stop all operations
List of predefined Mitigation Actions
System-level Health Management
Functional components
1. Aggregator:
Integrates (collates) health information coming
from components (typically in one hyperperiod)
2. Diagnoser:
Performs fault diagnosis, based on the fault
propagation graph model
Ranks hypotheses
The component that appears in all of the highest-ranked hypotheses is chosen for mitigation
3. Response Engine:
Issues mitigation actions to components based
on diagnosis results
Based on a state machine model that maps
diagnostic results to mitigation actions
These components are generated
automatically from the models
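The selection rule above, picking the component implicated by all highest-ranked hypotheses, can be sketched as follows; the helper function and the hypothesis data are illustrative assumptions.

```python
def mitigation_targets(ranked_hypotheses):
    """ranked_hypotheses: list of (rank, set_of_suspect_components).
    Returns the components common to all hypotheses of the highest rank."""
    top = max(rank for rank, _ in ranked_hypotheses)
    top_sets = [suspects for rank, suspects in ranked_hypotheses if rank == top]
    return set.intersection(*top_sets)

# Two equally plausible hypotheses implicate GPS; a weaker one implicates Sensor:
hyps = [
    (0.9, {"GPS", "NavDisplay"}),
    (0.9, {"GPS"}),
    (0.4, {"Sensor"}),
]
assert mitigation_targets(hyps) == {"GPS"}
```

The Response Engine would then feed this target set into the mitigation state machine.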
The Health Management Approach:
1. Locally detected anomalies are mitigated
locally first. – Quick reactive response.
2. Anomalies and local mitigation actions are
reported to the system level.
3. Aggregated reports are subjected to
diagnosis, potentially followed by a system-
level mitigation action.
4. System-level response commands are
propagated down to components.
Example: 2005 Malaysia Airlines Boeing 777 in-flight upset
Low airspeed advisory.
Airplane’s autopilot experienced excessive acceleration values.
Vertical acceleration decreased to -2.3g within ½ second
Lateral acceleration decreased to -1.01g (left) within ½ second
Longitudinal acceleration increased to +1.2 g within ½ second
Autopilot pitched the nose up to 17.6 degrees and the aircraft climbed at a vertical speed
of 10,650 fpm.
Airspeed reduced to 241 knots.
Stick shaker activated at top of the climb.
Aircraft descended 4,000 ft.
Re-engagement of autopilot followed by another climb of 2,000 ft.
Maximum rate of climb = 4440 fpm.
B777 ADIRU Architecture
• Designed to be serviceable with
one fault in each FCA
• Can fly but maintenance
required upon landing with two
faults in each FCA
• Each ARINC 629 end unit voted
on the processor data bit-by-
bit.
• Processors monitor the ARINC 629 modules by full data wrap-around
• Processors also monitor the
power supplies, any one of
which can power the entire unit
• Accelerometer and gyro in
skewed redundant configuration
• A SAARU (Secondary Attitude Air data Reference Unit) also provided inertial data
Based on Air Data Inertial Reference Unit (ADIRU)
Architecture (ATSB, 2007, p.5)
Cause of Inflight Upset
June 2001: accelerometer 5 fails with a high output value; the ADIRU disregards it.
A power cycle of the ADIRU occurs. Due to a latent software bug, the recorded faulty status
of accelerometer 5 is disregarded.
The status of the failed unit was recorded in on-board maintenance memory, but that memory was
not checked by the software.
An inflight fault was recorded in accelerometer 6 and it was disregarded.
FDI software allowed use of accelerometer 5.
High acceleration value was passed to all computers.
Due to common-mode nature of fault, voters allowed high accelerometer data to
go on all channels.
This high value was used by primary flight computer.
The mid-value select function used by the flight computer lessened the effect of the pitch
motion.
Problem: the system relied on redundancy to mask a fault, but due to the latent software bug and the common-mode fault, the effect cascaded into a system failure.
Reading material: 'The dangers of failure masking in fault-tolerant software: aspects of a recent in-flight upset event', C.W. Johnson and C.M. Holloway, IET Conf. Pub. 2007, 60 (2007), DOI:10.1049/cp:20070442
Case Study
• Modeled the architecture as a software component assembly
• Created the fault scenario
• Only modeled part of the system
to illustrate the point of SHM
• Accelerometers are arranged on six faces of a dodecahedron; the redundant, skewed measurements are combined via regression equations
ADIRU Assembly (Accelerometers)
Runs at 20 Hz
ADIRU Assembly (Processors)
Observer tracks the age
of accelerometer data.
Specified as timed state
machine (with timeout)
Runs at 20 Hz
ADIRU Assembly (Voters)
Runs at 20 Hz
ADIRU Assembly (Display- Mimics PFC)
Runs aperiodically
Deployment Model
Each Module is a processor running the
ARINC Component Runtime Environment
Execution
Accelerometers: machine durip02
ADIRU computers: machine durip03
Voters + display computer: machine durip06
SHM: machine durip09
System Health Manager
These components are auto-generated.
The hypothesis generated by the diagnoser is translated into the component(s) most likely to be faulty. This list is fed to the Response Engine, which triggers the mitigation state machine.
The other machines have similar specifications.
Demonstration
Fault Scenario
Accelerometer 5 has an initial fault
When it is started, it causes an alarm
Then accelerometer 6 develops a fault
Successful mitigation:
Identifying the faulty components
Stopping the faulty components
Processors can still function with four accelerometers.
Demonstration: Faulty Scenario
Resilient architectures and
autonomy
Resilience and autonomy
Model-based Software Health Management
Requires explicit specification of component-level and system-level health management (recovery) actions
Complex and error-prone… too many options!
Resilient systems should recover autonomously
Concepts:
Model the system architecture + functions.
Express what is needed from the system to implement functions.
Embed models into the run-time system
Use a reasoner to figure out how to recover function upon failures
Modeling
Functional Requirements for IMU
Inertial Position
• Determine inertial position.
• Functional Requirement (AND)
GPS Position
Position Tracking
GPS Position
• Sense GPS position for computing
Inertial Position
Position Tracking
• Continuously track position to compute
Inertial position
• Functional Requirement
Body Acceleration Measurement
Body Acceleration Measurement
• Sense body acceleration for Position
Tracking.
[Function decomposition diagram: Inertial Position = AND(GPS Position, Position Tracking); Position Tracking requires Body Acceleration Measurement]
Modeling
Complete Redundant Architecture
Modeling the Architecture
Function Allocation
Body Acceleration Measurement: EXACTLY ONE of (Primary / Secondary ADIRU Subsystem)
ADIRU Subsystem has
• Accelerometers (6)
• ADIRU Computers (4)
• Voters (3)
Functional / Operational ADIRU Subsystem
requires
• ATLEAST 4 of 6 Accelerometers
• ATLEAST 2 of 4 Filters or ADIRU
computers
• ATLEAST 1 of 3 Voters
Inside one ADIRU:
Modeling the Architecture
Function Allocation
GPS Position: EXACTLY ONE of (Primary / Secondary GPS Subsystem)
GPS Subsystem includes
GPS Receiver (1)
GPS Processor (1)
Functional / Operational GPS
subsystem requires
EXACTLY ONE of GPS Receiver
EXACTLY ONE of GPS Processor
Inside one GPS Subsystem:
Modeling the Architecture
Function Allocation
Position Tracking: ATLEAST ONE of (Left / Center / Right PFC NavFilter Subsystem)
PFC NavFilter Subsystem includes
PFC Nav Filter (1)
PFC Processor (1)
Functional/ Operational Requirement
for PFC Subsystem
EXACTLY ONE PFC NavFilter
EXACTLY ONE PFC Processor
Inside one PFC Subsystem:
Component Operational Requirement
EXPLICIT – Local dependency
Display Subsystem
ATLEAST 1 of 3 Consumers (Left, Center, Right)
EXPLICIT – Local dependency
ADIRU Computer inside ADIRU Subsystem
ATLEAST 4 of 6 Consumer Port
Implies
ATLEAST 4 of 6 Accelerometer Components
Component Operational Requirement
IMPLICIT – Inferred dependency
PFC NavFilter in PFC Subsystem
EXACTLY 1 of 1 Consumer Port AND
ATLEAST 1 of 1 Requires Port
Implies
EXACTLY 1 of 2 ADIRU Subsystems AND
ATLEAST 1 of 2 GPS Subsystems
Component Operational Requirement
IMPLICIT – Inferred dependency
PFC Processor inside PFC Subsystem
EXACTLY 1 of 1 Consumer Port
Implies
EXACTLY 1 of 1 PFC NavFilter
GPS Processor inside GPS Subsystem
EXACTLY 1 of 1 Consumer Port
Implies
EXACTLY 1 of 1 GPS Receiver
Modeling the problem:
Boolean SAT
Functional Requirements + Function allocation +
Component operational requirements + Component states
Encoded as Boolean (CNF) Expression for SATisfiability
problem
Solution: valid component architecture
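As an illustration of the CNF encoding, an "ATLEAST k of n" operational requirement can be expressed naively by requiring every subset of n-k+1 components to contain at least one operational one: if fewer than k are operational, some such subset is entirely failed and its clause is violated. Real tools use more compact cardinality encodings; the function below is a sketch.

```python
from itertools import combinations

def at_least_k(variables, k):
    """CNF clauses (lists of positive literals) encoding 'at least k of
    variables are true': one clause per subset of size n - k + 1."""
    n = len(variables)
    return [list(c) for c in combinations(variables, n - k + 1)]

accels = [f"Accel{i}" for i in range(1, 7)]
clauses = at_least_k(accels, 4)      # ATLEAST 4 of 6 accelerometers
assert len(clauses) == 20            # C(6, 3) subsets of size 3

# An assignment with only 3 operational accelerometers violates some clause:
working = {"Accel1", "Accel2", "Accel3"}
assert not all(any(v in working for v in clause) for clause in clauses)
```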
Size: #Variables: 493, #Clauses: 1776

Fault / Scenario: verifying initial state
SAT solver reconfiguration compute time: 0.004228 s
Reconfiguration commands: no commands; the initial state is accepted as meeting the functional requirements
Initial Configuration
Fault: ADIRU Accelerometer
Fault introduced, anomaly detected, fault source
component diagnosed, then:
Compute the new component architecture that satisfies
the functional requirements AND minimizes the number
of reconfiguration changes
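The reconfiguration step can be pictured as constrained search: among on/off assignments that satisfy the functional requirements using only healthy components, pick the one closest to the current configuration. The brute-force sketch below stands in for the SAT solver used in the actual system; the component names and the failure scenario are illustrative.

```python
from itertools import product

components = ["PrimaryADIRU", "SecondaryADIRU", "PrimaryGPS", "SecondaryGPS"]
healthy = {"SecondaryADIRU", "PrimaryGPS", "SecondaryGPS"}  # PrimaryADIRU failed
current = {"PrimaryADIRU": 1, "SecondaryADIRU": 0, "PrimaryGPS": 1, "SecondaryGPS": 0}

def satisfies(cfg):
    # EXACTLY ONE ADIRU subsystem and EXACTLY ONE GPS subsystem,
    # using only healthy components
    adiru = cfg["PrimaryADIRU"] + cfg["SecondaryADIRU"]
    gps = cfg["PrimaryGPS"] + cfg["SecondaryGPS"]
    only_healthy = all(cfg[c] == 0 or c in healthy for c in components)
    return only_healthy and adiru == 1 and gps == 1

# Minimize the number of on/off changes relative to the current configuration
best = min(
    (dict(zip(components, bits)) for bits in product([0, 1], repeat=len(components))
     if satisfies(dict(zip(components, bits)))),
    key=lambda cfg: sum(cfg[c] != current[c] for c in components),
)
# Result: stop the primary ADIRU, start the secondary, keep the primary GPS
assert best == {"PrimaryADIRU": 0, "SecondaryADIRU": 1,
                "PrimaryGPS": 1, "SecondaryGPS": 0}
```

The difference between `best` and `current` directly yields the STOP/START command list, as in the tables that follow.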
Fault / Scenario: Primary_ADIRU_Subsystem_Accelerometer6
SAT solver reconfiguration compute time: 0.002989 s
Reconfiguration commands: STOP Primary_ADIRU_Subsystem_Accelerometer6

Fault / Scenario: Primary_ADIRU_Subsystem_Accelerometer5
SAT solver reconfiguration compute time: 0.003151 s
Reconfiguration commands: STOP Primary_ADIRU_Subsystem_Accelerometer5
Primary ADIRU Subsystem
Partial fault – Primary still functional
Fault / Scenario: Primary_ADIRU_Subsystem_Accelerometer4
SAT solver reconfiguration compute time: 0.020825 s
Reconfiguration commands:
STOP Primary_ADIRU_Subsystem_Accelerometer4
STOP Primary_ADIRU_Subsystem (stop all accelerometers, ADIRU computers, and voters in the primary ADIRU subsystem)
START Secondary_ADIRU_Subsystem (start all accelerometers, ADIRU computers, and voters in the secondary ADIRU subsystem)
ADIRU Accelerometer Fault
(contd.)
Third fault: failover to the secondary ADIRU
Primary ADIRU Subsystem
Complete fault
Primary ADIRU Subsystem Faulty
Failover to secondary ADIRU
GPS Fault
Fault / Scenario: Primary_GPS_Subsystem_GPSProcessor
SAT solver reconfiguration compute time: 0.004720 s
Reconfiguration commands:
STOP Primary_GPS_Subsystem (stop GPS receiver, GPS processor)
START Secondary_GPS_Subsystem (start GPS receiver, GPS processor)
Primary GPS Subsystem Faulty
Reconfiguration after
Primary GPS Subsystem becomes faulty
PFC NavFilter Faults
Fault / Scenario: Left_PFC_Subsystem_PFCNavFilter
SAT solver reconfiguration compute time: 0.003107 s
Reconfiguration commands: STOP Left_PFC_Subsystem (stop PFCNavFilter, PFC processor)

Fault / Scenario: Right_PFC_Subsystem_PFCNavFilter
SAT solver reconfiguration compute time: 0.003089 s
Reconfiguration commands: STOP Right_PFC_Subsystem (stop PFCNavFilter, PFC processor)
Left PFC NavFilter Faulty
Right PFC NavFilter Faulty
Research challenges
Modeling and engineering of r-CPS
Modeling paradigm / verification paradigm / synthesis
Verify recoverability under all scenarios
Efficient recovery
Analytics:
Comparing architectures and solutions
Resilience against…
Cascading, cross-domain faults
Cyber attacks, possibly combined with physical faults
Engineering process
‘Simian army’ or systematic design?
Principles of multi-layer resilience