CEC 450 Real-Time Systems
Lecture 13 – High Availability and Reliability for Mission Critical Systems
Sam Siewert – November 10, 2020
High Level Concepts
Build the system from high-quality components (certified testing and durability, software coverage criteria, etc.)
No SPOFs! - No Single Point of Failure in the system
Redundancy, Recovery, and Fail-Safe on Double Faults
Triple Faults (loss of asset) are unlikely given Single-Fault Recovery and Fail-Safe protection
– Double fault must occur before the single fault is cleared – a small time window, so the likelihood is the product of the individual fault probabilities (see the sketch below)
– Double fault initiates fail-safe
– E.g. aircraft avionics can use the “Side-B” component or sub-system when the “Side-A” primary fails (single-fault recovery)
– For a double fault, the same component or sub-system must fail twice on the same flight
– On a double fault, the aircraft lands immediately at the nearest alternate airfield
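As a rough illustration of the product-of-probabilities argument, here is a minimal C sketch; the failure rate and recovery window below are assumed example numbers, not figures from the lecture:

    /* Sketch: why a double fault is far less likely than a single fault.
     * The per-hour fault probability and recovery window are assumed. */
    #include <stdio.h>

    int main(void)
    {
        double p_single = 1.0e-5;  /* assumed probability of a fault per flight hour */
        double window_hr = 0.1;    /* assumed time to clear a single fault (6 min) */

        /* A second, independent fault must land inside the recovery window */
        double p_double = p_single * (p_single * window_hr);

        printf("P(single fault per hour) = %.1e\n", p_single);
        printf("P(double fault per hour) = %.1e\n", p_double);
        return 0;
    }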
RASM
Reliability
– High-Quality Components (Unit Test)
– Redundancy
    Dual String
    Cross Strapping
– Stress Testing
Availability
– Five 9’s – Available for Service 99.999% of the time
– Can be down 5 minutes and 16 seconds per year
– Recovery – Quick
Serviceability
– Field Debug
– Field Component/Subsystem Replacement
Maintainability
– Field Upgrade
– Remote Firmware/Software Upgrade
Reliability
Quality Units
– Hardware
    EMI/EMC, Thermal Cycling, ESD, Radiation
– Firmware
    Hardware/Software Integration
    Hardware Fault Injection and Simulation
– Software
    Rigorous Test Criteria – Test Exit Requirements
    White-box Unit Tests
      – Full path coverage testing
      – Statement coverage
      – Multiple Condition / Decision coverage (illustrated below)
    Black-box Unit and System Tests
      – Test Drivers and Test Harnesses
      – Feature Point and Function Testing
      – Stress Testing (Extremes, Corner Cases, Duration)
    System-Level Tests
    Regression Testing of Units and System
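To make the coverage criteria concrete, here is a small C sketch with a hypothetical guard function (not from the lecture): a single passing call achieves statement coverage, while condition/decision coverage additionally requires tests showing each condition independently flipping the decision:

    /* Hypothetical guard used to contrast white-box coverage criteria. */
    #include <assert.h>
    #include <stdbool.h>

    bool deploy_chute(bool armed, double altitude_m, double velocity_mps)
    {
        /* One decision built from three conditions */
        return armed && (altitude_m < 3000.0) && (velocity_mps > 50.0);
    }

    int main(void)
    {
        /* Statement coverage: this one call executes every statement */
        assert( deploy_chute(true,  2000.0, 60.0));
        /* Condition/decision coverage: flip each condition alone and
         * show that the decision outcome changes */
        assert(!deploy_chute(false, 2000.0, 60.0));  /* 'armed' matters   */
        assert(!deploy_chute(true,  4000.0, 60.0));  /* altitude matters  */
        assert(!deploy_chute(true,  2000.0, 40.0));  /* velocity matters  */
        return 0;
    }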
Reliability and Recovery
Redundancy
– Dual String
    Side A, Side B
    Pilot, Co-Pilot
– Fail-Over
    Fault Detection, Protection, Recovery
– Backup System
    Independent Design, e.g. a Backup Flight System
– Cross Strapping of Sides (diagram, table, and enumeration sketch below)
    Dual String A & B
    3 Components: C1, C2, C3
    8 Possible Configurations
    4 Component Switches
    A|B Select Switch
[Figure: dual-string cross-strap diagram – components C1, C2, C3 duplicated on Side A and Side B, joined by component switches SW1–SW4 and an A|B select switch]

Configuration   C1   C2   C3
     1           A    A    A
     2           A    A    B
     3           A    B    A
     4           A    B    B
     5           B    A    A
     6           B    A    B
     7           B    B    A
     8           B    B    B
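A minimal C sketch (illustrative, not from the lecture) showing how the 2^3 = 8 cross-strap configurations can be enumerated as a bitmask, matching the table above:

    /* Enumerate cross-strap configurations: each of 3 components can be
     * sourced from string A (bit = 0) or string B (bit = 1). */
    #include <stdio.h>

    #define NUM_COMPONENTS 3

    int main(void)
    {
        for (unsigned cfg = 0; cfg < (1u << NUM_COMPONENTS); cfg++) {
            printf("configuration %u: ", cfg + 1);
            for (int c = 0; c < NUM_COMPONENTS; c++)
                printf("C%d=%c ", c + 1,
                       ((cfg >> (NUM_COMPONENTS - 1 - c)) & 1u) ? 'B' : 'A');
            printf("\n");
        }
        return 0;
    }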
High Availability
Service Up-time is the Figure of Merit
– Number of times down?
– How long down?
– Quick recovery is key
Hot or Warm Spare Equipment Protection (see the fail-over sketch below)
– Fault Detection and Fail-Over Without Service Outage
– Excess Capacity
    E.g. Diverse Routing in a Network
    Overlapping Coverage in Cell Phone Systems
    On-orbit Spare Satellites
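A minimal C sketch of hot-spare fail-over driven by missed heartbeats; the heartbeat source, miss tolerance, and promotion action are all assumptions for illustration:

    /* Sketch: promote a hot spare when the primary misses heartbeats.
     * A real monitor would poll the primary over a health-check channel. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_MISSED 3   /* assumed tolerance before declaring a fault */

    /* Stand-in heartbeat: the primary answers 5 times, then goes silent */
    static bool check_heartbeat(void)
    {
        static int beats = 5;
        return beats-- > 0;
    }

    int main(void)
    {
        int missed = 0;
        while (missed < MAX_MISSED)
            missed = check_heartbeat() ? 0 : missed + 1;
        printf("primary declared failed: promoting hot spare\n");
        return 0;
    }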
Availability vs. Reliability
Are They the Same?
– Are all Reliable Systems Highly Available?
– Are all Highly Available Systems Reliable?
Reliability = Long MTBF
– Mean Time Between Failures
– FDIR When Failures Do Occur
    Fault Detection, Isolation, and Recovery
    Safing
    MTTR (Mean Time to Recover)
Availability = MTBF / (MTBF + MTTR) = % Uptime
– MTBF = 8,766 hours (525,960 minutes)
– MTTR = 5 minutes
– Availability = 525,960 / (525,960 + 5) = 99.999% Uptime (worked check below)
Negative Tests – HA Design Criteria
– Five 9’s – Out of Service no more than about 5 minutes per year [Up 99.999% of the Time]
– Availability = MTTF / [MTTF + MTTR], where MTTF ≈ MTBF
– E.g. 0.99999 ≈ (365.25 x 24 x 60 minutes) / [(365.25 x 24 x 60) + 5 minutes]
– MTBF = Mean Time Between Failures, MTTR = Mean Time to Recovery
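A short C check of the arithmetic above, using only the values given on the slide:

    /* Worked check of the five-nines availability arithmetic. */
    #include <stdio.h>

    int main(void)
    {
        double year_min = 365.25 * 24.0 * 60.0;  /* 525,960 minutes/year */
        double mttr_min = 5.0;                   /* 5-minute recovery */

        double availability = year_min / (year_min + mttr_min);
        printf("availability = %.5f%%\n", availability * 100.0);

        /* Five-nines budget: 0.001% of the year out of service */
        double budget_min = (1.0 - 0.99999) * year_min;
        printf("five-nines downtime budget = %.2f min/year\n", budget_min);
        return 0;
    }

This prints roughly 99.99905% availability and a 5.26-minute (5 min 16 s) annual downtime budget, matching the figures on the RASM slide.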
See also: “Big Iron Lessons, Part 2: Reliability and Availability – What’s the Difference?” – applies RAS architecture lessons to the autonomic self-CHOP roadmap.
Availability
– Time System is Available for Use vs. Time System is Unavailable for Use
– Ratio of Time Available over Total Time (Both Available and Not Available)
– % of Time System is Available for Use Each Year (Or Other Arbitrary Period of Time)
Some Variation in Definitions of MTTF, MTBF, MTTR
– MTBF = MTTF + MTTR [In Book]
– MTTF is Simply Time Between Failures (Observed in Accelerated Lifetime / Stress Tests)
– MTBF is Often Stated – Does it Really Include MTTR or Not?
– Typically Recovery Time is Small, So Negligible; But if Long, the Difference is More Important
– Just Be Clear on Definitions [They Can Vary]
[Figure: failure/recovery timeline]

    available      recovery      available      recovery
    -------------|/////////|-----------------|/////////|-->
             Failure #1                   Failure #2
                 |<- MTTR ->|<---- MTTF ---->|
                 |<---------- MTBF --------->|
Serviceability
Field Anomaly Diagnosis
– Trace capabilities (see the ring-buffer sketch below)
– Debug capabilities
– Error detection and prediction
– Performance monitoring
– Ease of Component / Sub-system Replacement
– Diagnostics and Isolation
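One common (assumed here, not specified in the lecture) form of field trace capability is an in-memory ring buffer that keeps the most recent events for dumping after an anomaly:

    /* Sketch: trace ring buffer for field debug; event IDs are illustrative. */
    #include <stdio.h>
    #include <time.h>

    #define TRACE_DEPTH 8

    struct trace_entry { time_t when; int event_id; };

    static struct trace_entry trace_buf[TRACE_DEPTH];
    static unsigned trace_head;

    static void trace_event(int event_id)
    {
        trace_buf[trace_head % TRACE_DEPTH] =
            (struct trace_entry){ time(NULL), event_id };
        trace_head++;
    }

    static void trace_dump(void)   /* called on anomaly for diagnosis */
    {
        unsigned n = trace_head < TRACE_DEPTH ? trace_head : TRACE_DEPTH;
        for (unsigned i = trace_head - n; i < trace_head; i++)
            printf("event %d at %ld\n",
                   trace_buf[i % TRACE_DEPTH].event_id,
                   (long)trace_buf[i % TRACE_DEPTH].when);
    }

    int main(void)
    {
        for (int i = 0; i < 12; i++) trace_event(i);
        trace_dump();   /* shows only the 8 most recent events */
        return 0;
    }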
Maintainability
Field Upgrade Capabilities and Limitations
Make Before Break
– Can I upgrade while the Unit remains in service?
– Can I test the upgrade prior to service switch-over?
– Can I fall back to the old configuration on failure of the new one? (see the sketch below)
– E.g. Upload new Boot Code; if it fails, a Rolling Reboot causes Fail-Over to the Old Image
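A minimal sketch of that fall-back behavior, assuming an A/B image layout and a boot-attempt counter (both assumptions, not the lecture's design):

    /* Sketch: A/B boot-image selection with fall-back after failed boots. */
    #include <stdio.h>

    enum image { IMAGE_OLD, IMAGE_NEW };

    #define MAX_BOOT_ATTEMPTS 3   /* assumed rolling-reboot limit */

    static enum image select_boot_image(unsigned new_image_failures)
    {
        /* Each failed boot of the new image bumps the counter; past the
         * limit, the loader falls back to the proven old image. */
        return (new_image_failures < MAX_BOOT_ATTEMPTS) ? IMAGE_NEW : IMAGE_OLD;
    }

    int main(void)
    {
        for (unsigned fails = 0; fails <= MAX_BOOT_ATTEMPTS; fails++)
            printf("failures=%u -> boot %s image\n", fails,
                   select_boot_image(fails) == IMAGE_NEW ? "new" : "old");
        return 0;
    }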
Component, Unit, Sub-System Reliability
Stress Testing
– Hardware
    Thermal Cycling
    Vacuum Chamber
    Radiation Dosing
    Mechanical Shock and Vibe Tests
– Firmware/Software
    Hardware sub-system or component failure
    Repeated Testing for Long Duration
      – Memory Leaks
      – Race Conditions
      – Priority Inversions
      – Deadlock
    Monte Carlo Analysis
Monte Carlo Analysis
Repeatable Pseudo-Random Sequence (see the sketch below)
– Can Re-run a Failure Case with the Same Seed
– Nominal Value +/- Dispersion
    Dispersion is a Random Variation
    Uniform or Normal Distribution
– Drive Timing Variations
– Drive Input Variations
– Drive Simulation Variations
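A small C sketch of seeded, repeatable dispersion around a nominal value; the nominal service time, dispersion bound, and seed are illustrative assumptions:

    /* Sketch: repeatable Monte Carlo dispersion around a nominal value.
     * Logging the seed lets a failure case be re-run exactly. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        unsigned seed = 12345;        /* record this to reproduce a run */
        srand(seed);

        double nominal_ms = 10.0;     /* assumed nominal service time */
        double dispersion_ms = 2.0;   /* assumed +/- bound, uniform */

        for (int trial = 0; trial < 5; trial++) {
            double u = (double)rand() / RAND_MAX;             /* [0,1] */
            double t = nominal_ms + (2.0 * u - 1.0) * dispersion_ms;
            printf("trial %d: service time %.3f ms (seed %u)\n",
                   trial, t, seed);
        }
        return 0;
    }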
Beyond RASM? Autonomic Architectures
Four Goals (self-CHOP)
– Self-Configuring
– Self-Healing
– Self-Optimizing
– Self-Protecting
Eight Features
1. “Self Knowing” – Status at All Levels, BIST, Used by Peers and in Hierarchy
2. “Configurable/Reconfigurable” – FPGA, Fusing, Cross Strapping
3. “Self Optimizing” – Built In PMU, Trace, SPA
4. “Self Healing” – Recovery, Fusing, Excess Capacity, Redundancy, Autonomous Repair
5. “Self Protecting” – Virus, Security, Data Integrity
6. “Discovery” – Finding and Identifying Resources
7. “Interoperable” – Component, Subsystem, System
8. “Self Managing” – Low or No Admin (Zero Uplink)
Motivation for Autonomic
Moore’s Law Reaching Limit?
– GHz of CPU Cycles, Multiple CPUs
– Gbits/sec of Bandwidth
– GBytes of Memory
– Terabytes of Storage
Ability to Manage These Resources Reaching Limit?
Maintenance Bigger Issue than Capacity, Throughput, and Bandwidth
Comparable to the Evolution of Telecom
– Party Lines (Users Manage and Arbitrate for Resources Directly)
– Operators for Exchanges Configure Connections (Centralized Management)
– Switching Systems (Automated Operations, Service Provisioning)
Number of Administrators/Operators Per System?
– N to 1 (Mainframes and Supercomputers)
– 1 to 1 (Minicomputers, File Servers, Workstations, PCs)
– 1 to N (Clusters, GRID Computing, RAID)
Autonomic Characteristics
Examples of Autonomic Features Today (see the watchdog sketch after this list)
– CPU
    Watchdog timer for software sanity
    Software recovery for a missed service deadline
    eFuse in-circuit enable/disable for post-manufacture reconfiguration
    PMU for on-chip performance monitoring
– IO
    Sensor Fusion and Sensor/Actuator Cross Strapping
    Diverse routing through networks
    CRC, timeouts, re-transmission
– Memory
    EDAC memory
– Storage
    RAID5 volume rebuild on drive failure
    RAID6 dual-failure handling
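A minimal C sketch of the software-watchdog pattern from the CPU bullet above; the miss tolerance and recovery action are assumptions, and a real watchdog would kick a hardware register rather than a counter:

    /* Sketch: software watchdog for service sanity. */
    #include <stdio.h>

    #define WATCHDOG_LIMIT 3   /* assumed missed-check-in tolerance */

    static int watchdog_count;

    static void kick_watchdog(void) { watchdog_count = 0; }

    static void watchdog_tick(void)   /* driven by a periodic timer */
    {
        if (++watchdog_count > WATCHDOG_LIMIT) {
            printf("service missed deadline: initiating recovery\n");
            watchdog_count = 0;   /* real system: restart task or reset */
        }
    }

    int main(void)
    {
        kick_watchdog();              /* healthy service checks in */
        for (int t = 0; t < 5; t++)   /* then hangs; watchdog fires */
            watchdog_tick();
        return 0;
    }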
Embedded Autonomic?
– Are Autonomic Principles System-Level Only?
– Do They Apply to Embedded Systems?
– To SoC?
– To Enterprise Systems?
Some Web Links
Mission-Critical SW/HW Component Examples
– Intel Wind River Systems, Overview
– Xilinx TMR, TMRTool
– TI, Hercules Safety MCUs
– Concurrent, RedHawk Linux
RASM (Enterprise Systems)
– Oracle, Sun RAS
– IBM, P-series RAS
– IBM, Z-series RAS
– MS, HA and DR for SQL Server
– Cisco, HA Overview
Autonomic Theory
– IEEE ICAC 2019
– IBM Research, Autonomic Computing