CEC 450 Real-Time Systems
Lecture 13 – High Availability and Reliability for Mission Critical Systems
Sam Siewert – November 10, 2020
High Level Concepts
Build the system from high-quality components (certified testing and durability, software coverage criteria, etc.)
No SPOFs! - No Single Point of Failure in the system
Redundancy, Recovery, and Fail-Safe on Double Faults
Triple Faults (loss of asset) are unlikely given Single-Fault Recovery and Fail-Safe protection
– Double fault must occur before the single fault is cleared – a small time window, so the likelihood is the product of the individual fault probabilities (see the sketch below)
– Double fault initiates fail-safe
– E.g. aircraft avionics can use the “Side-B” component or sub-system when the “Side-A” primary fails (single-fault recovery)
– For a double fault, the same component or sub-system must fail twice on the same flight
– On a double fault, the aircraft lands immediately at the nearest alternate airfield
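As a rough illustration of the product-of-probabilities argument, here is a minimal C sketch; the failure rate and recovery window below are assumed example numbers, not figures from the lecture:

    /* Sketch: why a double fault is far less likely than a single fault.
     * The per-hour fault probability and recovery window are assumed. */
    #include <stdio.h>

    int main(void)
    {
        double p_single = 1.0e-5;  /* assumed probability of a fault per flight hour */
        double window_hr = 0.1;    /* assumed time to clear a single fault (6 min) */

        /* A second, independent fault must land inside the recovery window */
        double p_double = p_single * (p_single * window_hr);

        printf("P(single fault per hour) = %.1e\n", p_single);
        printf("P(double fault per hour) = %.1e\n", p_double);
        return 0;
    }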
RASM
Reliability
– High-Quality Components (Unit Test)
– Redundancy
    Dual String
    Cross Strapping
– Stress Testing
Availability
– Five 9’s – Available for Service 99.999% of the time
– Can be down 5 minutes and 16 seconds per year
– Recovery – Quick
Serviceability
– Field Debug
– Field Component/Subsystem Replacement
Maintainability
– Field Upgrade
– Remote Firmware/Software Upgrade
Reliability
Quality Units
– Hardware
    EMI/EMC, Thermal Cycling, ESD, Radiation
– Firmware
    Hardware/Software Integration
    Hardware Fault Injection and Simulation
– Software
    Rigorous Test Criteria – Test Exit Requirements
    White-box Unit Tests
      – Full path coverage testing
      – Statement coverage
      – Multiple Condition / Decision coverage (illustrated below)
    Black-box Unit and System Tests
      – Test Drivers and Test Harnesses
      – Feature Point and Function Testing
      – Stress Testing (Extremes, Corner Cases, Duration)
    System-Level Tests
    Regression Testing of Units and System
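To make the coverage criteria concrete, here is a small C sketch with a hypothetical guard function (not from the lecture): a single passing call achieves statement coverage, while condition/decision coverage additionally requires tests showing each condition independently flipping the decision:

    /* Hypothetical guard used to contrast white-box coverage criteria. */
    #include <assert.h>
    #include <stdbool.h>

    bool deploy_chute(bool armed, double altitude_m, double velocity_mps)
    {
        /* One decision built from three conditions */
        return armed && (altitude_m < 3000.0) && (velocity_mps > 50.0);
    }

    int main(void)
    {
        /* Statement coverage: this one call executes every statement */
        assert( deploy_chute(true,  2000.0, 60.0));
        /* Condition/decision coverage: flip each condition alone and
         * show that the decision outcome changes */
        assert(!deploy_chute(false, 2000.0, 60.0));  /* 'armed' matters   */
        assert(!deploy_chute(true,  4000.0, 60.0));  /* altitude matters  */
        assert(!deploy_chute(true,  2000.0, 40.0));  /* velocity matters  */
        return 0;
    }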
Reliability and Recovery
Redundancy
– Dual String
    Side A, Side B
    Pilot, Co-Pilot
– Fail-Over
    Fault Detection, Protection, Recovery
– Backup System
    Independent Design, e.g. a Backup Flight System
– Cross Strapping of Sides (diagram, table, and enumeration sketch below)
    Dual String A & B
    3 Components: C1, C2, C3
    8 Possible Configurations
    4 Component Switches
    A|B Select Switch
[Figure: dual-string cross-strap diagram – components C1, C2, C3 duplicated on Side A and Side B, joined by component switches SW1–SW4 and an A|B select switch]

Configuration   C1   C2   C3
     1           A    A    A
     2           A    A    B
     3           A    B    A
     4           A    B    B
     5           B    A    A
     6           B    A    B
     7           B    B    A
     8           B    B    B
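A minimal C sketch (illustrative, not from the lecture) showing how the 2^3 = 8 cross-strap configurations can be enumerated as a bitmask, matching the table above:

    /* Enumerate cross-strap configurations: each of 3 components can be
     * sourced from string A (bit = 0) or string B (bit = 1). */
    #include <stdio.h>

    #define NUM_COMPONENTS 3

    int main(void)
    {
        for (unsigned cfg = 0; cfg < (1u << NUM_COMPONENTS); cfg++) {
            printf("configuration %u: ", cfg + 1);
            for (int c = 0; c < NUM_COMPONENTS; c++)
                printf("C%d=%c ", c + 1,
                       ((cfg >> (NUM_COMPONENTS - 1 - c)) & 1u) ? 'B' : 'A');
            printf("\n");
        }
        return 0;
    }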
High Availability
Service Up-time is the Figure of Merit
– Number of times down?
– How long down?
– Quick recovery is key
Hot or Warm Spare Equipment Protection (see the fail-over sketch below)
– Fault Detection and Fail-Over Without Service Outage
– Excess Capacity
    E.g. Diverse Routing in a Network
    Overlapping Coverage in Cell Phone Systems
    On-orbit Spare Satellites
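A minimal C sketch of hot-spare fail-over driven by missed heartbeats; the heartbeat source, miss tolerance, and promotion action are all assumptions for illustration:

    /* Sketch: promote a hot spare when the primary misses heartbeats.
     * A real monitor would poll the primary over a health-check channel. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_MISSED 3   /* assumed tolerance before declaring a fault */

    /* Stand-in heartbeat: the primary answers 5 times, then goes silent */
    static bool check_heartbeat(void)
    {
        static int beats = 5;
        return beats-- > 0;
    }

    int main(void)
    {
        int missed = 0;
        while (missed < MAX_MISSED)
            missed = check_heartbeat() ? 0 : missed + 1;
        printf("primary declared failed: promoting hot spare\n");
        return 0;
    }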
Availability vs. Reliability
Are They the Same?
– Are all Reliable Systems Highly Available?
– Are all Highly Available Systems Reliable?
Reliability = Long MTBF
– Mean Time Between Failures
– FDIR When Failures Do Occur
    Fault Detection, Isolation, and Recovery
    Safing
    MTTR (Mean Time to Recover)
Availability = MTBF / (MTBF + MTTR) = % Uptime
– MTBF = 8,766 hours (525,960 minutes)
– MTTR = 5 minutes
– Availability = 525,960 / (525,960 + 5) = 99.999% Uptime (worked check below)
Negative Tests – HA Design Criteria
– Five 9’s – Out of Service no more than about 5 minutes per year [Up 99.999% of the Time]
– Availability = MTTF / [MTTF + MTTR], where MTTF ≈ MTBF
– E.g. 0.99999 ≈ (365.25 x 24 x 60 minutes) / [(365.25 x 24 x 60) + 5 minutes]
– MTBF = Mean Time Between Failures, MTTR = Mean Time to Recovery
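A short C check of the arithmetic above, using only the values given on the slide:

    /* Worked check of the five-nines availability arithmetic. */
    #include <stdio.h>

    int main(void)
    {
        double year_min = 365.25 * 24.0 * 60.0;  /* 525,960 minutes/year */
        double mttr_min = 5.0;                   /* 5-minute recovery */

        double availability = year_min / (year_min + mttr_min);
        printf("availability = %.5f%%\n", availability * 100.0);

        /* Five-nines budget: 0.001% of the year out of service */
        double budget_min = (1.0 - 0.99999) * year_min;
        printf("five-nines downtime budget = %.2f min/year\n", budget_min);
        return 0;
    }

This prints roughly 99.99905% availability and a 5.26-minute (5 min 16 s) annual downtime budget, matching the figures on the RASM slide.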
See also: “Big Iron Lessons, Part 2: Reliability and Availability – What’s the Difference?” – applies RAS architecture lessons to the autonomic self-CHOP roadmap.
Availability
– Time System is Available for Use vs. Time System is Unavailable for Use
– Ratio of Time Available over Total Time (Both Available and Not Available)
– % of Time System is Available for Use Each Year (Or Other Arbitrary Period of Time)
Some Variation in Definitions of MTTF, MTBF, MTTR
– MTBF = MTTF + MTTR [In Book]
– MTTF is Simply Time Between Failures (Observed in Accelerated Lifetime / Stress Tests)
– MTBF is Often Stated – Does it Really Include MTTR or Not?
– Typically Recovery Time is Small, So Negligible; But if Long, the Difference is More Important
– Just Be Clear on Definitions [They Can Vary]
[Figure: failure/recovery timeline]

    available      recovery      available      recovery
    -------------|/////////|-----------------|/////////|-->
             Failure #1                   Failure #2
                 |<- MTTR ->|<---- MTTF ---->|
                 |<---------- MTBF --------->|
Serviceability
Field Anomaly Diagnosis
– Trace capabilities (see the ring-buffer sketch below)
– Debug capabilities
– Error detection and prediction
– Performance monitoring
– Ease of Component / Sub-system Replacement
– Diagnostics and Isolation
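One common (assumed here, not specified in the lecture) form of field trace capability is an in-memory ring buffer that keeps the most recent events for dumping after an anomaly:

    /* Sketch: trace ring buffer for field debug; event IDs are illustrative. */
    #include <stdio.h>
    #include <time.h>

    #define TRACE_DEPTH 8

    struct trace_entry { time_t when; int event_id; };

    static struct trace_entry trace_buf[TRACE_DEPTH];
    static unsigned trace_head;

    static void trace_event(int event_id)
    {
        trace_buf[trace_head % TRACE_DEPTH] =
            (struct trace_entry){ time(NULL), event_id };
        trace_head++;
    }

    static void trace_dump(void)   /* called on anomaly for diagnosis */
    {
        unsigned n = trace_head < TRACE_DEPTH ? trace_head : TRACE_DEPTH;
        for (unsigned i = trace_head - n; i < trace_head; i++)
            printf("event %d at %ld\n",
                   trace_buf[i % TRACE_DEPTH].event_id,
                   (long)trace_buf[i % TRACE_DEPTH].when);
    }

    int main(void)
    {
        for (int i = 0; i < 12; i++) trace_event(i);
        trace_dump();   /* shows only the 8 most recent events */
        return 0;
    }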
Maintainability
Field Upgrade Capabilities and Limitations
Make Before Break
– Can I upgrade while the Unit remains in service?
– Can I test the upgrade prior to service switch-over?
– Can I fall back to the old configuration on failure of the new one? (see the sketch below)
– E.g. Upload new Boot Code; if it fails, a Rolling Reboot causes Fail-Over to the Old Image
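A minimal sketch of that fall-back behavior, assuming an A/B image layout and a boot-attempt counter (both assumptions, not the lecture's design):

    /* Sketch: A/B boot-image selection with fall-back after failed boots. */
    #include <stdio.h>

    enum image { IMAGE_OLD, IMAGE_NEW };

    #define MAX_BOOT_ATTEMPTS 3   /* assumed rolling-reboot limit */

    static enum image select_boot_image(unsigned new_image_failures)
    {
        /* Each failed boot of the new image bumps the counter; past the
         * limit, the loader falls back to the proven old image. */
        return (new_image_failures < MAX_BOOT_ATTEMPTS) ? IMAGE_NEW : IMAGE_OLD;
    }

    int main(void)
    {
        for (unsigned fails = 0; fails <= MAX_BOOT_ATTEMPTS; fails++)
            printf("failures=%u -> boot %s image\n", fails,
                   select_boot_image(fails) == IMAGE_NEW ? "new" : "old");
        return 0;
    }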
Component, Unit, Sub-System Reliability
Stress Testing
– Hardware
    Thermal Cycling
    Vacuum Chamber
    Radiation Dosing
    Mechanical Shock and Vibe Tests
– Firmware/Software
    Hardware sub-system or component failure
    Repeated Testing for Long Duration
      – Memory Leaks
      – Race Conditions
      – Priority Inversions
      – Deadlock
    Monte Carlo Analysis
Monte Carlo Analysis
Repeatable Pseudo-Random Sequence (see the sketch below)
– Can Re-run a Failure Case with the Same Seed
– Nominal Value +/- Dispersion
    Dispersion is a Random Variation
    Uniform or Normal Distribution
– Drive Timing Variations
– Drive Input Variations
– Drive Simulation Variations
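A small C sketch of seeded, repeatable dispersion around a nominal value; the nominal service time, dispersion bound, and seed are illustrative assumptions:

    /* Sketch: repeatable Monte Carlo dispersion around a nominal value.
     * Logging the seed lets a failure case be re-run exactly. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        unsigned seed = 12345;        /* record this to reproduce a run */
        srand(seed);

        double nominal_ms = 10.0;     /* assumed nominal service time */
        double dispersion_ms = 2.0;   /* assumed +/- bound, uniform */

        for (int trial = 0; trial < 5; trial++) {
            double u = (double)rand() / RAND_MAX;             /* [0,1] */
            double t = nominal_ms + (2.0 * u - 1.0) * dispersion_ms;
            printf("trial %d: service time %.3f ms (seed %u)\n",
                   trial, t, seed);
        }
        return 0;
    }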
Beyond RASM? Autonomic Architectures
Four Goals (self-CHOP)
– Self-Configuring
– Self-Healing
– Self-Optimizing
– Self-Protecting
Eight Features
1. “Self Knowing” – Status at All Levels, BIST, Used by Peers and in Hierarchy
2. “Configurable/Reconfigurable” – FPGA, Fusing, Cross Strapping
3. “Self Optimizing” – Built In PMU, Trace, SPA
4. “Self Healing” – Recovery, Fusing, Excess Capacity, Redundancy, Autonomous Repair
5. “Self Protecting” – Virus, Security, Data Integrity
6. “Discovery” – Finding and Identifying Resources
7. “Interoperable” – Component, Subsystem, System
8. “Self Managing” – Low or No Admin (Zero Uplink)
Motivation for Autonomic
Moore’s Law Reaching Limit?
– GHz of CPU Cycles, Multiple CPUs
– Gbits/sec of Bandwidth
– GBytes of Memory
– Terabytes of Storage
Ability to Manage These Resources Reaching Limit?
Maintenance Bigger Issue than Capacity, Throughput, and Bandwidth
Comparable to the Evolution of Telecom
– Party Lines (Users Manage and Arbitrate for Resources Directly)
– Operators for Exchanges Configure Connections (Centralized Management)
– Switching Systems (Automated Operations, Service Provisioning)
Number of Administrators/Operators Per System?
– N to 1 (Mainframes and Supercomputers)
– 1 to 1 (Minicomputers, File Servers, Workstations, PCs)
– 1 to N (Clusters, GRID Computing, RAID)
Autonomic Characteristics
Examples of Autonomic Features Today (see the watchdog sketch after this list)
– CPU
    Watchdog timer for software sanity
    Software recovery for a missed service deadline
    eFuse in-circuit enable/disable for post-manufacture reconfiguration
    PMU for on-chip performance monitoring
– IO
    Sensor Fusion and Sensor/Actuator Cross Strapping
    Diverse routing through networks
    CRC, timeouts, re-transmission
– Memory
    EDAC memory
– Storage
    RAID5 volume rebuild on drive failure
    RAID6 dual-failure handling
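A minimal C sketch of the software-watchdog pattern from the CPU bullet above; the miss tolerance and recovery action are assumptions, and a real watchdog would kick a hardware register rather than a counter:

    /* Sketch: software watchdog for service sanity. */
    #include <stdio.h>

    #define WATCHDOG_LIMIT 3   /* assumed missed-check-in tolerance */

    static int watchdog_count;

    static void kick_watchdog(void) { watchdog_count = 0; }

    static void watchdog_tick(void)   /* driven by a periodic timer */
    {
        if (++watchdog_count > WATCHDOG_LIMIT) {
            printf("service missed deadline: initiating recovery\n");
            watchdog_count = 0;   /* real system: restart task or reset */
        }
    }

    int main(void)
    {
        kick_watchdog();              /* healthy service checks in */
        for (int t = 0; t < 5; t++)   /* then hangs; watchdog fires */
            watchdog_tick();
        return 0;
    }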
Embedded Autonomic?
– Are Autonomic Principles System-Level Only?
– Do They Apply to Embedded Systems?
– To SoC?
– To Enterprise Systems?
Some Web Links
Mission-Critical SW/HW Component Examples
– Intel Wind River Systems, Overview
– Xilinx TMR, TMRTool
– TI, Hercules Safety MCUs
– Concurrent, RedHawk Linux
RASM (Enterprise Systems)
– Oracle, Sun RAS
– IBM, P-series RAS
– IBM, Z-series RAS
– MS, HA and DR for SQL Server
– Cisco, HA Overview
Autonomic Theory
– IEEE ICAC 2019
– IBM Research, Autonomic Computing