Characterization and Mitigation of Failures in Complex Systems Center for Advanced Computing and Communications Dept. of Electrical & Computer Eng Duke University Durham, NC Dr. K. S. Trivedi Dr. Y. Hong


Characterization and Mitigation of Failures in Complex Systems

Center for Advanced

Computing and Communications

Dept. of Electrical & Computer Eng

Duke University

Durham, NC 27708-0294

Dr. K. S. Trivedi Dr. Y. Hong

Center for Advanced Computing and Communication, Department of Electrical and Computer Engineering, Duke University

System Dependability

Dependability
    Impairments
        Faults
        Errors
        Failures
    Means
        Procurement
            Fault prevention (FP)
            Fault tolerance (FT)
        Validation
            Fault removal (FR)
            Fault forecasting (FF)
    Attributes
        Availability
        Reliability
        Safety
        Security

Survivability + Service

Performability + Performance

System Dependability

FP

Non-probabilistic modeling

Probabilistic modeling

FT

Redundancy: hardware, software, information, time

Diversity: data, design, environment

FF

FR

Verification (testing)

Maintenance: corrective (repair) and preventive

Minimal maintenance

Applications

Modeling and Analysis

Probabilistic Modeling

Methods & Tools

Fault removal

Fault prevention

Fault tolerance

Fault forecasting

A Research Triangle

Theory
    Modeling methods and numerical solutions: CTMC, SRN, MRSPN, FSPN, NHCTMC, FTREE

Tools
    SHARPE
    SPNP
    SREPT
    SRA

Applications
    Hardware and software:
        Fault-tolerant systems
        Real-time systems
        Computer networks
        Wireless networks

Modeling Methods

[Figure: map of modeling methods, spanning probabilistic modeling and non-probabilistic modeling (nonlinear dynamics). Formalisms shown: fault trees, DTMC, CTMC, SMP, MRGP, SPN, SRN, MRSPN, FSPN, NHCTMC, SDE, FBM, FGN, HS]

CTMC, SMP, MRGP

CTMC: T ~ Exp
SMP: T ~ Gen
MRGP: T1 ~ Gen, T ~ Exp

[Figure: sample paths (state vs. t) for the CTMC, the SMP, and the MRGP]

Relation: Dependability Models

[Figure: relationships among dependability model types: RBD, FT, FTRE, RG, CTMC, MRM, GSPN, SRN]

Modeling of HS and FSPN

Statistically multiplexed ATM switch and HS model:

[Figure: ATM switch fed by high-priority and low-priority traffic; labels Bt and B]

Modeling of HS and FSPN

[Figure: FSPN model for the ATM switch; label K]

Solution Techniques

Algorithms
    Fast algorithms for network reliability and fault tree computation
    Fast algorithms for SRN

Numerical and simulation tools
    SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator)
    SREPT (Software Reliability Estimation and Prediction Tool)
    SPNP (Stochastic Petri Net Package)
    SRA (software rejuvenation agents)

Architecture of SHARPE

Model types: ftree, relgraph, block, graph, pfqn, mfqn, Markov, SMP/MRGP, GSPN/SRN

Hierarchical & hybrid compositions

Analyses: reliability/availability, performance analysis, performability

Architecture of SPNP

Stochastic Reward Net (SRN) models
    Markovian stochastic Petri net
    Non-Markovian stochastic Petri net
    Fluid stochastic Petri net (FSPN)

Reachability graph

Analytic-numeric method
    Steady-state: SOR, Gauss-Seidel, power method
    Transient: standard uniformization, Fox-Glynn method, stiff uniformization

Discrete-event simulation (DES)
    Steady-state: batch means, regenerative simulation
    Transient: independent replication
    Fast methods: restart, importance sampling, importance splitting
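Standard uniformization, one of the transient analytic-numeric methods named on this slide, can be illustrated with a minimal Python sketch. This is not SPNP code: the function name `uniformization`, the dense list-of-lists generator matrix, and the 2% inflation of the uniformization rate are all assumptions made for illustration. The naive Poisson weighting used here is only suitable for small q*t; larger values are where the Fox-Glynn method becomes relevant.

```python
import math

def uniformization(Q, p0, t, eps=1e-10, max_terms=100000):
    """Transient state probabilities p(t) of a CTMC with generator Q.

    Q is a square list of lists whose rows sum to zero; p0 is the initial
    probability vector. Hypothetical helper for illustration only.
    """
    n = len(Q)
    # Uniformization rate: slightly above the largest exit rate.
    q = 1.02 * max(-Q[i][i] for i in range(n))
    if q == 0.0 or t == 0.0:
        return list(p0)
    if q * t > 700.0:
        raise ValueError("q*t too large for this naive sketch")
    # Embedded DTMC: P = I + Q/q.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
         for i in range(n)]
    v = list(p0)                   # v holds p0 * P^k
    weight = math.exp(-q * t)      # Poisson probability of k jumps, k = 0
    result = [weight * x for x in v]
    acc = weight                   # accumulated Poisson mass
    for k in range(1, max_terms):
        if 1.0 - acc <= eps:
            break
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k
        result = [r + weight * x for r, x in zip(result, v)]
        acc += weight
    return result

if __name__ == "__main__":
    lam, mu = 0.001, 0.1           # assumed failure and repair rates
    Q = [[-lam, lam], [mu, -mu]]   # state 0 = up, state 1 = down
    print(uniformization(Q, [1.0, 0.0], 10.0))
```

For the two-state example, the result can be checked against the closed form mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu) t) for the probability of being up.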

Architecture of SREPT

SREPT
    Black-box approach
    Architecture-based approach

Black-box Approach

Inputs, models, and outputs:

    Software product/process metrics + past experience
        Model: fault density, regression tree
        Output: estimate of the number of faults

    Test coverage
        Model: ENHPP model
        Output: predicted 1. failure intensity, 2. #faults remaining, 3. reliability after release

    Sequence of interfailure times
        Model: ENHPP model
        Output: predicted 1. failure intensity, 2. #faults remaining, 3. reliability after release, 4. coverage

    Sequence of interfailure times & release criteria
        Model: optimization & ENHPP model
        Output: predicted release time and predicted coverage

    Failure intensity, debugging policies
        Model: NHCTMC, discrete-event simulation
        Output: predicted 1. failure intensity, 2. #faults remaining, 3. reliability after release
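The ENHPP predictions listed on this slide can be illustrated with a short Python sketch. In the ENHPP framework the expected number of failures by time t is m(t) = a * c(t), where a is the expected total number of faults and c(t) is the coverage function. The exponential coverage c(t) = 1 - exp(-g t) chosen below is an assumption for illustration (it reduces the ENHPP to the Goel-Okumoto model), and the function and parameter names are hypothetical, not SREPT's interface. The conditional reliability shown is the generic NHPP form, which assumes testing and debugging continue.

```python
import math

def coverage(t, g):
    """Assumed exponential coverage function c(t) = 1 - exp(-g t)."""
    return 1.0 - math.exp(-g * t)

def mean_failures(t, a, g):
    """Expected number of failures by time t: m(t) = a * c(t)."""
    return a * coverage(t, g)

def failure_intensity(t, a, g):
    """Failure intensity: derivative of m(t), i.e. a * c'(t)."""
    return a * g * math.exp(-g * t)

def faults_remaining(t, a, g):
    """Expected number of faults not yet detected at time t."""
    return a * (1.0 - coverage(t, g))

def conditional_reliability(x, s, a, g):
    """P(no failure in (s, s + x]) for the NHPP with mean value m(t)."""
    return math.exp(-(mean_failures(s + x, a, g) - mean_failures(s, a, g)))

if __name__ == "__main__":
    a, g, t = 100.0, 0.05, 20.0   # illustrative values, not measured data
    print(failure_intensity(t, a, g))
    print(faults_remaining(t, a, g))
    print(conditional_reliability(1.0, t, a, g))
```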

Architecture-based Approach

Inputs: architecture; failure behavior of components

Methods: analytical modeling; discrete-event simulation

Output: reliability and performance prediction

Challenges

Largeness

Fault trees using BDD

Markov model

Non-exponential distributions

DES (Discrete-Event Simulation)

Joint continuous & discrete state spaces

Combined performance and reliability/availability MRM

FSPN

NHCTMC

SDE

Automated generation/solution based on SPN, SRN

Hierarchical methods, fixed point iteration

SMP, MRGP

SRN

HS

Stiffness

Applications

For high-performance and high-dependability

Software
    Software reliability (FR, FF)
    Software rejuvenation (FF, FR)
    Software fault tolerance (FT, FF)

Hardware (with software)
    Preventive maintenance (FR, FF)
    Availability model (FF)
    Upgrade design (FT, FF)
    Performability of wireless communications (FF)
    Transient behavior of ATM networks with failure (FF)
    Error recovery in communication networks (FF)

Software Reliability

Early prediction of quality based on software product/process metrics.

Reliability growth modeling based on failure and coverage data collected during testing

Architecture-based software reliability

Developed a tool (SREPT)

Software Rejuvenation

Definition: a proactive fault management technique that counteracts the effects of software aging

Approaches:
    Periodic: rejuvenation at regular (deterministic) time intervals, irrespective of system load
    Periodic and instantaneous load: rejuvenation at regular intervals, waiting until the system load is zero
    Prediction-based: rejuvenation at times based on estimated times to failure (implemented in IBM Netfinity Director)
        Purely time-based estimation
        Time and workload-based estimation

Analytical Modeling

[Figure: state diagram with states A, B, and C; state labels: Available, Recovering, Undergoing rejuvenation]

Subordinated non-homogeneous CTMC for t =

Numerical Example

Service rate and failure rate are both functions of time t.

Measurement-based Estimation

Objective: detection and validation of software aging

Basic idea: periodically monitor and collect data on the attributes responsible for determining the health of the executing software

Quantifying the effect of aging: proposed metric is the estimated time to exhaustion
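An estimated time to exhaustion can be computed from periodically sampled resource data by fitting a trend and extrapolating to zero. The Python sketch below assumes a robust non-parametric trend estimate (Sen's slope, the median of all pairwise slopes); the slide does not say which estimator is used, so this choice, the function names, and the sample data are all assumptions for illustration.

```python
from statistics import median

def sen_slope(times, values):
    """Median of all pairwise slopes (Sen's slope estimator)."""
    slopes = [
        (values[j] - values[i]) / (times[j] - times[i])
        for i in range(len(times))
        for j in range(i + 1, len(times))
        if times[j] != times[i]
    ]
    return median(slopes)

def estimated_time_to_exhaustion(times, free_amounts):
    """Time from the last sample until the fitted trend reaches zero.

    Returns None when the resource shows no depletion trend.
    """
    slope = sen_slope(times, free_amounts)
    if slope >= 0:
        return None
    # Robust intercept: median residual of the fitted line.
    intercept = median([v - slope * t for t, v in zip(times, free_amounts)])
    level_now = intercept + slope * times[-1]
    return max(level_now, 0.0) / -slope

if __name__ == "__main__":
    hours = [0, 1, 2, 3, 4]              # hypothetical sample times
    free_mb = [100, 98, 50, 94, 92]      # one outlier at hour 2
    print(estimated_time_to_exhaustion(hours, free_mb))
```

The median-based fit is used here because a single outlying sample, such as a temporary spike in usage, does not change the estimated depletion rate.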

[Figure: non-parametric regression smoothing of Rossby data: Real Memory Free]

Software Fault Tolerance

Recovery block: common-mode software failures, common-mode acceptance test failures, failures clustered in the input stream

A generalized model for the recovery block strategy to capture the dependencies:
    operate input within a recovery block execution
    evaluate module output in the recovery block
    difficult inputs clustered in input space

SRN Model for Recovery Block

A recovery block with clustered failures

Hardware Maintenance

Condition-based maintenance:
    CTMC: closed-form solution
    SHARPE: numerical solution
    Find the optimal inspection interval

Hardware Maintenance

[Figure: comparison of the closed-form result and the numerical result]

Hardware Maintenance

Time-based maintenance:
    SMP-based solution (PM: preventive maintenance)
    Find the optimal maintenance interval
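How an optimal maintenance interval arises in an SMP model can be shown with a small Python sketch. It assumes a textbook three-state cycle (up, then either repair after a failure or PM after delta time units), a Weibull time to failure, and a grid search over candidate intervals. The steady-state availability is the mean up time per cycle divided by the mean cycle length. The model, the function names, and the numbers are illustrative assumptions, not the authors' model or data.

```python
import math

def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t/scale)^shape)."""
    return math.exp(-((t / scale) ** shape))

def availability_with_pm(delta, shape, scale, mean_repair, mean_pm, steps=2000):
    """Steady-state availability when PM is triggered delta after each renewal."""
    # Mean up time per cycle: E[min(T, delta)] = integral of R(t) over [0, delta],
    # evaluated with the trapezoidal rule.
    h = delta / steps
    total = 0.5 * (1.0 + weibull_reliability(delta, shape, scale))
    total += sum(weibull_reliability(k * h, shape, scale) for k in range(1, steps))
    mean_up = h * total
    # The cycle ends in a failure with probability F(delta), otherwise in PM.
    p_fail = 1.0 - weibull_reliability(delta, shape, scale)
    mean_down = p_fail * mean_repair + (1.0 - p_fail) * mean_pm
    return mean_up / (mean_up + mean_down)

def best_pm_interval(candidates, shape, scale, mean_repair, mean_pm):
    """Candidate interval with the highest availability (simple grid search)."""
    return max(
        candidates,
        key=lambda d: availability_with_pm(d, shape, scale, mean_repair, mean_pm),
    )

if __name__ == "__main__":
    candidates = list(range(50, 2001, 50))   # hours, hypothetical grid
    best = best_pm_interval(candidates, 2.0, 1000.0, 50.0, 5.0)
    print(best, availability_with_pm(best, 2.0, 1000.0, 50.0, 5.0))
```

With an increasing failure rate (shape above 1) and PM cheaper than repair, the search finds an interior optimum. With an exponential time to failure (shape equal to 1), PM brings no benefit in this model and the longest candidate interval wins.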

Availability Models

Availability models commonly used in practice assume that times to outage and recovery are exponentially distributed.

How accurate are all-exponential models for systems with limited information about outage/recovery times?

Can we give availability bounds for such systems?

Result of a General Model

For a system with multiple types of outage-recovery (non-exponentially distributed outages), the underlying stochastic process is a semi-Markov process (SMP).

We give a closed-form formula of system availability.

Findings –

Only the mean value of time-to-recovery (E[TTR]) affects the availability of systems. The distribution does not matter.

However, the distribution of Time-to-outage (TTO) does affect availability.
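The first finding can be checked numerically with a small Python sketch. It assumes a simple one-up-state SMP in which competing outage clocks restart after every recovery, so the availability is the mean time to outage divided by the mean cycle length. This is a generic SMP ratio used for illustration, not the authors' closed-form formula, and all names and rates are hypothetical. For simplicity the simulation draws exponential times to outage. The TTO distribution enters the ratio through the mean time to first outage and through the probabilities of each outage type.

```python
import random

def smp_availability(mean_tto, outage_probs, mean_ttrs):
    """Mean up time over mean cycle time for a one-up-state SMP."""
    mean_down = sum(p * m for p, m in zip(outage_probs, mean_ttrs))
    return mean_tto / (mean_tto + mean_down)

def simulate_availability(rates, ttr_samplers, cycles=20000, seed=1):
    """Fraction of time up over many up/recovery cycles."""
    rng = random.Random(seed)
    up = down = 0.0
    for _ in range(cycles):
        # Competing outage clocks: the first to fire decides the outage type.
        clocks = [rng.expovariate(r) for r in rates]
        kind = clocks.index(min(clocks))
        up += clocks[kind]
        down += ttr_samplers[kind](rng)
    return up / (up + down)

if __name__ == "__main__":
    rates = [0.01, 0.002]          # assumed outage rates per hour
    mean_ttrs = [0.5, 4.0]         # assumed mean recovery times in hours
    total = sum(rates)
    print(smp_availability(1.0 / total, [r / total for r in rates], mean_ttrs))
    # Exponential vs. deterministic recovery with the same means.
    print(simulate_availability(rates, [lambda g: g.expovariate(2.0),
                                        lambda g: g.expovariate(0.25)]))
    print(simulate_availability(rates, [lambda g: 0.5, lambda g: 4.0]))
```

Both simulated values should land close to the closed-form ratio, since only the mean recovery times enter it.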

Upgrade with Redundancy

Active node

Standby node

Load-sharing clustering is common in networks (wired, wireless, optical) and server systems.

How to take advantage of the cluster structure and redundancy to upgrade hardware and software components?

How to quantify system performance for different upgrade schemes?

Upgrade Schemes

Baseline model

Direct transfer:

Simple, but long downtime, for non-realtime-critical system.

Phased upgrade:

Almost zero downtime, upgrade paradox, version incompatibility, etc.

Performability of Cellular Control Channel Protection

Each cell has Nb base repeaters (BR)

Each BR provides M TDM channels

One control channel resides in one of the BRs

Control channel down

System down(!)

Automatic Protection Switch

Upon control_down, the failed control channel is automatically switched to a channel on a working base repeater.

control_down now causes only a partial system outage, no longer a full outage!

[Figure labels: APS, Asys, Pblock, Pdrop]

Model of System w/o APS

CTMC state (b, k)
    b = number of BRs up
    k = number of talking channels

Model of System w/ APS

A Segment of the Composite Markov Chain Model

CTMC state (b, k)
    b = number of BRs up
    k = number of talking channels

Numerical Results

[Figures: handoff call blocking probability; unavailability in handoff call dropping probability; improvement by APS]

ATM Networks under Overloads

To quantify the effects of transients in ATM networks:

• Markov modulated Poisson process (MMPP) used to allow for correlated arrivals
• Transient effects: relaxation time, maximum overshoot, and expected excess number of losses in overload

[Figure: 2-state MMPP traffic model with ON and OFF states and low and high arrival rates]
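A 2-state MMPP can be sketched in a few lines of Python. The arrival rate depends on the state of a two-state CTMC, which is what produces correlated (bursty) arrivals. The state numbering (0 for the low-rate state, 1 for the high-rate state), the parameter names, and the rates below are assumptions for illustration, since the symbols in the original figure are not recoverable.

```python
import random

def mmpp2_stationary(r01, r10):
    """Stationary probabilities (pi0, pi1) of the modulating two-state CTMC."""
    total = r01 + r10
    return r10 / total, r01 / total

def mmpp2_mean_rate(lam0, lam1, r01, r10):
    """Long-run mean arrival rate of the MMPP."""
    pi0, pi1 = mmpp2_stationary(r01, r10)
    return pi0 * lam0 + pi1 * lam1

def simulate_mmpp2(lam0, lam1, r01, r10, horizon, seed=0):
    """Arrival times in [0, horizon), starting in state 0."""
    rng = random.Random(seed)
    lam = (lam0, lam1)        # arrival rate in each state
    switch = (r01, r10)       # rate of leaving each state
    t, state, arrivals = 0.0, 0, []
    while True:
        total = lam[state] + switch[state]
        t += rng.expovariate(total)        # time to the next event of either kind
        if t >= horizon:
            return arrivals
        if rng.random() < lam[state] / total:
            arrivals.append(t)             # the event is an arrival
        else:
            state = 1 - state              # the event is a state change

if __name__ == "__main__":
    print(mmpp2_mean_rate(1.0, 10.0, 0.5, 1.5))
    arrivals = simulate_mmpp2(1.0, 10.0, 0.5, 1.5, horizon=1000.0)
    print(len(arrivals) / 1000.0)
```

The simulated long-run rate should approach the stationary mean rate as the horizon grows.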

Queuing Model for ATM

A failure occurs in the connection between N1 and N3. Therefore, the traffic destined for N3 is re-routed through N2.

SRN Model for ATM

Numerical Results

[Figures: loss probability vs. burstiness; queue length vs. burstiness]

Error Recovery in Communication

1:N protection switching:

Study error recovery in communication networks using an MRGP, which allows non-exponential detection/restoration times

Two Modeling Methods

A CTMC is used when the detection and restoration times are ignored

An MRGP (a non-Markovian model) is used when the detection and restoration times are included

Numerical Result

Comparison of CTMC and non-Markovian model

Open Problems

Theoretical studies in the modeling and analysis of complex systems and failures
    Capabilities, limitations, and relationships among formalisms such as FSPN and HS, FSPN and SDE

Fast algorithms
    Fast algorithms for FSPN
    Fast algorithms for discrete-event simulation
    Fast algorithms for non-Markovian models

Applications
    Software: further exploration of software rejuvenation
    Performability analysis of (wired and wireless) networks

The End

Thank you!