
ReSIST courseware — M. Kaâniche, K. Kanoun, J-C. Laprie — Dependability and Security Evaluation

Dependability and Security Evaluation of Computer-Based Systems

Mohamed Kaâniche, Karama Kanoun, Jean-Claude Laprie
{mohamed.kaaniche, karama.kanoun, jean-claude.laprie}@laas.fr

Focus of the lecture

(The dependability and security tree:)
- Attributes: Availability, Reliability, Safety, Confidentiality, Integrity, Maintainability
- Threats: Faults, Errors, Failures
- Means: Fault Prevention, Fault Tolerance, Fault Removal, Fault Forecasting

Dependability addresses accidental threats; security covers accidental and malicious ones.

Outline

- Introduction
- Ordinal and quantitative evaluation
- Definitions of quantitative measures
- Probabilistic evaluation methods
  - Combinatorial models: reliability block diagrams, fault trees
  - State-based models: Markov chains
- Case studies
- Evaluation with regard to malicious threats
- Conclusion
- References

Dependability evaluation

- Estimate the present number, the future incidence, and the likely consequences of faults
- Assess the level of confidence to be placed in the target systems with regard to their ability to meet specified objectives
- Support engineering and design decisions:
  - comparative evaluation of candidate architectures
  - prediction of the level of resilience in operation
  - reliability, resource, and cost allocation based on quantified predictions

Two types of evaluation

- Qualitative, or "ordinal": identify, classify, and rank the failure modes or the event combinations (component failures or environmental conditions) that would lead to system failures
- Quantitative, or "probabilistic": evaluate, in terms of probabilities, the extent to which some of the attributes are satisfied; attributes are then viewed as measures

Place in the life-cycle process

Dependability evaluation accompanies each phase of development and operation:
- Requirements: specification of dependability objectives, preliminary analysis of threats and failure modes
- Design: comparative assessment of architectural solutions
- Implementation & integration: measurements based on controlled experiments and testing
- Operation: follow-up of system dependability based on operational data

Hardware development model

[BSI 1985] Reliability of Manufactured Products, Systems, Equipment and Components, Part 1: Guide to Reliability and Maintainability Programme Management, BS 5760, British Standards Institution, 1985

(Figure: BS 5760 life-cycle flow across the Definition, Design, Production, and Operation phases — feasibility study, requirements, and specification & contract with their reliability (R) counterparts; design with material & component analysis, constraint & worst-case analysis, redundancy analysis, maintenance (M) analysis, reliability evaluation, and design review; development-model and sub-assembly manufacture with sub-assembly, performance, environment, accelerated, and endurance tests, reliability demonstration, and re-design/modification; production with manufacture, quality-control (QC) tests, and screening/burn-in; operation with field data feeding reliability evaluation and maintenance.)

Avionics: ARP 4754 standard

(Figure: parallel safety-assessment and system-development processes. Safety assessment: aircraft-level Functional Hazard Assessment (FHA) yielding failure conditions, effects, classification, and safety objectives; system-level FHA yielding failure modes, effects, classification, and safety requirements; Preliminary System Safety Assessment (PSSA) producing item requirements, safety objectives, and required analyses; Common Cause Analyses (CCAs) with separation requirements and verification; System Safety Assessment (SSA) feeding certification. System development: aircraft-level requirements, allocation of aircraft functions to systems, development of the system architecture, allocation of requirements to hardware and software, system implementation, and physical system implementation — exchanging failure modes & effects, functional interactions, and architectural requirements with the safety-assessment process.)

Dependability requirements and quantitative objectives

- Specification of dependability objectives
  - critical dependability attributes, degraded modes
  - qualitative and/or quantitative objectives
- Telecommunication system
  - unavailability: 2 minutes/year, i.e., availability ≈ 99.9996%
  - less than 0.01% of phone calls not processed properly
  - mean time to repair = 4 hours
- Space system (NASA)
  - "The on-orbit space station shall be capable of operating in the microgravity mode, as defined in .., for 30 day continuous periods per the mission profile … with a reliability of 0.80"
  - "Satellite communication shall provide 6 out of 12 channels of downlink at 10 Mbits/sec rate for two years of continuous operation"
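Downtime budgets and availability figures like those above convert directly into one another. A minimal Python sketch (assuming the 365-day year used later in the availability table; the function names are illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def availability_from_downtime(minutes_per_year):
    """Steady-state availability implied by a yearly downtime budget."""
    return 1.0 - minutes_per_year / MINUTES_PER_YEAR

def downtime_from_availability(a):
    """Yearly downtime (in minutes) implied by an availability target."""
    return (1.0 - a) * MINUTES_PER_YEAR

# the telecom objective: 2 minutes/year of unavailability
print(f"{availability_from_downtime(2.0):.4%}")   # -> 99.9996%
# "five nines"
print(downtime_from_availability(0.99999))        # -> ~5.256 minutes/year
```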

Airborne systems: classification of failure conditions

Consequence — probability of occurrence (per hour):
- Minor — probable: > 10^-5
- Major — rare: between 10^-7 and 10^-5
- Hazardous/Severe-Major — extremely rare: between 10^-9 and 10^-7
- Catastrophic — extremely improbable: < 10^-9

- Minor: failure conditions which would not significantly reduce aircraft safety, and which would involve crew actions that are well within their capabilities. Minor failure conditions may include, for example, a slight reduction in safety margins or functional capabilities, a slight increase in crew workload (such as routine flight-plan changes), or some inconvenience to occupants.
- Major: failure conditions which would reduce the capability of the aircraft or the ability of the crew to cope with adverse operating conditions to the extent that there would be, e.g., a significant reduction in safety margins or functional capabilities, a significant increase in crew workload or in conditions impairing crew efficiency, or discomfort to occupants, possibly including injuries.
- Hazardous/Severe-Major: failure conditions which would reduce the capability of the aircraft or the ability of the crew to cope with adverse operating conditions to the extent that there would be: (a) a large reduction in safety margins or functional capabilities; (b) physical distress or higher workload such that the flight crew could not be relied on to perform their tasks accurately or completely; or (c) adverse effects on occupants, including serious or potentially fatal injuries to a small number of those occupants.
- Catastrophic: failure conditions which would prevent continued safe flight and landing.

Evaluation methods

Modelling (analytical models, simulation) and measurements (controlled experiments, data from operation) complement each other: measurements provide parameter estimation and model validation, while models indicate the relevant parameters to measure.

- Ordinal evaluation: FMECA
- Probabilistic evaluation: reliability block diagrams, fault trees, state-based models (Markov chains, Petri nets)

FMECA: Failure Modes, Effects, and Criticality Analysis

- Initially used for hardware, then extended to software (SEEA: Software Error Effect Analysis)
- What can FMECA be used for?
  - identify, for each component or function, the potential failure modes and their consequences on the system (failure mode = the way a failure manifests itself)
  - assess the criticality of each failure mode: failures are prioritized according to how serious their consequences are and how frequently they occur
  - identify possible means to prevent or reduce the effects of each failure mode
  - define validation tests to analyze such failure modes

Generic failure modes (IEC 812-1985)

1. Structural failure (rupture)
2. Physical binding or jamming
3. Vibration
4. Fails to remain in position
5. Fails to open
6. Fails to close
7. Fails open
8. Fails closed
9. Internal leakage
10. External leakage
11. Fails out of tolerance (high)
12. Fails out of tolerance (low)
13. Inadvertent operation
14. Intermittent operation
15. Erratic operation
16. Erroneous indication
17. Restricted flow
18. False actuation
19. Fails to stop
20. Fails to start
21. Fails to switch
22. Premature operation
23. Delayed operation
24. Erroneous input (increased)
25. Erroneous input (decreased)
26. Erroneous output (increased)
27. Erroneous output (decreased)
28. Loss of input
29. Loss of output
30. Shorted (electrical)
31. Open (electrical)
32. Leakage (electrical)
33. Other unique failure conditions, as applicable to the system characteristics, requirements, and operational constraints

OS, service, and application failure modes

- Kernel crash: the machine hosting the service does not respond to requests
- Service crash: attempts to establish a connection to the service are refused (the process implementing the service died)
- Service hang: the service accepts incoming connections but does not reply within a time limit
- Application failure: the service returns incorrect results to clients, leading to application failure
- Exception: an invocation of the service results in an exception being raised
- …

Software errors: Spot 4

- Initialization of parameters and variables
- Computation errors: integer, real, …
- Erroneous setting of Boolean variables
- Decision-logic error in the selection between several alternatives after a test
- Corrupted input data
- Invalid outputs sent to another component
- …

Criticality = {severity, frequency}

Severity levels: Catastrophic, Critical, Marginal, Negligible
Frequency levels: Frequent, Probable, Occasional, Remote, Improbable, Incredible

Each (frequency, severity) pair maps to a risk class. Example: IEC 61508-5 standard:
- Class I: intolerable risk; risk-reduction measures are required
- Class II: undesirable risk, tolerable only if risk reduction is impracticable or if its costs are disproportionate to the improvement gained
- Class III: tolerable risk if the cost of risk reduction would exceed the improvement gained
- Class IV: negligible risk

Acceptable risk

Various principles exist to determine acceptable risk:
- MEM ("Minimum Endogenous Mortality"): a new or modified system shall not "significantly" increase the technologically caused death rate for any age group
- GAMAB ("Globalement Au Moins Aussi Bon", "globally at least as good"): any new solution shall, in total, be at least as good as the existing one
- ALARP ("As Low As Reasonably Practicable")
- …

Tolerable/acceptable risk and ALARP (As Low As Reasonably Practicable)

- Unacceptable region: risk cannot be justified except in extraordinary circumstances
- ALARP (tolerability) region — risk is undertaken only if a benefit is desired:
  - tolerable only if further risk reduction is impracticable or if its cost is disproportionate to the improvement gained
  - tolerable if the cost of risk reduction would exceed the improvement gained
- Broadly acceptable region: negligible risk; it is necessary to maintain assurance that risk remains at this level

(Figure also relates the three regions to development and validation costs and to risk classes I–IV.)

FMECA steps

- Break down the system into components
- Identify the functional structure and how the components contribute to functions
- Define the failure modes of each component, their causes, effects, and severities
  - local effect: on the system element under study
  - global effect: on the highest considered system level
- Enumerate possible means to detect and isolate the failures
- Identify mitigation actions to prevent or reduce the effects of failures, at the design level or in operation

FMECA worksheet (typical columns)

- Description of unit: ref. n°, function, operational mode
- Description of failure: failure mode, failure cause
- Failure effect: local, global
- Detection & mitigation: detection means, corrective actions
- Probability of occurrence
- Criticality level
- Comments

FMECA pros and cons

- Pros
  - applicable at the early design stage
  - detailed information about failure modes and their effects during the various stages of system development
  - contributes to the prevention of design faults and to the definition of fault-tolerance requirements and needs
  - useful inputs for validation testing
  - results can be used to aid failure analysis and maintenance during operation
- Cons
  - not suitable for multiple failures
  - may be tedious for complex systems .. but .. necessary

Probabilistic evaluation

- Quantitative measures: reliability, availability, safety, …
- Stochastic model(s): system behavior with respect to fault occurrence or activation, fault-tolerance and service-restoration strategies, ..
- Data characterizing the elementary processes (component failure rates, error-detection rates, recovery & repair rates, …), obtained from controlled experiments, field data (operation, feedback from similar systems), and expert judgment
- Activities: model construction, model solution, model validation

Dependability measures

Let X(t) = 1 when the service implements the system function (correct service) and X(t) = 0 when it does not (incorrect service); failures switch X from 1 to 0, restorations from 0 to 1.

- RELIABILITY: measure of the CONTINUOUS delivery of correct service (time to failure)
- MAINTAINABILITY: measure of the CONTINUOUS delivery of incorrect service (time to restoration)
- AVAILABILITY: measure of the delivery of correct service relative to the ALTERNATION of correct and incorrect service

Dependability measures

With θk the k-th time to failure and ξk the k-th time to restoration (X(t) alternating between 1 and 0 at instants t0 = 0, t1, t2, …, tk-1, tk, …):

- Reliability: Rk(u) = Prob.{θk > u} = Prob.{X(τ) = 1 ∀ τ ∈ [tk-1, tk-1 + u]}
- Availability: A(t) = Prob.{X(t) = 1} = E{X(t)}
- Maintainability: Mk(u) = Prob.{ξk ≤ u}

Multi-performing systems

- More than two service-delivery modes
  - correct service → progressive performance degradation
  - incorrect service → failure consequences
- X = {x1, x2, … xn}, where the xk are service-delivery modes (accomplishment levels)
- Two extreme cases: one correct service mode with several incorrect ones, or several correct service modes with one incorrect mode
- The xk are usually ordered, the order being induced by
  - performance levels: perf(x1) > perf(x2) > … > perf(xn), or
  - criticality levels: crit(x1) < crit(x2) < … < crit(xn),
  giving x1 > x2 > … > xn

Multi-performing systems: measures

- Reliability-like measures: continuous delivery of service according to modes {x1, …, xp} (time to service delivery in modes {xp+1, …, xn}):
  Rp(t) = Prob.{X(τ) ∈ {x1, …, xp} ∀ τ ∈ [0, t]}
- Availability-like measures: service delivery according to {x1, …, xp} relative to the alternation between modes {x1, …, xn}:
  Ap(t) = Prob.{X(t) ∈ {x1, …, xp}}

Particular cases

- One correct service mode: x1 = c
- Two incorrect service modes with very different severity levels:
  - benign incorrect service: x2 = ib
  - catastrophic incorrect service: x3 = ic
  (transitions: benign failure c → ib, catastrophic failure → ic, restoration ib → c)

R1(t) = Prob.{X(τ) = c ∀ τ ∈ [0, t]} → Reliability
R2(t) = Prob.{X(τ) ∈ {x1, x2} ∀ τ ∈ [0, t]} → Safety

Stable reliability vs. reliability growth

- Stable reliability: identical stochastic distributions of the times to failure θk, i.e., Prob.(θk < t) = Prob.(θk-1 < t); after restoration the system is identical to what it was at the previous restoration
  - hardware: failed components are replaced by identical non-failed components
  - software: restart
- Reliability growth: stochastic increase of the times to failure θk, i.e., Prob.(θk < t) < Prob.(θk-1 < t); progressive removal of residual development faults (without introduction of new faults)

(Figures: under stable reliability, times to failure and failure intensity remain level over time; under reliability growth, times to failure increase and failure intensity decreases.)

Time-to-failure characterization

θ: time-to-failure random variable (a similar analysis applies to the time to restoration ξ)

- Distribution function: F(t) = Prob.(θ ≤ t); monotonically increasing, F(0) = 0, F(∞) = 1
- Complementary distribution (survival) function: R(t) = Prob.(θ > t); monotonically decreasing, R(0) = 1, R(∞) = 0
- Probability density function: f(t), with f(t)·Δt = Prob.(t < θ ≤ t+Δt); f(t) = dF(t)/dt = -dR(t)/dt and ∫₀^∞ f(t)·dt = 1
- Hazard rate: z(t), with z(t)·Δt = Prob.(θ ≤ t+Δt | θ > t); z(t) = -(1/R(t)) · dR(t)/dt

Mean time to failure: E(θ) = MTTF = ∫₀^∞ t·f(t)·dt = ∫₀^∞ R(t)·dt

Statistical estimation

N identical systems tested under the same conditions during [t0, t]; times to failure identically distributed with distribution F(t) (N independent executions — Bernoulli trials):

- N(t0) = number of systems delivering correct service at t0 = N
- N(t) = number of systems delivering correct service at time t
- a system fails before t with probability F(t), and is operational until t with probability R(t)

N(t) is a random variable with a binomial distribution:

Prob.[N(t) = n] = C(N, n) · [R(t)]^n · [1 - R(t)]^(N-n), hence E[N(t)] = N · R(t)

Estimators: R̂(t) = N(t)/N(t0) and R̂(t, t+u) = N(t+u)/N(t)

Statistical estimation: example

Seven systems S1–S7 observed from t0:

- N(t0) = 7 (systems 1–7)
- N(t) = 4 (systems 3, 4, 5, 6)
- N(t+u) = 2 (systems 4, 6)

R̂(t) = N(t)/N(t0) = 4/7, hence F̂(t) = [N(t0) - N(t)]/N(t0) = 3/7
R̂(t, t+u) = N(t+u)/N(t) = 2/4
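The estimator R̂(t) = N(t)/N(t0) can also be checked by simulation. A sketch assuming exponentially distributed times to failure (the rate λ and horizon t are arbitrary):

```python
import math
import random

random.seed(1)
lam, t, N = 0.5, 1.0, 100_000

# N independent systems: draw each system's time to failure
thetas = [random.expovariate(lam) for _ in range(N)]

# N(t): number of systems still delivering correct service at time t
n_t = sum(1 for theta in thetas if theta > t)
r_hat = n_t / N                       # estimator R^(t) = N(t) / N(t0)

print(r_hat, math.exp(-lam * t))      # empirical vs. exact reliability
```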

Statistical estimation: f(t)

f(t) = Prob.(t < θ ≤ t+dt)/dt = dF(t)/dt = d[1 - R(t)]/dt

f(t) = d/dt [1 - E[N(t)]/N(t0)] = -(1/N(t0)) · dE[N(t)]/dt = lim_{Δt→0} (E[N(t)] - E[N(t+Δt)]) / (N(t0)·Δt)

Estimator: f̂(t) = [N(t) - N(t+Δt)] / (N(t0)·Δt)

Example (seven systems): N(t0) = 7 (systems 1–7), N(t) = 5 (systems 1, 3, 4, 6, 7), N(t+Δt) = 3 (systems 1, 3, 6) ⇒ f̂(t) = (5 - 3)/(7·Δt) = 2/(7·Δt)

Statistical estimation: z(t)

z(t) = d[1 - R(t)] / (R(t)·dt)

z(t) = -(1/E[N(t)]) · dE[N(t)]/dt = lim_{Δt→0} (E[N(t)] - E[N(t+Δt)]) / (N(t)·Δt)

Estimator: ẑ(t) = [N(t) - N(t+Δt)] / (N(t)·Δt)

Example (seven systems): N(t0) = 7, N(t) = 5 (systems 1, 3, 4, 6, 7), N(t+Δt) = 3 (systems 1, 3, 6) ⇒ ẑ(t) = (5 - 3)/(5·Δt) = 2/(5·Δt)

Statistical estimation: MTTF

Order the N observed times to failure: 0 ≤ θ(1) ≤ θ(2) ≤ … ≤ θ(N)

Estimator: M̂TTF = (1/N)·[θ(1) + θ(2) + … + θ(N)] = (1/N) Σ_{i=1..N} θ(i)
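The MTTF estimator above is simply the sample mean of the observed times to failure. A quick simulation sketch (the rate λ is arbitrary) showing it converges to 1/λ for a constant failure rate:

```python
import random

random.seed(2)
lam, N = 0.25, 200_000

# N observed times to failure, exponentially distributed
thetas = [random.expovariate(lam) for _ in range(N)]

# MTTF^ = (1/N) * sum of the observed times to failure
mttf_hat = sum(thetas) / N
print(mttf_hat)   # close to 1/lam = 4.0
```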

Relationships between measures

Each measure can be expressed in terms of the others:

- F(t) = 1 - R(t) = ∫₀ᵗ f(x)·dx = 1 - exp(-∫₀ᵗ z(x)·dx)
- R(t) = 1 - F(t) = ∫ₜ^∞ f(x)·dx = exp(-∫₀ᵗ z(x)·dx)
- f(t) = dF(t)/dt = -dR(t)/dt = z(t)·exp(-∫₀ᵗ z(x)·dx)
- z(t) = (1/(1 - F(t)))·dF(t)/dt = -(1/R(t))·dR(t)/dt = f(t) / ∫ₜ^∞ f(x)·dx
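These identities can be verified numerically. A sketch checking R(t) = exp(-∫₀ᵗ z(x)·dx) for a Weibull hazard (shape and scale chosen arbitrarily):

```python
import math

k, lam = 2.0, 3.0                                 # Weibull shape and scale (illustrative)
R = lambda t: math.exp(-((t / lam) ** k))         # survival function
z = lambda t: (k / lam) * (t / lam) ** (k - 1)    # hazard rate

t_end, n = 1.7, 100_000
dt = t_end / n
# midpoint rule for the cumulative hazard integral over [0, t_end]
H = sum(z((i + 0.5) * dt) * dt for i in range(n))

print(R(t_end), math.exp(-H))   # the two expressions agree
```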

Hardware component failure rate

(Figure: the "bathtub" curve of λ(t) — a decreasing early phase due to development faults and production defects, a flat useful-life phase, and an increasing wear-out phase.)

In operation: λ(t) ≈ constant = λ; failures are due to accidental faults.

Constant failure rate

z(t) = constant = λ ⇒ θ is exponentially distributed:

- f(t) = λ·exp(-λt)
- R(t) = exp(-λt), F(t) = 1 - exp(-λt)
- z(t) = -(1/R(t))·dR(t)/dt = λ
- E(θ) = MTTF = 1/λ

Memoryless property of the exponential distribution:

Prob.{correct service until t+T | correct service until t} = Prob.(correct service until t+T) / Prob.(correct service until t) = R(t+T)/R(t) = exp(-λT)

The number of failures per unit of time is a Homogeneous Poisson Process.
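The memoryless property can be checked directly from R(t) = exp(-λt); a sketch with arbitrary λ, t, and T:

```python
import math

lam, t, T = 0.8, 2.0, 1.5
R = lambda u: math.exp(-lam * u)   # reliability under a constant failure rate

# Prob.{correct until t+T | correct until t} = R(t+T) / R(t)
conditional = R(t + T) / R(t)
print(conditional, R(T))           # equal: survival up to t carries no memory
```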

MTTF, MTTR, MUT, MDT, MTBF

With θk the up periods and ξk the down periods:

- MTTF ("mean time to failure"): MTTFk = E{θk}
- MTTR (mean time to restoration): MTTRk = E{ξk}
- MUT (mean up time): MUT = lim_{k→∞} (E{θ1} + E{θ2} + … + E{θk}) / k
- MDT (mean down time): MDT = lim_{k→∞} (E{ξ1} + E{ξ2} + … + E{ξk}) / k
- MTBF (mean time between failures): MTBF = MUT + MDT ≈ MUT

Availability

- A(t) = Prob.{X(t) = 1} = E{X(t)}; Ā(t) = 1 - A(t) = Prob.{X(t) = 0}
- U(T): cumulated uptime ("correct service delivery time") in [0, T]
- Average availability in [0, T] = proportion of cumulated uptime in [0, T]:

  Aav(T) = (1/T)·E{U(T)} = (1/T)·E{∫₀ᵀ X(t)·dt} = (1/T)·∫₀ᵀ E{X(t)}·dt = (1/T)·∫₀ᵀ A(t)·dt

- Steady-state availability (under stable reliability): A = lim_{t→∞} A(t) = lim_{T→∞} Aav(T)

  A = MUT/(MUT + MDT), Ā = 1 - A = MDT/(MUT + MDT)

Availability vs. downtime:

  Availability   Unavailability   Downtime (min/year)
  0.99           0.01             5256
  0.999          0.001            525.6
  0.9999         0.0001           52.56
  0.99999        0.00001          5.256
  0.999999       0.000001         0.5256

Single-component system

- Failure rate λ ⇒ MTTF = 1/λ; reliability R(t) = exp(-λt)
- Restoration rate μ ⇒ MTTR = 1/μ
- Availability A(t):

A(t+dt) = Prob.(correct service at t AND no failure in [t, t+dt]) + Prob.(incorrect service at t AND restoration in [t, t+dt])
A(t+dt) = A(t)·(1 - λdt) + (1 - A(t))·μdt
⇒ dA(t)/dt = μ - (λ+μ)·A(t)

- With A(0) = 1: A(t) = μ/(λ+μ) + (λ/(λ+μ))·exp(-(λ+μ)t)
- With A(0) = 0: A(t) = (μ/(λ+μ))·[1 - exp(-(λ+μ)t)]

In both cases A(t) converges to the steady-state value μ/(λ+μ).
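The transient solution of dA/dt = μ - (λ+μ)A can be written for any initial condition; a sketch with illustrative rates (setting a0 to 1.0 or 0.0 reproduces the two formulas above):

```python
import math

lam, mu = 1e-3, 0.5   # failure and restoration rates (illustrative)

def availability(t, a0=1.0):
    """A(t) for a single repairable component: solution of dA/dt = mu - (lam+mu)*A."""
    a_inf = mu / (lam + mu)                      # steady-state availability
    return a_inf + (a0 - a_inf) * math.exp(-(lam + mu) * t)

print(availability(0.0))      # 1.0
print(availability(1e6))      # ~ mu/(lam+mu): the steady state, whatever a0 was
```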

System-level dependability modeling techniques

Probabilistic evaluation

Multi-component systems modelling

- Model construction → describe the system behavior taking into account:
  - structure
  - component failures, fault-tolerance/restoration strategies
- Model processing → evaluate quantitative measures

- Combinatorial models (reliability block diagrams, fault trees, reliability graphs, ..): stochastically independent components
- State-based models (Markov chains, stochastic Petri nets, non-Markovian models, ..): stochastic dependencies

Stochastic independence/dependence

- Stochastic independence: the occurrence of an event does not affect the probability distribution of occurrence of other events
- Stochastic dependence: the occurrence of an event does affect the probability distribution of occurrence of other events
- For events A and B:
  - A, B independent ⇔ Prob.{A∩B} = Prob.{A} · Prob.{B}
  - A, B dependent ⇔ Prob.{A∩B} ≠ Prob.{A} · Prob.{B}

Reliability block diagrams

- Graph topology describing how component reliability affects system reliability; each component is represented as a block
- SERIES (non-redundant systems): components whose failure leads to system failure
- PARALLEL (redundant systems): components whose failure leads to system failure only in combination with others

GRAPH PATH ⇔ CORRECT SERVICE

Model processing: series systems

Rk: reliability of component k, k = 1, …, n; R: system reliability

R = Prob.{system non-failed} = Prob.{comp. 1 non-failed AND comp. 2 non-failed AND … AND comp. n non-failed}

Stochastically independent components ⇒ R = Π_{k=1..n} Prob.{comp. k non-failed} = Π_{k=1..n} Rk

With Rk(t) = exp(-∫₀ᵗ λk(x)·dx): R(t) = exp{-∫₀ᵗ Σ_{k=1..n} λk(x)·dx} ⇒ λ(t) = Σ_{k=1..n} λk(t)

Identical components with λk(t) = λ ⇒ MTTF = 1/(nλ)
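A sketch of the series-system formulas (the rates are illustrative):

```python
import math

lams = [1e-4, 2e-4, 5e-5]   # component failure rates (illustrative)
t = 1000.0

# independent components in series: system reliability is the product
R_series = math.prod(math.exp(-l * t) for l in lams)
print(R_series)             # equals exp(-(sum of the rates) * t)

# n identical components, rate lam each: MTTF = 1/(n*lam)
n, lam = 4, 1e-4
print(1.0 / (n * lam))      # ~2500 hours
```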

Model processing: parallel systems

The system fails only when ALL components have failed:

1 - R = Π_{k=1..n} (1 - Rk) ⇒ R = 1 - Π_{k=1..n} (1 - Rk)

Rk(t) = exp(-λk·t) ⇒ R(t) = 1 - Π_{k=1..n} (1 - exp(-λk·t))

Identical components ⇒ MTTF = (Σ_{k=1..n} 1/k) · (1/λ)

The MTTF of a parallel system with n components is higher than the MTTF of a single component (n = 2 ⇒ improvement factor 1.5).
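The harmonic-sum MTTF factor can be tabulated exactly; a sketch using rationals (the values of n are arbitrary):

```python
from fractions import Fraction

def parallel_mttf_factor(n):
    """MTTF of n identical parallel components, in units of 1/lam: sum of 1/k, k=1..n."""
    return sum(Fraction(1, k) for k in range(1, n + 1))

print(parallel_mttf_factor(2))   # 3/2 -> the 1.5x improvement quoted above
print(parallel_mttf_factor(3))   # 11/6
```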

Parallel-series systems

Example: the system is the series of two stages C1 and C2; C1 is the parallel combination of C11 (itself the series of C111 and C112) and C12; C2 is the parallel combination of C21, C22, and C23:

R11 = R111 · R112
R1 = 1 - (1 - R11) · (1 - R12)
R2 = 1 - (1 - R21) · (1 - R22) · (1 - R23)
R = R1 · R2
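The reduction above is mechanical; a sketch with every elementary block's reliability set to 0.9 purely for illustration:

```python
def series(*rs):
    """Reliability of independent blocks in series: product of reliabilities."""
    p = 1.0
    for r in rs:
        p *= r
    return p

def parallel(*rs):
    """Reliability of independent blocks in parallel: 1 - product of unreliabilities."""
    q = 1.0
    for r in rs:
        q *= 1.0 - r
    return 1.0 - q

r = 0.9                      # every elementary block, for illustration
r11 = series(r, r)           # R11 = R111 * R112
r1 = parallel(r11, r)        # R1  = 1 - (1-R11)(1-R12)
r2 = parallel(r, r, r)       # R2  = 1 - (1-R21)(1-R22)(1-R23)
print(series(r1, r2))        # R   = R1 * R2, ~0.980
```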

“k-out-of-n” systems with voter

- n identical components and a voter
- the system is non-failed when at least k components deliver correct service

Rc: reliability of each component; Rv: reliability of the voter; R: system reliability

R(t) = Rv · Σ_{j=k..n} C(n, j) · [Rc(t)]^j · [1 - Rc(t)]^(n-j)
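The formula is a binomial tail weighted by the voter's reliability; a direct sketch (the numeric values are illustrative):

```python
from math import comb

def k_out_of_n(k, n, rc, rv=1.0):
    """Reliability of a k-out-of-n system of identical components, times the voter's."""
    return rv * sum(comb(n, j) * rc**j * (1 - rc) ** (n - j) for j in range(k, n + 1))

# 2-out-of-3 (TMR) with a perfect voter and Rc = 0.9:
print(k_out_of_n(2, 3, 0.9))   # 3*0.9^2 - 2*0.9^3 = 0.972
```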

TMR systems

“2-out-of-3” system with a perfect voter:

R(t) = 3·[Rc(t)]² - 2·[Rc(t)]³

Rc(t) = exp(-λt) ⇒ R(t) = 3·exp(-2λt) - 2·exp(-3λt)

MTTFc = 1/λ > MTTF = 5/(6λ)

(Figure: R(t) exceeds Rc(t) only in the useful region where Rc(t) > 0.5, i.e., λt < ln 2; beyond it the TMR system is less reliable than a single component.)

! Be careful when using MTTF to characterize dependability.
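The boundary of the useful region can be computed: R(t) = Rc(t) exactly when Rc(t) = 1/2, i.e., at λt = ln 2. A sketch (λ normalized to 1):

```python
import math

lam = 1.0                                         # normalized failure rate
Rc = lambda t: math.exp(-lam * t)                 # single component
Rtmr = lambda t: 3 * Rc(t) ** 2 - 2 * Rc(t) ** 3  # 2-out-of-3, perfect voter

t_cross = math.log(2) / lam                       # crossover where Rc = 0.5
print(Rtmr(t_cross), Rc(t_cross))                 # both 0.5: TMR only helps before this

print(5 / (6 * lam), 1 / lam)                     # MTTF_TMR < MTTF_component
```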

More complex structures

Transform into a “series-parallel” topology by conditioning on one component — here C2, the bridging component of a five-component structure (C1, C3 on one side, C4, C5 on the other):

R(t) = Prob.{system operational during [0, t] | C2 operational at t} · R2 + Prob.{system operational during [0, t] | C2 failed at t} · (1 - R2)

- C2 operational at t: C4 and C5 in parallel
- C2 failed at t: the series paths C1–C4 and C3–C5 in parallel

R(t) = {1 - (1 - R4)·(1 - R5)} · R2 + {1 - (1 - R1·R4)·(1 - R3·R5)} · (1 - R2)
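The factoring (total-probability) step above can be coded directly; a sketch with all component reliabilities set to 0.9 for illustration:

```python
def parallel(*rs):
    """Reliability of independent blocks in parallel."""
    q = 1.0
    for r in rs:
        q *= 1.0 - r
    return 1.0 - q

def bridge_reliability(r1, r2, r3, r4, r5):
    """Condition on the bridging component C2 (total probability)."""
    given_c2_ok = parallel(r4, r5)              # C2 up: C4 and C5 in parallel
    given_c2_down = parallel(r1 * r4, r3 * r5)  # C2 down: series paths C1-C4, C3-C5
    return given_c2_ok * r2 + given_c2_down * (1.0 - r2)

print(bridge_reliability(0.9, 0.9, 0.9, 0.9, 0.9))   # ~0.987
```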

Availability evaluation

The same approach can be applied provided that the components are stochastically independent with respect to failures AND restorations (⇒ one repairman per component).

Ak: availability of component k, k = 1, …, n; A: system availability

- Series systems: A = Π_{k=1..n} Ak
- Parallel systems: A = 1 - Π_{k=1..n} (1 - Ak)

System with n components and m replicas of each component

- Architecture GR (global redundancy): m replicas of the whole n-component series chain, in parallel
- Architecture LR (local redundancy): each of the n components individually replicated m times, the n replica groups in series

With ri(t) the reliability of component i:

R_GR(t) = 1 - [1 - Π_{i=1..n} ri(t)]^m
R_LR(t) = Π_{i=1..n} [1 - (1 - ri(t))^m]

System with n components and m replicas of each component (continued)

Case m = n = 2 with identical components, ri(t) = r(t):

R_GR(t) = 1 - [1 - r(t)²]² = 2·r(t)² - r(t)⁴
R_LR(t) = [1 - (1 - r(t))²]² = 4·r(t)² - 4·r(t)³ + r(t)⁴

ΔF = R_LR(t) - R_GR(t) = 2·r(t)²·[r(t) - 1]² ≥ 0

⇒ R_LR(t) ≥ R_GR(t)
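The two redundancy schemes and the inequality can be checked numerically; a sketch (the value of r is illustrative):

```python
def r_global(r, n, m):
    """Global redundancy: m replicas of the whole n-component series chain."""
    chain = r ** n
    return 1 - (1 - chain) ** m

def r_local(r, n, m):
    """Local redundancy: each of the n components replicated m times."""
    return (1 - (1 - r) ** m) ** n

r = 0.9
print(r_local(r, 2, 2), r_global(r, 2, 2))   # 0.9801 vs 0.9639: local redundancy wins
```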

System with n components and m replicas of each component: conclusion

- The local redundancy architecture offers m^n possible working configurations for the system, compared to m for the global redundancy architecture
- It is better to apply redundancy at the component level rather than at the system level

Fault trees

- Deductive, top-down approach: effects ➣ causes
- Starting from an undesirable event, represent graphically its possible causes (combinations of events)
- Combinations of events are expressed with logic gates; the basic gates are AND and OR
- Decomposable events (gate outputs) are distinguished from elementary (basic) events

Example

System of three components X, Y, Z; X and Y in parallel, Z in series with {X, Y}.

Elementary events: failures of X, Y, Z. Top event: system failure.

Fault tree: "system failure" = OR("Z failure", "{X,Y} failure"), where "{X,Y} failure" = AND("X failure", "Y failure").

Model processing

Stochastically independent components:

- AND gate: the output event E occurs when input events E1 AND E2 AND … AND En occur:
  E = E1 ∩ E2 ∩ … ∩ En
  Prob.(E) = Prob.(E1) · Prob.(E2) · … · Prob.(En)
- OR gate: the output event E occurs when input event E1 OR E2 … OR En occurs:
  E = E1 ∪ E2 ∪ … ∪ En, i.e., ¬E = ¬E1 ∩ ¬E2 ∩ … ∩ ¬En
  Prob.(E) = 1 - [1 - Prob.(E1)] · [1 - Prob.(E2)] · … · [1 - Prob.(En)]
  Two events: E = E1 ∪ E2 ⇒ Prob.(E) = Prob.(E1) + Prob.(E2) - Prob.(E1)·Prob.(E2)
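The gate probabilities for independent inputs make natural small helpers (the names are illustrative):

```python
def and_gate(*ps):
    """AND gate: all input events occur (independent inputs)."""
    out = 1.0
    for p in ps:
        out *= p
    return out

def or_gate(*ps):
    """OR gate: at least one input event occurs (independent inputs)."""
    out = 1.0
    for p in ps:
        out *= 1.0 - p
    return 1.0 - out

p1, p2 = 0.1, 0.2
print(and_gate(p1, p2))   # 0.02
print(or_gate(p1, p2))    # 0.1 + 0.2 - 0.1*0.2 = 0.28
```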

Example

For the tree of the previous example ("system failure" = Z failure OR (X failure AND Y failure)), with component reliabilities RX, RY, RZ:

Prob.(X failure) = 1 - RX, Prob.(Y failure) = 1 - RY, Prob.(Z failure) = 1 - RZ
Prob.({X,Y} failure) = [1 - RX]·[1 - RY]

1 - R = (1 - RZ) + [1 - RX]·[1 - RY] - (1 - RZ)·[1 - RX]·[1 - RY]
⇒ R = RZ·[RX + RY - RX·RY]

Minimal cut sets

- Cut set: set of events whose simultaneous occurrence leads to the occurrence of the top event of the tree
- Minimal cut set: cut set that does not include any other cut set
- Order: number of events in the cut set (order 1: a single event can lead to the top event)
- Each minimal cut set of a fault tree describes a significant combination of faults that could lead to system failure
  - critical components
  - identify design weaknesses → redundancy needs

60ReSIST courseware — M. Kaâniche, K. Kanoun, J-C. Laprie — Dependability and Security Evaluation

Minimal cut set computation: Boolean algebra

A ∩ A = A

A ∪ A = A

A ∪ B = A if A ⊃ B

A ∩ B = B if A ⊃ B

A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)

A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C)

¬(A ∪ B) = ¬A ∩ ¬B

¬(A ∩ B) = ¬A ∪ ¬B

A ∪ (¬A ∩ B) = A ∪ B

A ∩ (¬A ∪ B) = A ∩ B


Minimal cut sets: example

Fault tree:
T = E1 ∩ E2
E1 = A ∪ E3, with E3 = B ∪ C  ⇒  E1 = A ∪ B ∪ C
E2 = C ∪ E4, with E4 = A ∩ B  ⇒  E2 = C ∪ (A ∩ B)

Reduction:
T = E1 ∩ E2 = (A ∪ B ∪ C) ∩ (C ∪ (A ∩ B))
T = (A ∪ B ∪ C) ∩ C ∪ (A ∪ B ∪ C) ∩ (A ∩ B)
T = (A ∩ C) ∪ (B ∩ C) ∪ C ∪ (A ∩ B) ∪ (A ∩ B) ∪ (A ∩ B ∩ C)
T = C ∪ (A ∩ B)

Minimal cut sets: {C} and {A, B}
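The same reduction can be mechanised with sets: distribute the AND over the OR lists, then apply the absorption law to keep only minimal cut sets. A minimal sketch, not from the courseware:

```python
def minimal_cut_sets(products):
    # absorption law: drop any cut set that strictly contains another one
    cs = [frozenset(c) for c in products]
    return {c for c in cs if not any(other < c for other in cs)}

# T = (A ∪ B ∪ C) ∩ (C ∪ (A ∩ B)): distribute the AND over both OR lists
left = [{"A"}, {"B"}, {"C"}]
right = [{"C"}, {"A", "B"}]
products = [l | r for l in left for r in right]
mcs = minimal_cut_sets(products)
# → {C} and {A, B}, as obtained by hand above
```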


Reduced fault tree

T = C ∪ (A ∩ B)   (top event T = OR(C, E), with E = AND(A, B))

Prob.{T} = Prob.{C ∪ (A∩B)} = Prob.{C} + Prob.{A} Prob.{B} - Prob.{A} Prob.{B} Prob.{C}


Cut sets: Reliability computation

Ci: minimal cut set of order mi: Ci = E1i ∩ E2i ∩ … ∩ Emii

Eki: basic events      T: top event

Prob.{T} = Prob.{C1 ∪ C2 ∪ … ∪ Cm}

Inclusion-exclusion:

Prob.{T} = Σ(i=1..m) Prob.{Ci} - Σ(j=2..m) Σ(i=1..j-1) Prob.{Ci ∩ Cj}
         + Σ(k=3..m) Σ(j=2..k-1) Σ(i=1..j-1) Prob.{Ci ∩ Cj ∩ Ck} - … + (-1)^(m+1) Prob.{C1 ∩ C2 ∩ … ∩ Cm}

Prob.{T} bounds:

Σ(i=1..m) Prob.{Ci} - Σ(j=2..m) Σ(i=1..j-1) Prob.{Ci ∩ Cj} ≤ Prob.{T} ≤ Σ(i=1..m) Prob.{Ci}

If the probabilities of occurrence of the basic events are small:

Prob.{T} ≈ Σ(i=1..m) Prob.{Ci}
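Under the independence assumption, the inclusion-exclusion sum and the rare-event approximation are easy to compute over the minimal cut sets; a minimal sketch, with basic-event probabilities invented for illustration:

```python
from itertools import combinations

def prob_cut(events, p):
    # probability of the intersection of independent basic events
    out = 1.0
    for e in events:
        out *= p[e]
    return out

def prob_top(cuts, p):
    # exact inclusion-exclusion over the minimal cut sets C1 … Cm
    total = 0.0
    for r in range(1, len(cuts) + 1):
        for combo in combinations(cuts, r):
            union = frozenset().union(*combo)
            total += (-1) ** (r + 1) * prob_cut(union, p)
    return total

p = {"A": 0.01, "B": 0.02, "C": 0.005}
cuts = [frozenset({"C"}), frozenset({"A", "B"})]
exact = prob_top(cuts, p)                   # P(C) + P(A)P(B) - P(A)P(B)P(C)
approx = sum(prob_cut(c, p) for c in cuts)  # rare-event upper bound
```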


Reliability block diagrams & Fault trees

• Pros

• useful support to understand system failures and their relationships with component failures

• model processing is easy: powerful tools exist

• helpful to identify weak points in the design

• Cons

• components and events should be stochastically independent

• some extensions take into account some kinds of dependencies (e.g., extended fault trees)


Stochastic dependencies: example

• system with two components: C1 (primary), C2 (standby)

• C1: constant failure rate λ1

• C2: failure rate λ2(t) = λ2sb while in standby, λ2a once active (from the time of C1 failure)

R(t) = Prob.{system operational during [0, t]}

• C1 operational during [0, t] ⇒ exp(- λ1 t)

• C1 fails at τ AND C2 nonfailed during [0, τ] AND C2 operational during [τ, t]
  ⇒ ∫(0..t) λ1 exp(- λ1 τ) exp(- λ2sb τ) exp(- λ2a (t - τ)) dτ

R(t) = exp(- λ1 t) + λ1 exp(- λ2a t) ∫(0..t) exp(- (λ1 + λ2sb - λ2a) τ) dτ

R(t) = exp(- λ1 t) + λ1 exp(- λ2a t) [1 - exp(- (λ1 + λ2sb - λ2a) t)] / (λ1 + λ2sb - λ2a)
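The closed form can be cross-checked by integrating the convolution numerically (midpoint rule); the rates below are arbitrary illustration values:

```python
import math

def standby_reliability(t, l1, l2sb, l2a, steps=20000):
    # R(t) = P{C1 survives [0,t]}
    #      + ∫ P{C1 fails at τ} P{C2 survives standby [0,τ] and active [τ,t]}
    h = t / steps
    integral = 0.0
    for i in range(steps):
        tau = (i + 0.5) * h   # midpoint rule
        integral += (l1 * math.exp(-l1 * tau) * math.exp(-l2sb * tau)
                     * math.exp(-l2a * (t - tau)) * h)
    return math.exp(-l1 * t) + integral

l1, l2sb, l2a, t = 1e-3, 1e-4, 2e-3, 500.0
a = l1 + l2sb - l2a
closed = math.exp(-l1 * t) + l1 * math.exp(-l2a * t) * (1 - math.exp(-a * t)) / a
# the numerical and closed-form values agree
```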


State-based models

• Example

• system: two components X, Y; 1 repairman per component

• component states: Xc, Yc (correct service); Xi, Yi (incorrect service)

• system states: (Xc, Yc), (Xi, Yc), (Xc, Yi), (Xi, Yi)

State diagram (① = (Xc, Yc), ② = (Xc, Yi), ③ = (Xi, Yc), ④ = (Xi, Yi)):

① → ② : failure Y      ② → ① : restoration Y
① → ③ : failure X      ③ → ① : restoration X
② → ④ : failure X      ④ → ② : restoration X
③ → ④ : failure Y      ④ → ③ : restoration Y

Redundant system: correct service: ①, ②, ③; incorrect service: ④

Nonredundant system: correct service: ①; incorrect service: ②, ③, ④


Reliability and availability models

• Computation of Pj(t) depends on the probability distributions associated to the state transitions

• Homogeneous Markov chains: constant transition rates

• Reliability model: the system-failure state ④ is made absorbing (restorations out of ④ removed)

R(t) = P1(t) + P2(t) + P3(t)

• Availability model: full state diagram, with all failure and restoration transitions

A(t) = P1(t) + P2(t) + P3(t)

Pk(t): probability that the system is in state k at time t


Homogeneous Markov chains

failure rates: λX, λY; restoration rates: μX, μY

P1(t+dt) = [1 - (λX dt + λY dt)] P1(t) + μY dt P2(t) + μ X dt P3(t)

• 1 - (λX dt + λY dt) = Prob.{stay in ① during [t, t+dt]}

• μY dt = Prob.{ transition ② to ① during [t, t+dt]}

• μX dt = Prob.{ transition ③ to ① during [t, t+dt]}

P2(t+dt) = λY dt P1(t) + [1 - (λX dt + μ Y dt)] P2 (t) + μX dt P4(t)

P3(t + dt) = λX dt P1(t) + [1 - (λY dt + μX dt)] P3(t) + μY dt P4(t)

P4(t + dt) = λX dt P2(t) + λY dt P3(t) + [1 - (μX dt + μY dt)] P4(t)

Assumption: only one transition might occur during [t, t+dt].

(state diagram: states ① to ④ with failure rates λX, λY and restoration rates μX, μY)


Availability computation

P’1(t) = - (λX + λY) P1(t) + μY P2(t) + μX P3(t)

P’2(t) = λY P1(t) - (λX + μY) P2(t) + μX P4(t)

P’3(t) = λX P1(t) - (λY + μX) P3(t) + μY P4(t)

P’4(t) = λX P2(t) + λY P3(t) - (μX + μY) P4(t)

Matrix form: P’(t) = P(t) . Λ

      | - (λX + λY)   λY            λX            0           |
Λ =   | μY            - (λX + μY)   0             λX          |
      | μX            0             - (λY + μX)   λY          |
      | 0             μX            μY            - (μX + μY) |

P(t): state probability vector

Λ: infinitesimal generator matrix (transition rate matrix)

A(t) = P1(t) + P2(t) + P3(t) = P(t) . 1A

1A = (1 1 1 0)T : summation vector

Solution: P(t) = P(0) . exp(Λt)


Reliability computation

P’c(t) = Pc(t) . Λcc      with Pc(t) = (P1(t) P2(t) P3(t))

        | - (λX + λY)   λY            λX          |
Λcc =   | μY            - (λX + μY)   0           |
        | μX            0             - (λY + μX) |

Pc(t): correct service states probability vector

Λcc: transition rate matrix restricted to the correct service states (state ④ absorbing)

R(t) = P1(t) + P2(t) + P3(t) = Pc(t) . 1R

1R = (1 1 1)T : summation vector

Solution: Pc(t) = Pc(0) . exp(Λcc t)


Generalization: m states

Transition rate matrix: Λ = [λjk]

• λjk (j ≠ k): transition rate from state j to state k (off-diagonal terms)

• λjj = - Σ(k=1..m, k≠j) λjk : diagonal terms

State probability vector: P(t) = (P1(t) P2(t) … Pm(t)) = (Pc(t) Pi(t))

• Pc(t) = (P1(t) … Pmc(t)): mc correct service states

• Pi(t) = (Pmc+1(t) … Pm(t)): mi incorrect service states, with mc + mi = m

Block form:

      | Λcc   Λci |      correct service (mc states)
Λ =   | Λic   Λii |      incorrect service (mi states)


Quantitative measures: summary

• A(t) = P(0) . exp(Λt) . 1A

• A = lim(t→∞) A(t) = Πc . 1mc

• Π = (0 0 … 0 1j 0 … 0) . Λj^-1

• Λj: obtained from Λ by replacing its jth column by “1”s

• R(t) = Pc(0) . exp(Λcc t) . 1mc

• MTTF = - Pc(0) . Λcc^-1 . 1mc

• MTTR = - Pi(0) . Λii^-1 . 1mi

• MUT = (Πc . 1mc) / (Πc . Λci . 1mi)

• MDT = (Πi . 1mi) / (Πc . Λci . 1mi)

• MTBF = MUT + MDT = 1 / (Πc . Λci . 1mi)

Vectors: 1A = (1 … 1 0 … 0)T (mc “1”s, mi “0”s); 1mc = (1 … 1)T (mc “1”s); 1mi = (1 … 1)T (mi “1”s)
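The mean-time formulas reduce to small linear systems. As an illustration (not from the courseware), the sketch below computes MTTF = - Pc(0) . Λcc^-1 . 1mc for the aggregated two-component chain (states: both OK, one failed; failure rates 2λ, λ; repair rate μ) by solving Λcc x = -1mc with plain Gaussian elimination; it agrees with the closed form MTTF = (μ + 3λ)/(2λ²) derived further on.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting (fine for small dense systems)
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

lam, mu = 1e-3, 0.1
Lcc = [[-2 * lam, 2 * lam],
       [mu, -(lam + mu)]]      # correct-service block of the generator
x = solve(Lcc, [-1.0, -1.0])   # Λcc x = -1mc
mttf = x[0]                    # Pc(0) = (1 0): start with both components OK
```

Here mttf equals (mu + 3 * lam) / (2 * lam ** 2).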


exp(Λt): solution methods

• Taylor series

• Numerical integration of P’(t) = P(t) . Λ : Runge-Kutta

• Uniformization

• Laplace transforms

• Analytical approximation

Problem: model stiffness

Time to occurrence or duration of the considered events spans scales from milliseconds (error and fault handling) through seconds, hours and days up to months and years (solicitations, failures, maintenance)


Uniformization

• Also known as the randomization method

• Transformation of the continuous-time Markov chain into a discrete-time Markov chain

q = max(j) |λjj|

Q = Λ/q + I

P(t) = Σ(n=0..∞) exp(- q t) (q t)^n / n! . Φ(n)

Φ(0) = P(0)      Φ(n+1) = Φ(n) . Q

Truncation at the smallest k such that Σ(n=0..k) exp(- q t) (q t)^n / n! > 1 - ε

ε: acceptable error
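A minimal pure-Python sketch of the method (not from the courseware), applied to the aggregated two-component availability chain with illustration rates; the result matches the closed form A(t) = μ(μ+2λ)/(μ+λ)² + [2λ²/(μ+λ)²] exp(-(μ+λ)t) - [λ²/(μ+λ)²] exp(-2(μ+λ)t) obtained later via Laplace transforms:

```python
import math

def uniformize(Lam, p0, t, eps=1e-12):
    # P(t) = Σ_n e^{-qt} (qt)^n / n! · Φ(n), Φ(0)=P(0), Φ(n+1)=Φ(n)·Q, Q=Λ/q+I
    m = len(Lam)
    q = max(abs(Lam[i][i]) for i in range(m))
    Q = [[Lam[i][j] / q + (1.0 if i == j else 0.0) for j in range(m)]
         for i in range(m)]
    phi = list(p0)
    p = [0.0] * m
    w = math.exp(-q * t)       # Poisson weight for n = 0
    acc = 0.0
    n = 0
    while acc <= 1.0 - eps:    # truncate once the Poisson mass exceeds 1 - ε
        p = [p[j] + w * phi[j] for j in range(m)]
        acc += w
        n += 1
        phi = [sum(phi[i] * Q[i][j] for i in range(m)) for j in range(m)]
        w *= q * t / n
    return p

lam, mu, t = 1e-3, 0.1, 50.0
Lam = [[-2 * lam, 2 * lam, 0.0],
       [mu, -(lam + mu), lam],
       [0.0, 2 * mu, -2 * mu]]
P = uniformize(Lam, [1.0, 0.0, 0.0], t)
A_t = P[0] + P[1]              # availability: correct-service states 1 and 2
```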


Laplace transforms

Reminders:

L[f(t)] = f*(s) = ∫(0..∞) exp(-st) f(t) dt

L[f’(t)] = s f*(s) - f(0)

L[∫(0..t) f(x) dx] = f*(s) / s

• P’(t) = P(t) . Λ ⇒ s P*(s) - P(0) = P*(s) . Λ ⇒ P*(s) = P(0) . (s I - Λ)^-1

• P(t) = P(0) . exp(Λt) ⇒ L[exp(Λt)] = (s I - Λ)^-1

• R*(s) = Pc(0) . (s I - Λcc)^-1 . 1mc

• MTTF = ∫(0..∞) R(τ) dτ = lim(t→∞) ∫(0..t) R(τ) dτ

• MTTF = lim(s→0) s . L[∫(0..t) R(τ) dτ] = lim(s→0) s . (1/s . R*(s))

• MTTF = lim(s→0) R*(s) = R*(0) = - Pc(0) . Λcc^-1 . 1mc


Example: two-component system, 1 repairman per component

• Availability

Aggregated chain: 1 --2λ--> 2 --λ--> 3, with repairs 2 --μ--> 1 and 3 --2μ--> 2

      | - 2λ   2λ          0    |
Λ =   | μ      - (λ + μ)   λ    |
      | 0      2μ          - 2μ |

          | s + 2λ   - 2λ        0      |
s I - Λ = | - μ      s + λ + μ   - λ    |
          | 0        - 2μ        s + 2μ |

det(s I - Λ) = s (s + λ + μ) (s + 2(λ + μ))

A*(s) = (1 0 0) . (s I - Λ)^-1 . (1 1 0)T = [s² + 3(λ + μ) s + 2μ(2λ + μ)] / det(s I - Λ)

A(t) = μ(μ + 2λ)/(μ + λ)² + [2λ²/(μ + λ)²] exp(- (μ + λ)t) - [λ²/(μ + λ)²] exp(- 2(μ + λ)t)

The first term is the steady-state availability A; if λ « μ, the exponential terms become negligible as soon as t » 1/μ.


Example: two-component system, 1 repairman per component

• Direct computation of the steady-state availability

Λ2: Λ with its 2nd column replaced by “1”s

       | - 2λ   1   0    |
Λ2 =   | μ      1   λ    |
       | 0      1   - 2μ |

A = (0 1 0) . Λ2^-1 . 1A      with 1A = (1 1 0)T      det(Λ2) = 2(μ + λ)²

A = μ(μ + 2λ) / (μ + λ)²

• Reliability

        | - 2λ   2λ        |                 | s + 2λ   - 2λ       |
Λcc =   | μ      - (λ + μ) |      s I - Λcc = | - μ      s + λ + μ  |

det(s I - Λcc) = s² + (3λ + μ) s + 2λ²

s1,2 = [- (μ + 3λ) ± √(μ² + 6μλ + λ²)] / 2

R*(s) = (s + μ + 3λ) / [(s - s1)(s - s2)]

R(t) = [(s1 + μ + 3λ)/(s1 - s2)] exp(s1 t) - [(s2 + μ + 3λ)/(s1 - s2)] exp(s2 t)

λ « μ ⇒ s1 ≈ - 2λ²/μ, s2 ≈ - μ

R(t) ≈ (1 + 2λ²/μ²) exp(- (2λ²/μ) t) - (2λ²/μ²) exp(- μ t) ≈ exp(- (2λ²/μ) t)


Example: two-component system, 1 repairman per component

• Mean times

MTTF = - (1 0) . Λcc^-1 . (1 1)T = (μ + 3λ) / (2λ²) ≈ μ / (2λ²)

MUT = (μ + 2λ) / (2λ²) ≈ μ / (2λ²)

MDT = 1 / (2μ)

MTBF = MUT + MDT = (μ + λ)² / (2μλ²)

• Conclusion

• the system behaves as a single component with a constant failure rate λeq ≈ 2λ²/μ :

MTTF ≈ 1/λeq      R(t) ≈ exp(- λeq t)

Analytical approximation

• Goal: obtain an approximate expression of the system failure rate directly from the Markov chain

• Conditions:

• Λcc irreducible: the correct service delivery states correspond to a strongly connected graph

• λi « μi

⇒ The system behaves as a single component with a global (equivalent) failure rate λeq

• Approximation obtained by identification of the critical paths, from the initial state to a failed state: correct service delivery states (failure rates λi, repair rates μi) leading into the incorrect service delivery states (rates λj)


Approximate expressions

• Failure rate

λeq ≈ Σ(paths from initial state to a system failure state) [ Π(transition rates of the path) / Π(|λii|, i ∈ path, i ≠ initial state, i ≠ failure state) ]

approximation valid if λi « μj

• Steady-state unavailability

A̅ ≈ Σ(paths from initial state to a system failure state) [ Π(transition rates of the path) / Π(|λii|, i ∈ path, i ≠ initial state) ]

approximation valid if λi « μj


Examples

• Example 1

Chain: 1 --2λ--> 2 --λ--> 3, repair 2 --μ--> 1 (state 3 = system failure)

1 path: 1 ➙ 2 ➙ 3

λeq ≈ 2λ . λ / (λ + μ) ≈ 2λ²/μ

A̅ ≈ λ²/μ²
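For example 1, the path approximation can be checked against the exact MTTF of the same chain; a small sketch with arbitrary rates satisfying λ « μ:

```python
lam, mu = 1e-4, 0.5
# single failure path 1 -> 2 -> 3: rates 2λ then λ;
# the intermediate state 2 has total exit rate |λ22| = λ + μ
lam_eq = (2 * lam) * lam / (lam + mu)         # ≈ 2λ²/μ
mttf_exact = (mu + 3 * lam) / (2 * lam ** 2)  # exact MTTF of this chain
# 1/lam_eq approximates mttf_exact to within a small relative error
```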

• Example 2

Chain (from the figure): states 1 ➙ 2 ➙ 3 ➙ 4 with failure rates 3λ1, 2λ1, λ1 and repair rates μ1 (2 ➙ 1) and 2μ1 (3 ➙ 2); from states 1, 2, 3 a second failure mode leads to state 5 with rates 3λ2, 2λ2, λ2 (repair μ2); states 4 and 5 are the system failure states

λ1, λ2 « μ1, μ2

paths: 1 ➙ 5;  1 ➙ 2 ➙ 5;  1 ➙ 2 ➙ 3 ➙ 5;  1 ➙ 2 ➙ 3 ➙ 4

λeq ≈ 3λ2 + [3λ1 . 2λ2] / (2λ1 + 2λ2 + μ1)
      + [3λ1 . 2λ1 . λ2] / [(2λ1 + 2λ2 + μ1) . (2μ1 + λ1 + λ2)]
      + [3λ1 . 2λ1 . λ1] / [(2λ1 + 2λ2 + μ1) . (2μ1 + λ1 + λ2)]


State aggregation

Motivation: reduction of the number of states

• Previous example with identical components: λX = λY = λ, μX = μY = μ

P1(t + dt) = [1 - 2λ dt] P1(t) + μ dt P2(t) + μ dt P3(t)

P2(t + dt) = λ dt P1(t) + [1 - (λ dt + μ dt)] P2(t) + μ dt P4(t)

P3(t + dt) = λ dt P1(t) + [1 - (λ dt + μ dt)] P3(t) + μ dt P4(t)

P4(t + dt) = λ dt P2(t) + λ dt P3(t) + [1 - 2μ dt] P4(t)

Summing the equations for states ② and ③:

P1(t + dt) = [1 - 2λ dt] P1(t) + μ dt [P2(t) + P3(t)]

P2(t + dt) + P3(t + dt) = 2λ dt P1(t) + [1 - (λ dt + μ dt)] [P2(t) + P3(t)] + 2μ dt P4(t)

P4(t + dt) = λ dt [P2(t) + P3(t)] + [1 - 2μ dt] P4(t)

Aggregated chain: 1 --2λ--> (2,3) --λ--> 4, with repairs (2,3) --μ--> 1 and 4 --2μ--> (2,3)

      | - 2λ   2λ          0    |
Λ =   | μ      - (λ + μ)   λ    |
      | 0      2μ          - 2μ |


Generalization

• Motivation: state space reduction

• Exploit symmetries (identical components or behaviours)

• P = {P1, P2, … Ps}: partition of the state space E, s < m = |E|

λi,Pj = Σ(k ∈ Pj) λik : rate of transition from state i to partition Pj

• The model is reducible with respect to P if, for each pair (Pi, Pj), the rates λk,Pj are identical for every state k of Pi. This common value is the rate associated to the transition from Pi to Pj in the aggregated Markov chain.

• System with n components, 2 states per component:

• without aggregation → |E| = 2^n

• identical components, with aggregation → |E| = n + 1


Illustration on the example

P = {①, (②, ③), ④} → partitions P1, P2, P3

λk,Pj : rate of transition from state k to partition Pj

λ1,P2 = 2λ      λ2,P1 = μ      λ2,P3 = λ
λ3,P1 = μ       λ3,P3 = λ      λ4,P2 = 2μ

Aggregated chain: P1 --2λ--> P2 --λ--> P3, with repairs P2 --μ--> P1 and P3 --2μ--> P2


Markov reward models

• Useful for combined performance-availability evaluation (“performability”)

• Extension of continuous-time Markov chains with rewards

• Reward: performance index, capacity, cost, etc.

• Quantitative measures

• ri: reward rate associated with state i of the Markov chain

• Z(t) = rX(t) : instantaneous reward rate of the Markov chain X(t)

• Y(t) = ∫(0..t) Z(x) dx : accumulated reward in [0, t]

Expected instantaneous reward rate: E[Z(t)] = Σ(i) ri . Pi(t)

Expected steady-state reward rate: lim(t→∞) E[Z(t)] = Σ(i) ri . πi

Expected accumulated reward: E[Y(t)] = Σ(i) ri . ∫(0..t) Pi(x) dx
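As an illustration (not from the courseware), take the aggregated two-component chain and assume capacity rewards of 1, 0.5 and 0 for the states with two, one and zero working components; the expected steady-state reward rate is then a capacity-weighted availability:

```python
lam, mu = 1e-3, 0.1
rho = lam / mu
# steady-state probabilities of the 3-state chain (failures 2λ, λ; repairs μ, 2μ):
# balance equations give π ∝ (1, 2ρ, ρ²)
w = [1.0, 2 * rho, rho * rho]
total = sum(w)
pi = [x / total for x in w]
rewards = [1.0, 0.5, 0.0]   # assumed capacity per state (illustration only)
perf = sum(r * p for r, p in zip(rewards, pi))   # E[Z] = Σ ri πi
```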


Modelling of complex systems

• Model largeness

• number of components, dependencies

• Stiffness

• parameters on different time scales

→ Largeness/stiffness tolerance and avoidance techniques:

• automatic generation of the state space using stochastic Petri nets and their extensions

• structured model composition approaches with explicit description of dependencies

• hierarchical model decomposition and aggregation

• robust model solution and efficient storage techniques

• Non-exponential distributions

• use of semi-Markov or non-Markovian models

• approximation by exponential models using the method of stages

• use of simulation


Petri nets, SPNs and GSPNs

• A Petri net (PN) is defined by (P, T, I, O, M)

• P: places, represent conditions in the system

• T: transitions, represent events

• I, O: input and output arcs connecting places and transitions

• M: initial marking, number of tokens in each place

• Stochastic Petri nets (SPNs)

• PNs with exponentially distributed timed transitions

• GSPNs (Generalized Stochastic Petri Nets)

• PNs with exponentially distributed timed transitions and instantaneous transitions

• Advantages

• suitable for describing concurrency, synchronization, …

• reachability graph isomorphic to a Markov chain

(example GSPN: places p1, p2; an immediate transition with branching probabilities c and 1-c; timed transitions with rates μ1 and μ2)


Example 1: computer system with two failure and restoration modes (imperfect detection)

GSPN:

P1: system OK      P2: error activated      P3: nondetected error      P4: detected error

T1: fault activation (rate λ)
T2: detection (probability c)      T3: nondetection (probability 1-c)
T4: restoration (rate μ1)      T5: periodic maintenance (rate μ2)

Reachability graph markings:
M1 = (1 0 0 0)    M2 = (0 1 0 0)    M3 = (0 0 1 0)    M4 = (0 0 0 1)

Markov chain on the tangible markings, with transition rates cλ, (1-c)λ, μ1, μ2


Example 2: N redundant component system, 1 repairman per component

GSPN: place P1 (up components, initially N tokens) and place P2 (down components); failure transition with marking-dependent rate M(P1) . λ, repair transition with rate M(P2) . μ

Markov chain: state k = number of failed components, k = 0 … N

k ➙ k+1 at rate (N - k)λ      k+1 ➙ k at rate (k + 1)μ
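The steady-state distribution of this birth-death chain follows directly from the balance equations π(k+1) = π(k) (N-k)λ / ((k+1)μ); a minimal sketch with illustration values:

```python
def steady_state(n, lam, mu):
    # state k = number of failed components; unnormalised weights from
    # pi[k+1] = pi[k] * (n - k) * lam / ((k + 1) * mu)
    w = [1.0]
    for k in range(n):
        w.append(w[-1] * (n - k) * lam / ((k + 1) * mu))
    s = sum(w)
    return [x / s for x in w]

pi = steady_state(2, 1e-3, 0.1)
# for N = 2 this reproduces the two-component results,
# e.g. steady-state unavailability pi[2] = λ²/(λ+μ)²
```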


Hardware-software system with error propagation

(figure: three GSPN submodels composed through the error propagation places: a hardware model (places H-ok, H-e, H-t, H-nd, H-p, H-u, H-fd; rates λh, δh, ζh, μ; probabilities dh, ph), a software model (places S-ok, S-e, S-d, S-nd, S-u, S-fd, S-ft; rates λs, δs, ζs, ν; probabilities ds, ps, πs), and an error propagation model (place Prop, probabilities pph and 1-pph))


Fault tolerant systems: coverage

• Fault tolerance efficiency

Coverage c = Prob.{correct service delivery | active fault}

(figure: a functional processing unit produces outputs that are checked by the fault tolerance mechanisms; outcomes are detected errors, false alarms and non-detected errors; accepted outputs that are not among the specified outputs constitute the fault tolerance deficiency)


Fault tolerant systems modelling

Fault tolerant system failure results from:

• exhaustion of redundant resources due to consecutive (covered) failures

• non-covered component failures


Duplex system: active dynamic redundancy

Markov chain: state 1 (2 components non-failed), state 2 (1 component failed), state 3 (2 components failed = system failure)

1 ➙ 2 : covered failure, rate 2λc
1 ➙ 3 : non-covered failure, rate 2λ(1-c)      [λ = λc + λ(1-c)]
2 ➙ 3 : rate λ

r(t): component reliability      R(t): system reliability

P1(t) = r²(t)      P2(t) = 2c [1 - r(t)] r(t)

R(t) = P1(t) + P2(t) = 2c r(t) - [2c - 1] r²(t)

r(t) = exp(-λt) ➝ R(t) = 2c exp(-λt) - [2c - 1] exp(-2λt)

MTTF = (2c + 1) / (2λ) < 3 / (2λ)
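A direct transcription of the closed form; for c = 1 it falls back to the ideal parallel pair R = 2r - r². A minimal sketch with illustration values:

```python
import math

def duplex_active_reliability(t, lam, c):
    # R(t) = 2 c e^{-λt} - (2c - 1) e^{-2λt}, c = coverage of the first failure
    r = math.exp(-lam * t)
    return 2.0 * c * r - (2.0 * c - 1.0) * r * r

lam, t = 1e-3, 100.0
r = math.exp(-lam * t)
perfect = duplex_active_reliability(t, lam, 1.0)   # equals 2r - r²
reduced = duplex_active_reliability(t, lam, 0.95)  # coverage deficiency lowers R
```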


Duplex system: passive dynamic redundancy

Markov chain: state 1 (2 components non-failed), state 2 (1 component failed), state 3 (system failed)

1 ➙ 2 : covered failure of the active component, or spare component failure: rate λa c + λs
1 ➙ 3 : non-covered failure of the active component: rate λa (1 - c)
2 ➙ 3 : rate λa

ra(t): active component reliability      rs(t): spare component reliability

P1(t) = ra(t) . rs(t)

P2(t) = c ∫(0..t) rs(τ) d[1 - ra(τ)] ra(t - τ) + ra(t) [1 - rs(t)]

R(t) = ra(t) + c ∫(0..t) rs(τ) d[1 - ra(τ)] ra(t - τ)

With ra(t) = exp(-λt), rs(t) = exp(-kλt), 0 ≤ k ≤ 1:

R(t) = (1 + c/k) exp(-λt) - (c/k) exp(-(1 + k)λt)

MTTF = (1 + c + k) / [(1 + k) λ] < 2/λ


Comparison active - passive

• short mission duration: λt « 1

• Active dynamic redundancy

R(t) ≈ 1 - 2(1 - c) λt - [1 - 3(1 - c)] λ²t²

• Passive dynamic redundancy

R(t) ≈ 1 - (1 - c) λt + [1 - 2c - ck] λ²t² / 2

• Reliability: passive redundancy > active redundancy

• Service interruption: active redundancy < passive redundancy

⇒ tradeoffs


Illustration

(log-log plot of unreliability 1 - R(t) versus λt, for λt from 0.001 to 1: the non-redundant component is compared with the active duplex (curves 1, 2, 3 for c = 1, 0.99, 0.95) and the passive duplex with k = 0.1 (curves 4, 5, 6 for c = 1, 0.99, 0.95))


Impact of coverage factor and repair rate

Markov chain: 1 ➙ 2 at rate 2cλ (covered failure), 2 ➙ 1 at rate μ (repair), 2 ➙ 3 at rate λ, 1 ➙ 3 at rate 2(1-c)λ (non-covered failure)

R(t) ≈ exp{- 2 [(1 - c) + c λ/μ] λ t}

λSYS = 2 [(1 - c) + c λ/μ] λ

• 2(1-c)λ : non-covered failure

• 2c(λ/μ)λ : two consecutive failures, the first covered

⇒ sensitive to (1 - c) and λ/μ

MTTRCOMP = 1/μ      MTTFCOMP = 1/λ      MTTFSYS ≈ 1 / (2 [(1 - c) + c λ/μ] λ)

(plot of MTTFSYS/MTTFCOMP versus λ/μ, for c = 0.9, 0.95, 0.99, 0.995, 0.999 and 1: for c = 1 the equivalent rate follows the 2c(λ/μ)λ term, while any coverage deficiency flattens the curve at the level set by 2(1-c)λ; MTTF ≈ 1 / (2 [(1-c) + c λ/μ] λ))
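The sensitivity to the coverage deficiency is easy to see numerically; a minimal sketch with illustration values giving λ/μ = 10^-5:

```python
def lambda_sys(lam, mu, c):
    # equivalent failure rate of the repairable duplex:
    # non-covered failures + double failures with the first one covered
    return 2.0 * ((1.0 - c) + c * lam / mu) * lam

lam, mu = 1e-5, 1.0
ideal = lambda_sys(lam, mu, 1.0)   # only double failures remain: 2λ²/μ
real = lambda_sys(lam, mu, 0.99)   # a 1% coverage deficiency dominates
```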


Tools

ADVISER, ARIES, CARE III, METFAC, SAVE, SURE, ASSIST, HARP, etc.

SURF-2 (LAAS, France): GSPNs, Markov chains
UltraSAN (UIUC, USA): Stochastic Activity Networks (SANs)
SPNP (Duke, USA): multi-formalism (SPNs, stochastic reward nets, non-Markovian, fluid models)
SHARPE (Duke, USA): multi-formalism (combinatorial, state-based), hierarchical models
GreatSPN (Torino, Italy): GSPNs and stochastic well-formed nets
Möbius (UIUC, USA): multi-formalism (SANs, PEPA, fault trees, …)
DSPNexpress (Dortmund, Germany): deterministic and stochastic Petri nets
TimeNET (Hamburg, Germany): non-Markovian SPNs
DEEM (UNIFI-PISA, Italy): deterministic and stochastic Petri nets, multi-phased systems
DRAWNET++ (U. del Piemonte Orientale, U. Torino, U. Napoli, Italy): multi-formalism (parametric fault trees, SWN)


SURF-2

User input: Generalized Stochastic Petri Net (GSPN) or Markov chain

GSPN processing: structural verifications, marking graph, Markov chain generation

Markov chain processing: dependability, cost and performance measures (reliability, availability, reward, …)


SURF-2 measures

Transient: reliability, safety, maintainability, availability, unavailability, reward

Steady-state: availability, unavailability, reward

Means: MTFF, MTTF, MUT, MDT, MTBF, AC, reward


Model calibration and validation - data

• Estimation of model parameters based on data collected from operation, from controlled experiments, or using expert judgments

Controlled experiments (testing, fault injection): the system under test is instrumented; a controller applies a fault load and a workload and collects log files and experimental measures → failure modes, fault coverage, error detection latency

Data collection in operation (automatic/manual): event logs, failure reports → data processing and statistical analysis techniques → failure rates, repair rates, failure modes


Assessment based on operational data

Raw data: event logs, failure reports (e.g., sequences over time of normal activity events, CPU, disk and software errors, and reboots)

Data pre-processing: extraction of relevant information; identification/categorization of errors, failures, reboots; occurrence times, durations (data clustering algorithms)

Statistical analysis and modelling → measures, trends: failure/error distributions, failure/error recovery rates, availability estimation, impact of workload, error propagation analysis


Measurement-based studies

• Scope

• Hardware

• Software and operating systems

• Distributed systems and middleware

• Internet, web servers

• Human-computer interaction

• Security

• Wireless, mobile phones

• etc.

• Systems

• Fault tolerant: FTMP, SIFT, Tandem, VAX

• Desktop systems and servers: SunOS/Solaris, Windows NT/2K, Linux

• Symbian OS, etc.


Examples (1)

• DEC VAXCluster multicomputer

• 7 processing nodes and 4 disk controllers connected through a bus

• 8 months (December 1987 to August 1988)

Element  | MTTF   | λ            | MTTR     | μ       | coverage
CPU      | 8400 h | 1.19 10-4 /h | 24.8 min | 2.42 /h | 0.970
Disk     | 656 h  | 1.52 10-3 /h | 110 min  | 0.54 /h | 0.997
Network  | 1400 h | 7.14 10-4 /h | 53.4 min | 1.12 /h | 0.991
Software | 677 h  | 1.48 10-3 /h | 24.4 min | 2.46 /h | 0.1

• CMU Andrew file server

• 13 SUN II workstations - collection period: 21 workstation.year

Event class         | Mean time to occurrence (per system) | Number of events (all systems)
Permanent failures  | 6552 h | 29
Intermittent faults | 58 h   | 610
Transient faults    | 354 h  | 446
System crashes      | 689 h  | 298


Assessment based on operational data (2)

• LAAS-CNRS local area network

• 418 SunOS/Solaris, 78 Windows NT, 130 Windows 2K machines

• Jan. 1999 to Oct. 2003: 1392 system.year, about 50 000 reboots

                       | Unix         | Windows NT     | Windows 2K
Reboot rate            | 10-3 /hour   | 3.9 10-3 /hour | 5.3 10-3 /hour
Mean uptime            | 12.8 days    | 1.5 day        | 1.08 day
Mean downtime          | 180 min      | 30 min         | 35 min
Average unavailability | 4.4 day/year | 11.6 day/year  | 5.7 day/year


Fault tolerance efficiency assessment

Fault tolerance coverage C = Prob.{correct service delivery | active fault}

Fault → error → failure chain: fault dormancy precedes error activation, and error latency precedes detection; coverage decomposes into error and fault handling coverage, and assumption coverage


Experimental assessment

• Fault injection target

• HW, drivers, OS, API, middleware, application

• Fault model

• bit-flips (data, code segments, parameters)

• instruction mutation, dropping messages, …

Setup: the fault-tolerant system under test is instrumented; a controller applies the fault load and the workload, and collects log files and experimental measures


Fault injection techniques

Physical (prototype/real system): heavy ions (Chalmers U.), EM perturbations (TU Vienna), pin-level (MESSALINE), …

Software implemented (prototype/real system): node (ORCHESTRA), OS (Ballista, MAFALDA), memory (DITA), instructions (FERRARI), processor (Xception), …

Simulation based (simulation model): system (DEPEND), RT level (ASPHALT), logical gate (Zycad), circuit (FOCUS), wide range (MEFISTO), …


Delta-4 project

Target system: hosts attached to the network through NACs (Network Attachment Controllers) running AMp (Atomic Multicast protocol); a spare node is available

Two predicates checked under physical fault injection (MESSALINE), on successive versions of AMp and on two architectures (standard and duplex NAC):

• D predicate: auto-extraction of the faulty node

• T predicate: protocol properties OK and errors confined

(figure: fault pathology F → E → D → T annotated with measured percentages (94%, 86%, 99%; 0.5%, 6%, 1%, 13.5%), distinguishing non-detected-but-tolerated and detected-but-not-tolerated outcomes; plot of cumulative coverage of the D and T predicates versus time, from 10 ms to 100 s, for the standard and "duplex" NAC, non-significant experiments excluded)

J. Arlat, M. Aguera, Y. Crouzet, J-C. Fabre, E. Martins, D. Powell, “Experimental Evaluation of the Fault Tolerance of an Atomic Multicast Protocol”, IEEE Transactions on Reliability, vol. 39, n°4, 1990


POSIX OS robustness testing (Ballista)

POSIX system call parameter mutation (233 functions), applied to 15 “(C)OTS” operating systems: AIX, FreeBSD, HP-UX 9.05 and 10.20, Irix 5.3 and 6.2, Linux, LynxOS, NetBSD, OSF-1 3.2 and 4.0, QNX 4.22 and 4.24, SunOS 4.13 and 5.5 [Koopman & DeVale 99 (FTCS-29)]

(bar chart: normalized failure rate (0% to 50%) per OS, broken down into Abort, Silent and Restart failures, with Catastrophic failures flagged separately)

P. Koopman, J. DeVale, “The exception handling effectiveness of POSIX Operating Systems”, IEEE Transactions on Software Engineering, vol. 26, n°9, 2000


Dependability benchmarking

• “Standardised” framework for evaluating dependability and performance related measures, experimentally or based on experimentation and modeling

• Characterize objectively the system behavior in the presence of faults

• Non-ambiguous comparison of alternative solutions

• Non-ambiguity, confidence and acceptability ensured by a set of properties:

• representativeness, reproducibility, repeatability, portability, non-intrusiveness, scalability, cost effectiveness

• Benchmark = specification of a set of elements (dimensions) and a set of procedures for running experiments on the benchmark target to obtain dependability measures

• DBench IST project (www.laas.fr/dbench)

• SIGDeb: Special Interest Group on Dependability Benchmarking (IFIP 10.4 WG)


Case studies: application to real systems


Examples (system / partners):

• CAUTRA: air traffic control computing system (CENA, SRTI System; CENA, Sofreavia)

• Multiprocessor system with a distributed shared memory (Bull)

• Nuclear control command system (EDF)

• Automobile command control safety architectures: safety bag, electrical steering system (Siemens)

• GUARDS: generic upgradable architectures for real-time dependable systems: railways, space, nuclear (Ansaldo Trasporti, Astrium, Technicatome)

• Control command system, power station (EDF)

• TELECOM 1 satellite control system (CNES)

• Internet-based applications, mobile based architectures, interdependent critical infrastructures (DSoS, Hidenets, Crutial European projects)


Case study:

Availability Modeling and Evaluation of Telecom1 Satellite Control System


The results presented in these slides correspond to a preliminary (high-level) study performed for CNES, the French Space Agency.

They are intended to show the kind of results that can be obtained from modeling real-life systems.


System functions

T1: follow-up of the operational satellite, using C1, DS1, P1 (+ DA1 for system reconfiguration)

T2: follow-up of the spare satellite, using C2, DS2, P2 (+ DA2 for system reconfiguration)

T1 and T2 cannot be interrupted more than 15 min

T3: data archiving (off-line), using CS, DSS, PS; can be delayed

Components:
C1, C2 (computers 1 & 2), CS (“spare” computer)
DS1, DS2, DSS (system disks), DA1, DA2 (archiving disks)
P1, P2, PS (external devices)


System architecture

(figure: computers C1, CS, C2 with their system disks DS1, DSS, DS2, archiving disks DA1, DA2 and external devices P1, PS, P2; C1 runs T1, CS runs T3, C2 runs T2)


Automatic reconfiguration

• Failure of CS, DSS or PS: data archiving delayed

• Failure of an archiving disk (DA1, DA2) or of a system disk (DS1, DS2): tolerated

• Failure of C1 or P1: T1 transferred to CS, T3 delayed → utilization of DA1

• Failure of C2 or P2: T2 transferred to CS, T3 delayed → utilization of DA2


Manual reconfiguration

System failure = loss of T1 or T2 or both

Failure of 2 elements : 2 computers, 2 external devices or 2 system disks

Failure of (DA1 followed by C1) / DS1 OK

or Failure (DA2 followed by C2) / DS2 OK

Failure of (DA1 followed by P1) / DS1 OK

or Failure of (DA2 followed by P2) / DS2 OK

transfer of DS1 on DA1 or DSS

(resp. DS2 on DA2 or DSS) by operator

maximum duration : 15 min

C1 CS C2

P1

DS1

DSS

DA2

DS2

DA1P2

PSOperator

System states

Partition of states:
•	E1: T1 and T2 executed
  - E11: no failures
  - E12: faults have been tolerated automatically
  - E13: faults have been tolerated after intervention of the operator
•	E2: T1 or T2 is interrupted
  - E21: temporary failure (faults can be tolerated thanks to the operator)
  - E22: lasting (permanent) failures (maintenance is required)

Measures of dependability

Two dependability levels:
•	L1(t): probability of correct accomplishment of T1 and T2 without operator intervention
	L1(t) = P{e(τ) ∈ E11 ∪ E12; τ ∈ [0, t]}
•	L2(t): probability of correct accomplishment of T1 and T2 with or without operator intervention
	L2(t) = P{e(τ) ∈ E11 ∪ E12 ∪ E13 | e(τ) ∈ E11 ∪ E12 ∪ E13 ∪ E21; τ ∈ [0, t]}
•	Availability: A(t) = P{e(t) ∈ E1}
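In practice, an interval measure such as L1(t) can be obtained by making every state outside E11 ∪ E12 absorbing and computing the transient probability of remaining in the non-absorbing states. A minimal sketch on a hypothetical 3-state chain (toy rates, not the Telecom1 model):

```python
import numpy as np

# Toy 3-state CTMC (assumed rates, NOT the Telecom1 model):
# state 0 ~ E11 (no failure), state 1 ~ E12 (fault tolerated),
# state 2 ~ E13 u E2, made absorbing so that the probability of
# staying in {0, 1} over [0, t] plays the role of L1(t).
lam, mu, gamma = 1e-3, 0.25, 2e-4  # failure, recovery, "loss" rates (assumed)

Q = np.array([[-(lam + gamma), lam,            gamma],
              [mu,             -(mu + gamma),  gamma],
              [0.0,             0.0,           0.0]])  # state 2 absorbing

def transient(Q, p0, t, steps=20_000):
    """Explicit-Euler integration of dp/dt = p Q (adequate for a toy model)."""
    p = np.array(p0, dtype=float)
    dt = t / steps
    for _ in range(steps):
        p = p + dt * (p @ Q)
    return p

p = transient(Q, [1.0, 0.0, 0.0], t=1000.0)
L1 = p[0] + p[1]     # probability of never leaving E11 u E12 over [0, t]
print(round(L1, 3))  # -> 0.819 (= exp(-gamma * t) for these particular rates)
```

The real study solves a 29-state chain with a dedicated tool; the sketch only illustrates the absorbing-state construction behind interval measures.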

Modeling assumptions

•	Two types of repairmen, with a repair priority over the components: C1, C2 (computers 1 & 2), CS (“spare” computer), DS1, DS2, DSS (system disks), DA1, DA2 (archiving disks), P1, P2, PS (external devices)
•	Recovery factor: c = 1
•	Two types of disk failures:
  - electronic failures
  - electro-mechanic failures

Numerical values

Failure rates (10^-6 / h):

Element                        Electronic   Electro-mechanic
Computers C1, C2               1524         -
“Spare” computer CS            1630         555
System disks DS1, DS2, DSS     102          333
Archiving disks DA1, DA2       372          333
External devices P1, P2        386          -
External device PS             132          -

Repair durations: electronic 1/μe = 4 h; electro-mechanic 1/μm = 10 h.
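As a minimal illustration of what such rates imply (this is a single-element sketch, not the study's 29-state model), the steady-state unavailability of one repairable element with constant failure rate λ and repair rate μ is λ/(λ+μ):

```python
# Minimal sketch: steady-state unavailability of a single repairable
# element with constant failure rate lam and repair rate mu = 1/MTTR.
def steady_state_unavailability(lam_per_hour, mttr_hours):
    mu = 1.0 / mttr_hours
    return lam_per_hour / (lam_per_hour + mu)

# e.g. electro-mechanic failures of one system disk:
# 333e-6 /h with repair duration 1/mu_m = 10 h
print(round(steady_state_unavailability(333e-6, 10.0), 6))  # -> 0.003319
```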

Markov model

[Diagram (not all transitions displayed): a 29-state homogeneous Markov chain whose states are partitioned into E11 ∪ E12, E13, E21 and E22; transitions are labelled with the component failure rates (λC, λP, λCS, λDS, λDSS, λDA, split into electronic “e” and electro-mechanic “m” types) and the repair and operator rates μe, μm, μop.]

Some results

1/μm (h)                10           10           10           10           5            5
1/μe (h)                4            4            2            2            4            4
1/μop (min)             15           5            15           5            15           5
Unavailability          6.34E-4      6.32E-4      2.97E-4      2.95E-4      4.70E-4      4.69E-4
Unavailability (/year)  5 h 33 min   5 h 32 min   2 h 36 min   2 h 35 min   4 h 07 min   4 h 06 min
MDT                     4 h 13 min   4 h 12 min   2 h 55 min   2 h 53 min   3 h 44 min   3 h 43 min
MTBF (h)                6637         6637         9772         9772         7923         7923
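The per-year rows follow directly from the unavailability values (a plain unit conversion using 8760 h per year), which can be checked in a couple of lines:

```python
# Convert a steady-state unavailability figure into the expected
# outage duration per year (8760 h per year), as in the table above.
def outage_per_year(unavailability, hours_per_year=8760):
    total_minutes = unavailability * hours_per_year * 60
    hours, minutes = divmod(round(total_minutes), 60)
    return f"{hours} h {minutes:02d} min"

print(outage_per_year(6.34e-4))  # -> "5 h 33 min"
print(outage_per_year(2.97e-4))  # -> "2 h 36 min"
```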


Case study 2:

Dependability modelling of the CAUTRA

(French Air Traffic Control Computing System)

K. Kanoun, M. Borrel, T. Morteveille and A. Peytavin, “Modeling the Dependability of CAUTRA, a Subset of the French Air Traffic Control System”, IEEE Transactions on Computers, 48 (5), pp. 528-535, 1999


Objective of the study

•	Comparative analysis and evaluation, from the dependability point of view, of candidate architectures of the CAUTRA

•	Hardware and software failures and their propagation

•	Various fault tolerance and maintenance strategies

[Diagram: the CAUTRA comprises the RDP (Radar Data Processing system), the FPP (Flight Plans Processing system) and the STIP (Initial Flight Plans processing system), deployed in the CCRs (Regional Control Centers) and in the CESNAC (Centralized Exploitation Center), interconnected by the RENAR network.]

Reference system

•	Distributed architecture
•	Hardware fault tolerance based on dynamic redundancy: duplex, 2 computers DG1 & DG2
•	Software fault tolerance using exception handling and replication: 2 replicas for each application, pal and sec
•	Communication system
•	Organization: each computer hosts 1 pal and 1 sec replica (of different applications)

[Diagram: each computer hosts one pal and one sec replica of different applications, e.g. DG1: RDPpal + FPPsec, DG2: FPPpal + RDPsec.]

Reconfiguration after a software failure

[Diagram: after a failure of FPPpal, a switch for recovery gives FPPsec the pal role; FPPsec is then restarted; finally a switch back for reconfiguration restores the initial roles. The sequence is shown for both organizations O1 and O2.]

Alternative architectures

•	Organization
  - O1: FPPpal and RDPsec on one computer, FPPsec and RDPpal on the other
  - O2: FPPpal and RDPpal on one computer, FPPsec and RDPsec on the other
•	Software reconfiguration
  - R1: after switching the replica roles and restarting sec, the replicas switch back to their initial roles
  - R2: after switching the replica roles and restarting sec, the replicas retain their new roles
•	Hardware fault tolerance
  - S1: duplex architecture (no additional spare)
  - S2: duplex + 1 spare, switch to spare after DG1 or DG2 failure
  - S3: duplex + 1 spare, switch to spare after DG1 and DG2 failure

⇒ 12 possible architectures: A.o.r.s
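The count of 12 is just the product 2 × 2 × 3, which can be made explicit by enumerating the A.o.r.s names (the suffix digits 0/1/2 follow the slide naming, e.g. A.1.1.0):

```python
from itertools import product

# Enumerate the 12 candidate architectures A.o.r.s:
# 2 organizations x 2 reconfiguration policies x 3 hardware options.
orgs = ["1", "2"]      # O1, O2
reconfs = ["1", "2"]   # R1, R2
hw = ["0", "1", "2"]   # suffix used in the slide names (A.x.y.0, .1, .2)
archs = [f"A.{o}.{r}.{s}" for o, r, s in product(orgs, reconfs, hw)]
print(len(archs), archs[0])  # -> 12 A.1.1.0
```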


Modeling

•	Constant transition rates → homogeneous Markov chains
  - software and hardware failure rates
  - recovery, reconfiguration and maintenance rates
•	Modeling of interactions and error propagation between hardware and software components
•	Model construction
  - Generalized Stochastic Petri Nets (GSPNs)
  - modular and compositional approach, with detailed description of hardware and software component behaviors
  - structural verification
•	Model validation
  - progressive approach to master complexity
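For a homogeneous Markov chain, the steady-state distribution solves πQ = 0 with Σπ = 1. A generic sketch of that computation (toy 2-state matrix, not the CAUTRA model):

```python
import numpy as np

# Generic steady-state solver for a homogeneous CTMC: solve pi Q = 0
# with sum(pi) = 1, by replacing one balance equation with the
# normalization constraint.
def steady_state(Q):
    n = Q.shape[0]
    A = np.vstack([Q.T[:-1], np.ones(n)])
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

# Toy 2-state chain: up (0) <-> down (1),
# failure rate 0.01/h, repair rate 0.1/h.
Q = np.array([[-0.01,  0.01],
              [ 0.10, -0.10]])
pi = steady_state(Q)
print(round(pi[0], 4))  # -> 0.9091 (availability = mu / (lam + mu))
```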


Modeling approach

•	Component nets
  - component behaviour resulting from their own permanent/transient faults
  - local fault tolerance strategies: error detection & handling
  - maintenance: local repair and restart actions
•	Dependency nets
  - dependencies: functional, structural, due to maintenance or fault tolerance strategies
  - HW-SW organisation
  - communication between components
  - global fault tolerance and maintenance strategies
•	Global model
  - composition of component nets and dependency nets
  - generic nets ⇒ enhanced reusability across several architectures


Dependencies

•	Hardware - Software
  - an error due to a hardware transient fault may propagate to the hosted software replicas
  - an error due to a hardware permanent fault in a computer leads to stopping the hosted software replicas
•	Software - Software
  - an error in a replica due to a permanent fault may propagate to the replicas with which it dialogs
•	Communication
  - following failures of the communication medium, the system has to switch from organization O1 to O2 if it is in O1 (switching from O2 to O1 is not allowed as long as communication is failed)
•	Fault tolerance and maintenance procedures
  - Software-Software: reconfiguration from sec to pal
  - Hardware-Hardware: repairman and spare are shared
  - Hardware-Software: global recovery strategy, order/priority of actions

Component and dependency nets

Component nets:
•	NDG1, NDG2: DG1 & DG2 computers
•	NFPP1, NFPP2, NRDP1, NRDP2: software replicas
•	NC: communication medium

Dependency nets:
•	NProp: propagation of a HW error to the hosted replicas
•	NStop: software stop after activation of a permanent HW fault
•	N'Prop: propagation of a SW error to communicating SW replicas
•	NRecRDP, NRecFPP: RDP, FPP software reconfiguration from sec to pal
•	NRep: HW repair
•	NCom: communication medium behaviour
•	NSynRDP, NSynFPP: synchronization between HW and SW recovery actions
•	NStrat: global reconfiguration strategy
•	NR1: switch back in the case of reconfiguration R1

RDP model and FPP model

[Diagrams: the RDP model composes the hardware model (NDG1, NDG2, NRep) with the software model (NRDP1, NRDP2, NRecRDP) through the dependency nets NStop, NProp and NSynRDP; the FPP model is built the same way (NFPP1, NFPP2, NRecFPP, NSynFPP) and additionally includes the N'Prop software-to-software propagation nets.]

Reference architecture model: A.1.1.0

[Diagram: global model obtained by composing the RDP and FPP models (component nets NDG1, NDG2, NFPP1, NFPP2, NRDP1, NRDP2 with their NProp, NStop, NRec and NSync dependency nets) with NC and the global recovery strategy nets NStrat and NR1.]

Other architecture models

•	A.2.1.0
  - adapt NR1 & NSynFPP for O2
•	A.1.2.0 and A.2.2.0
  - remove NR1 & adapt NSynFPP for A.2.2.0
•	R1 with spares: A.1.1.1, A.2.1.1, A.1.1.2, A.2.1.2
  - adapt NR1 & NSynFPP for O2
  - adapt NRep
  - add NSpare related to NRep
•	R2 with spares: A.1.2.1, A.2.2.1, A.1.2.2, A.2.2.2
  - remove NR1
  - adapt NSynFPP for O2
  - adapt NRep
  - add NSpare related to NRep

GSPN models

[Diagram: generic GSPN component nets. The hardware model has places H-ok, H-e, H-t, H-p, H-nd, H-fd, H-u, with transition rates λh, δh, ζh, μ and probabilities dh, pph; the software model has places S-ok, S-e, S-d, S-nd, S-fd, S-ft, S-u, with rates λs, δs, ν, πs, ζs and probabilities ds, ps, pps. The error propagation nets NProp and N'Prop connect the hardware nets (NDG1, NDG2) to the hosted software nets (NFPP, NRDP).]

Architecture models

•	Architectures without the “spare” computer: 104 states (after elimination of fast states; 1/τh, 1/δh, 1/τs, 1/πs, 1/δs: a few seconds)
•	Architectures with the “spare” computer: 390 states (after elimination of fast states)
•	Parameter numerical values:

λh = 0.01 / h   λRDP (λs) = 0.01 / h   λFPP (λs) = 0.05 / h   λC = 1E-5
1/μ = 10 h   1/νRDP = 1 mn   1/νFPP = 10 mn
1/βRDP = 1 s   1/βFPP = 1 mn   1/βh = 1 mn
ph = 0.02   ps = 0.02   pph = 0.85   pps = 0.7   c = 0.98   ch = 0.95

Some results: impact of the organization

[Plots: RDP unavailability UA vs communication failure rate λc (1E-8 to 1E-4) for A.1.1.0, A.2.1.0 and A.1.2.0/A.2.2.0 (S0), and for A.1.1.1, A.2.1.1 and A.1.2.1/A.2.2.1 (S1).]

RDP unavailability - S0: outage duration per year

λc     A.x.2.0      A.1.1.0      A.2.1.0
1E-8   21 mn 01 s   21 mn 17 s   20 mn 49 s
1E-7   21 mn 01 s   21 mn 17 s   20 mn 49 s
1E-6   21 mn 11 s   21 mn 36 s   20 mn 49 s
1E-5   22 mn 11 s   24 mn 39 s   20 mn 49 s
1E-4   24 mn 14 s   54 mn 40 s   20 mn 49 s
1E-3   25 mn 04 s   5 h 55 mn    20 mn 55 s

RDP unavailability - S1: outage duration per year

λc     A.x.2.1      A.1.1.1      A.2.1.1
1E-8   1 mn 27 s    1 mn 29 s    1 mn 25 s
1E-7   1 mn 28 s    1 mn 31 s    1 mn 25 s
1E-6   1 mn 36 s    1 mn 48 s    1 mn 25 s
1E-5   2 mn 35 s    4 mn 42 s    1 mn 25 s
1E-4   4 mn 42 s    33 mn 35 s   1 mn 30 s
1E-3   5 mn 31 s    5 h 20 mn    2 mn 04 s

Impact of the organization: RDP unavailability - S2

[Plot: UA vs λc for A.1.1.2, A.2.1.2 and A.1.2.2/A.2.2.2.]

Outage duration per year:

λc     A.x.2.2      A.1.1.2      A.2.1.2
1E-8   15 mn 46 s   15 mn 59 s   15 mn 40 s
1E-7   15 mn 46 s   15 mn 59 s   15 mn 40 s
1E-6   15 mn 56 s   16 mn 18 s   15 mn 40 s
1E-5   16 mn 55 s   19 mn 17 s   15 mn 40 s
1E-4   18 mn 58 s   49 mn 06 s   15 mn 46 s
1E-3   19 mn 49 s   5 h 45 mn    16 mn 21 s

Reconfiguration impact

FPP unavailability, per year:

        A.1.2.z      A.2.2.z      A.1.1.z      A.2.1.z
z = 0   2 h 42 mn    2 h 42 mn    2 h 42 mn    2 h 41 mn
z = 1   2 h 17 mn    2 h 17 mn    2 h 17 mn    2 h 17 mn
z = 2   2 h 36 mn    2 h 36 mn    2 h 36 mn    2 h 36 mn

RDP unavailability, per year:

        A.1.2.z      A.2.2.z      A.1.1.z      A.2.1.z
z = 0   21 mn 11 s   21 mn 11 s   21 mn 36 s   20 mn 49 s
z = 1   1 mn 36 s    1 mn 36 s    1 mn 48 s    1 mn 25 s
z = 2   15 mn 56 s   15 mn 56 s   16 mn 18 s   15 mn 40 s

Hardware coverage impact

RDP unavailability, per year:

ch     A.x.2.1     A.x.2.2      A.1.1.1     A.1.1.2      A.2.1.1     A.2.1.2
0.7    6 mn 07 s   15 mn 49 s   6 mn 44 s   15 mn 59 s   6 mn 34 s   15 mn 37 s
0.9    2 mn 29 s   15 mn 49 s   2 mn 32 s   15 mn 59 s   2 mn 27 s   15 mn 37 s
0.95   1 mn 27 s   15 mn 46 s   1 mn 29 s   15 mn 59 s   1 mn 25 s   15 mn 37 s
1      25 s        15 mn 45 s   26 s        15 mn 59 s   23 s        15 mn 37 s

Tradeoff: repair duration vs 3rd computer

RDP unavailability, per year, vs repair duration 1/μ and coverage ch:

1/μ      ch    A.2.1.1     A.2.1.0
1 h      0.5   2 mn 20 s   2 mn 16 s
1 h      0.8   2 mn 18 s   2 mn 16 s
1.25 h   0.5   2 mn 51 s   2 mn 45 s
1.25 h   0.8   2 mn 47 s   2 mn 45 s
2 h      0.5   4 mn 21 s   4 mn 20 s
2 h      0.8   4 mn 07 s   4 mn 20 s

Sensitivity analyses

FPP unavailability (per year) vs pal/sec switching duration 1/β:

1/β     A.2.1.0      A.2.1.2      A.2.1.1
10 mn   18 h 39 mn   18 h 34 mn   18 h 02 mn
6 mn    11 h 33 mn   11 h 28 mn   11 h 02 mn
1 mn    2 h 41 mn    2 h 36 mn    2 h 17 mn
1 s     56 mn 46 s   50 mn 59 s   33 s

FPP unavailability (per year) vs replica restart duration 1/ν:

1/ν     A.2.1.0      A.2.1.1      A.2.1.2
1 h     6 h 20 mn    5 h 34 mn    6 h 10 mn
10 mn   2 h 41 mn    2 h 17 mn    2 h 36 mn
6 mn    2 h 27 mn    2 h 05 mn    2 h 22 mn
1 mn    2 h 10 mn    1 h 50 mn    2 h 05 mn

Conclusion

•	CAUTRA
  - only the S1 architectures satisfy the requirement of less than 5 mn unavailability per year
  - best availability for A.2.2.1
•	Modeling approach
  - modular
  - generic and reusable nets
  - progressive construction and validation ⇒ better confidence in the model
•	Analysis of the impact of CAUTRA failures on air traffic safety:
  N. Fota, M. Kaâniche, K. Kanoun, “Dependability Evaluation of an Air Traffic Control Computing System”, 3rd IEEE International Computer Performance & Dependability Symposium (IPDS-98), Durham, NC, USA, pp. 206-215, IEEE Computer Society Press, 1998. Extended version in Performance Evaluation, Elsevier, 35 (3-4), pp. 253-273, 1999

Evaluation with regard to Malicious Threats

Security evaluation: context

•	Historically, attention has mainly focused on prevention and protection approaches, and less on evaluation
•	Traditional evaluation methods
  - qualitative assessment criteria (verification & evaluation): TCSEC (USA), ITSEC (Europe), Common Criteria; security levels based on functional and assurance criteria
  - risk assessment methods: subjective evaluation of vulnerabilities, threats and consequences
  - red teams: try to penetrate or compromise the system
  - not well suited to take into account the dynamic evolution of systems, their environment and threats during operation, or to support objective design decisions
•	Need for security quantification approaches similar to those used in dependability with respect to accidental faults

Security quantification challenges

•	Defining representative measures
  - are new measures needed?
•	Modeling attacker behaviors and system vulnerabilities, and assessing their impact on security properties
  - how different is it, compared to modeling accidental faults and their consequences?
•	Elaborating representative assumptions
  - continuous evolution of threats and attacker behaviors
•	Need for unbiased and detailed data

Three ingredients: measures, models, data

Measures and models

•	Feasibility of a probabilistic security quantification explored early in the 1990s (PDCS and DeVa projects)
  - measure = effort needed for a potential attacker to defeat the security policy [City U.]
  - preliminary experiments using tiger teams [Chalmers U.]
  - a “white-box” approach for modeling system vulnerabilities and quantifying security, using the “privilege graph” [LAAS-CNRS]
•	Graph-based models for the description of attack scenarios
  - attack graphs, attack trees, etc.
•	Stochastic state-based models to assess intrusion-tolerant systems
  - DPASA “Designing Protection and Adaptation into a Survivable Architecture”
  - SITAR intrusion-tolerant system [Duke, MCNC]
•	Epidemiological malware propagation models
•	Complex network theory, game theory, etc.

LAAS quantitative evaluation approach

•	Motivation
  - take into account security/usability trade-offs
  - monitor security evolution according to configuration and use changes
  - identify the best security improvement for the least usability change
•	Probabilistic modeling framework
  - vulnerabilities
  - attackers
  - measure = effort needed for a potential attacker to defeat the security policy
•	Application to Unix-based systems

R. Ortalo, Y. Deswarte and M. Kaâniche, “Experimenting with Quantitative Evaluation Tools for Monitoring Operational Security”, IEEE Transactions on Software Engineering, 25 (5), pp. 633-650, 1999

Overview

Vulnerabilities modeling: privilege graph
•	node = set of privileges
•	arc = vulnerability class
•	path = sequence of vulnerabilities that could be exploited by an attacker to defeat a security objective
•	weight = for each arc, the effort needed to exploit the vulnerability

Generation of attack scenarios, under two attacker assumptions:
•	LM: Local Memory
•	TM: Total Memory

Measure: METF (Mean Effort To security Failure)

Application to the LAAS LAN, with the ESOPE tool.

[Figures: an example privilege graph from the intruder node to the security objective, with effort weights on the arcs, and the evolution over time of METF-SP, METF-LM, METF-TM and of the number of attack paths monitored on the LAAS LAN.]
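To give the flavour of a computation on such a graph, here is a sketch on a hypothetical 4-node privilege graph (node names and weights are invented). Note this finds only the cheapest attack path, a crude proxy: the actual METF is a mean over attacker behaviours (LM/TM assumptions), not a shortest path.

```python
import heapq

# Hypothetical privilege graph: arcs weighted by the effort
# needed to exploit the corresponding vulnerability class.
graph = {
    "intruder": [("userA", 3.0), ("userB", 6.0)],
    "userA":    [("objective", 5.0)],
    "userB":    [("objective", 1.0)],
    "objective": [],
}

def min_effort(graph, src, dst):
    """Dijkstra over effort weights: effort of the cheapest attack path."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, w in graph[node]:
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")

print(min_effort(graph, "intruder", "objective"))  # -> 7.0
```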


Open issues

•	Is the model valid in the real world?
  - TM and LM are two extreme behaviors, but what would a “real” attacker behavior be?
•	Weight parameters are assessed arbitrarily (subjectively?)
•	Tenacity? Collusion? Attack rates?
•	Validation based on real data would be needed!

Data

•	Measurement
  - monitoring and collection of real attack data, to learn about malicious activities on the Internet: exploited vulnerabilities, attack tools, propagation, etc.
•	Data sources
  - vulnerability databases: OSVDB (Open Source Vulnerability Database), osvdb.org; SECURITYFOCUS, securityfocus.com; BUGTRACK, bugtrack.net; CVE (Common Vulnerabilities and Exposures), cve.mitre.org
  - log sharing: firewalls, routers, intrusion detection tools, …; DShield, dshield.org
  - honeypots deployed on the Internet: dedicated systems to observe attacks

Honeypots

“An information system resource whose value lies in unauthorized or illicit use of that resource”
L. Spitzner, Honeypots: Tracking Hackers, Addison-Wesley, 2002

•	Low-interaction honeypots
  - emulate the existence of a target at different abstraction levels (network services, OS, application), without giving the attacker control of the system
  - low risk, easy to implement
•	High-interaction honeypots
  - a real or virtual system that can be compromised by the attacker
  - higher risk, difficult to deploy

Data collection platforms

•	Honeynet project
  - architectures and data collection tools
  - uses honeypots
•	CAIDA, caida.org
  - uses special probes (“telescopes”) to monitor unused address spaces
•	Leurré.com
  - deploys on the Internet a large number of identically configured low-interaction honeypots, at diverse locations
  - carries out analyses of the collected data to better understand threats, and builds models to characterize attacks
  - set up by Institut Eurecom, Sophia-Antipolis

Leurré.com

80 honeypots, 30 countries, 5 continents.

[Diagram: each platform is connected to the Internet through a reverse firewall and a virtual switch, emulates three virtual machines (a Windows 98 workstation, a Windows NT ftp+web server, a Redhat 7.3 ftp server), and is monitored by an observer running tcpdump.]

Win-win partnership

•	The interested partner provides:
  - one old PC (Pentium II, 128 MB RAM, 233 MHz, …)
  - 4 routable IP addresses
•	Eurecom offers:
  - installation CD-ROM
  - remote log collection and integrity check
  - access to the whole SQL database
•	Collaborative research projects using the collected data
  - CADHo (Eurecom, LAAS, CERT-Renater): analysis and modeling of attack processes using low-interaction honeypot data; development and deployment of high-interaction honeypots to analyze the behavior of attackers once they get access to and compromise a target

Overview of collected data

•	Data collection since 2003
  - 3,026,962 different IP addresses from more than 100 countries
  - 80 honeypot platforms deployed progressively

[Plot: number of deployed honeypot platforms vs date.]

•	Information extracted from the logs
  - raw packets (entire frames, including payloads)
  - IP address of the attacking machine
  - time and duration of the attack
  - targeted virtual machines and ports
  - geographic location of the attacking machine (Maxmind, NetGeo)
  - OS of the attacking machine (p0f, ettercap, disco)

Analysis and modeling of attack processes

•	Automatic data analyses to extract useful trends and identify hidden phenomena in the data
  - categorization of attacks according to their origin, attack tools, targeted services and machines, etc.
  - analysis of the similarity of attack patterns across different honeypot environments, etc.
  - publications available at www.leurrecom.org/paper.htm
•	Stochastic modeling of attack processes
  - identify the probability distributions that best characterize attack occurrence and attack propagation processes
  - analyze whether data collected from different platforms exhibit similar or different malicious activities
  - predict the occurrence of new attacks on a given platform, based on past observations on this platform and on other platforms
  - publications available at www.cadho.org

“Times between attacks” distribution

•	An attack is associated with an IP address
  - its occurrence time is the first time a packet is received from the corresponding address
•	Best fit provided by a mixture distribution

[Plot: pdf of the times between attacks on platform 6, with the empirical data, the fitted mixture (Pareto, exponential) and a pure exponential fit; Pa = 0.0115, k = 0.1183, λ = 0.1364/sec.]

pdf(t) = Pa · k / (1 + t)^(k+1) + (1 − Pa) · λ e^(−λt)

“Times between attacks” distributions (2)

[Plots: the same mixture (Pareto, exponential) fit for four other platforms (5, 9, 20 and 23), compared with the empirical data and a pure exponential fit. The fitted parameter sets are: (Pa = 0.0019, k = 0.1668, λ = 0.276/sec., p-value = 0.99), (Pa = 0.0144, k = 0.0183, λ = 0.0136/sec., p-value = 0.90), (Pa = 0.0031, k = 0.1240, λ = 0.275/sec., p-value = 0.985), (Pa = 0.0051, k = 0.173, λ = 0.121/sec., p-value = 0.90).]

Propagation of attacks

•	A propagation is assumed to occur when the IP address of an attacking machine observed at a given platform is also observed at another platform

[Diagram: propagation percentages between platforms P5, P6, P9, P20 and P23; most addresses stay on a single platform (e.g. 96.1% and 95.5% self-loops), while cross-platform propagation ranges from under 1% to about 59%.]

Discussion

•	Preliminary models to characterize the attack processes observed on low-interaction honeypots
•	Several open issues
  - need for predictive models that can be used to support decision making during design and operation
  - how to assess the impact of attacks on the security of target systems?
•	Honeypots with a higher degree of interaction are needed to analyse attacker behavior once attackers manage to compromise and gain access to a target
  - first results demonstrate their usefulness and complementarity with low-interaction honeypots [Alata et al. 2006]

E. Alata, V. Nicomette, M. Kaâniche, M. Dacier and M. Herrb, “Lessons Learned from the Deployment of a High-Interaction Honeypot”, Sixth European Dependable Computing Conference (EDCC-6), Coimbra, Portugal, pp. 39-44, IEEE Computer Society, 2006

High-interaction honeypot

•	Goal: analyze attacker behavior once attackers compromise and get access to a target

[Diagram: the Internet reaches, through a firewall, a Debian host running VMware; the virtual Debian machine exposes ssh, and the collected data is stored in a database.]

Instrumentation and monitoring (modification of the virtual machines):
•	tty driver (kernel space): read/write on the terminal
•	exec system call (kernel space): programs run by the attacker
•	SSH modification (user space): archival of logins and passwords
•	new system call (kernel space): backup of the information recorded from the user into kernel memory
•	redirection mechanism at the firewall for outgoing connections
•	tcpdump

Experimentation

•	Deployment: 5 January 2006 to 20 March 2007 (419 days)
•	Creation of 21 user accounts (login, password)
•	655 different source addresses observed on the honeypot
•	552,362 ssh connections
•	98,347 tested (login, password) pairs
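For orientation, the raw figures above reduce to per-day rates with simple arithmetic:

```python
# Derived per-day rates from the reported deployment figures
# (419 days, 552,362 ssh connections, 98,347 tested account pairs).
days = 419
connections = 552_362
tested_accounts = 98_347

print(round(connections / days))      # -> 1318 connection attempts per day
print(round(tested_accounts / days))  # -> 235 (login, password) pairs per day
```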


Intrusion process

Account creation → (Δ1) → first password guessing → (Δ2) → first intrusion

User account   Δ1           Δ2
UA1            1 day        4 days
UA2            half a day   4 minutes
UA3            15 days      1 day
UA4            5 days       10 days
UA5            1 day        4 days
…

Analysis of ssh connections / attack sessions

[Flow diagram:]
•	552,362 connection attempts, grouped by source IP address into 1,940 sessions
  - 1,730 sessions without commands, of which 1,391 are dictionary attacks
  - 210 sessions with commands: intrusions
    - 104 intrusions with typing errors (humans)
    - 104 intrusions without typing errors (?), of which:
      - 57 with input entered block by block
      - 47 with input entered character by character (humans)

Dictionary attacks (1,391)

•	A dictionary attack is a sequence of attempts to connect to a service by submitting login/password pairs picked from a dictionary
•	Observed dictionary attacks:
  - IPs dedicated to dictionary attacks: never used for intrusion attacks
  - large dictionaries
  - very short time intervals between successive connection attempts
⇒ automated scripts

Intrusions (210)

•	An intrusion attack is an illegal use of local services thanks to the illegal use of a valid user identity
•	Observed intrusion attacks:
  - half of these IPs came from the same country
  - use of the ssh service
  - IPs dedicated to intrusion attacks: never used for dictionary attacks
⇒ no intersection between the IPs used to perform dictionary attacks and the IPs used for real intrusions

Behavior of attackers (1)

•	Type of the attackers: humans or programs
•	Identification criteria used:
  - typing mistakes
  - typing speed
  - character-by-character input vs block input

Of 35 analysed intrusions:
  - 25 intrusions with mistakes ⇒ human
  - 13 intrusions without mistakes, of which:
    - 9 with character-by-character input ⇒ human
    - 4 with single-character commands ⇒ ?

⇒ Intrusions performed by human beings

Behavior of attackers (2)

1. They change the password of the hacked account
2. They try to hide their activities
3. They try to download malware; 70% of the attackers were unable to download their malware
4. Four main activities:
   - launching ssh scans on other networks
   - launching irc clients
   - trying to become root
   - phishing activity (1 attempt)

Well-known attack methods (“cooking recipes”) ⇒ not experienced hackers

Summary

[Diagram: (1) machines with IPs dedicated to dictionary attacks run automated scripts against the high-interaction honeypot (ssh + weak passwords); (2, 3) the harvested accounts are presumably shared through, and retrieved from, some knowledge base of attacks; (4) machines with IPs dedicated to intrusion attacks are then used by humans to log in.]

Dependability evaluation: challenges and gaps

Unified approaches are needed for the resilience evaluation and benchmarking of large systems and critical infrastructures, combining analytical, simulation and experimental techniques, and bridging evaluation with respect to accidental faults with evaluation with respect to malicious faults. Key challenges: complexity, evolution of threats, interdependencies, dynamicity, socio-technical and economic dimensions, runtime assessment.


References

� General background

� K. Trivedi, “Probability and Statistics with Reliability, Queuing, and Computer Science Applications”, 2nd Edition, John Wiley and Sons, New York, 2001

� R. A. Howard, “Dynamic Probabilistic Systems”, vol. I and II, John Wiley & Sons, 1971

� J.-C. Laprie, “Dependability Handbook”, CÉPADUES-ÉDITIONS, 1995 (in French)

� B. Haverkort, R. Marie, G. Rubino, K. Trivedi, “Performability Modelling: Techniques and Tools”, John Wiley & Sons, ISBN 0-471-49195-0

� M. Ajmone Marsan, G. Balbo, G. Conte, S. Donatelli and G. Franceschinis, “Modelling with Generalized Stochastic Petri Nets”, Wiley Series in Parallel Computing, John Wiley and Sons, ISBN 0-471-93059-8 (http://www.di.unito.it/~greatspn/bookdownloadform.html)

� Dependability Modeling

� J. Bechta-Dugan, S. J. Bavuso and M. A. Boyd, “Dynamic fault-tree models for fault-tolerant computer systems”, IEEE Transactions on Reliability, 41, pp.363-377, 1992

� C. Betous-Almeida and K. Kanoun, “Construction and Stepwise Refinement of Dependability Models”, Performance Evaluation, 56, pp.277-306, 2004

� N. Fota, M. Kaâniche, K. Kanoun, “Dependability Evaluation of an Air Traffic Control Computing System”, 3rd IEEE International Computer Performance & Dependability Symposium (IPDS-98), (Durham, NC, USA), pp.206-215, IEEE Computer Society Press, 1998. Also published in Performance Evaluation, Elsevier, 35(3-4), pp.253-273, 1999

� M. Kaâniche, K. Kanoun and M. Rabah, “Multi-level modelling approach for the availability assessment of e-business applications”, Software: Practice and Experience, 33 (14), pp.1323-1341, 2003

� K. Kanoun, M. Borrel, T. Morteveille and A. Peytavin, “Modeling the Dependability of CAUTRA, a Subset of the French Air Traffic Control System”, IEEE Transactions on Computers, 48 (5), pp.528-535, 1999

� I. Mura and A. Bondavalli, “Markov Regenerative Stochastic Petri Nets to Model and Evaluate the Dependability of Phased Missions”, IEEE Transactions on Computers, 50 (12), pp.1337-1351, 2001

� M. Rabah and K. Kanoun, “Performability evaluation of multipurpose multiprocessor systems: the ‘separation of concerns’ approach”, IEEE Transactions on Computers, 52 (2), pp.223-236, 2003


References

� Dependability Modeling (cont'd)

� W. H. Sanders and J. F. Meyer, “Stochastic activity networks: Formal definitions and concepts”, in Lectures on Formal Methods and Performance Analysis, Lecture Notes in Computer Science 2090, pp.315-343, Springer-Verlag, 2001

� M. Kaâniche, K. Kanoun, M. Martinello, “User-Perceived Availability of a Web-based Travel Agency”, in IEEE International Conference on Dependable Systems and Networks (DSN-2003), Performance and Dependability Symposium, (San Francisco, USA), pp.709-718, 2003

� A. E. Rugina, K. Kanoun, M. Kaâniche, “A System Dependability Modeling Framework using AADL and GSPNs”, in Architecting Dependable Systems IV, vol. 4615, LNCS, R. de Lemos, C. Gacek, and A. Romanovsky, Eds., Springer-Verlag, 2007, pp.14-38

� Experimental measurements and Benchmarking

� P. Koopman, J. DeVale, “The exception handling effectiveness of POSIX Operating Systems”, IEEE Transactions on Software Engineering, vol. 26, n°9, 2000

� J. Arlat et al., “Fault Injection for Dependability Evaluation: A Methodology and Some Applications”, IEEE Transactions on Software Engineering, vol. 16, n°2, 1990

� J. Carreira, H. Madeira, J. G. Silva, “Xception: A Technique for the Evaluation of Dependability in Modern Computers”, IEEE Transactions on Software Engineering, vol. 24, n°2, 1998

� E. Marsden, J.-C. Fabre, J. Arlat, “Dependability of CORBA Systems: Service Characterization by Fault Injection”, SRDS-2002, pp.276-285, 2002

� R. Chillarege et al., “Orthogonal Defect Classification — A Concept for In-process Measurements”, IEEE Transactions on Software Engineering, vol. 18, n°11, 1992

� R. Iyer, Z. Kalbarczyk, “Measurement-based Analysis of System Dependability using Fault Injection and Field Failure Data”, Performance 2002, LNCS 2459, pp.290-317, 2002

� D. P. Siewiorek et al., “Reflections on Industry Trends and Experimental Research in Dependability”, IEEE Transactions on Dependable and Secure Computing, vol. 1, n°2, April-June 2004

� K. Kanoun, L. Spainhower (Eds.), “Dependability Benchmarking for Computer Systems”, John Wiley & Sons, Inc., IEEE CS Press, 2008, ISBN 978-0470-23055-8


References

� Security assessment

� B. Littlewood, S. Brocklehurst, N. Fenton, P. Mellor, S. Page, D. Wright, J. Dobson, J. McDermid and D. Gollmann, “Towards Operational Measures of Computer Security”, Journal of Computer Security, 2, pp.211-229, 1993

� M. Dacier, M. Kaâniche, Y. Deswarte, “A Framework for Security Assessment of Insecure Systems”, 1st Year Report of the ESPRIT Basic Research Action 6362: Predictably Dependable Computing Systems (PDCS2), pp.561-578, September 1993

� E. Jonsson and T. Olovsson, “A Quantitative Model of the Security Intrusion Process Based on Attacker Behavior”, IEEE Transactions on Software Engineering, 23 (4), pp.235-245, April 1997

� M. Kaâniche, E. Alata, V. Nicomette, Y. Deswarte and M. Dacier, “Empirical Analysis and Statistical Modeling of Attack Processes based on Honeypots”, in WEEDS 2006, Workshop on Empirical Evaluation of Dependability and Security (in conjunction with the International Conference on Dependable Systems and Networks, DSN 2006), pp.119-124, 2006

� B. B. Madan, K. Goseva-Popstojanova, K. Vaidyanathan and K. Trivedi, “Modeling and Quantification of Security Attributes of Software Systems”, in IEEE International Conference on Dependable Systems and Networks (DSN 2002), (Washington, DC, USA), pp.505-514, IEEE Computer Society, 2002

� D. M. Nicol, W. H. Sanders and K. S. Trivedi, “Model-based Evaluation: From Dependability to Security”, IEEE Transactions on Dependable and Secure Computing, 1 (1), pp.48-65, 2004

� R. Ortalo, Y. Deswarte and M. Kaâniche, “Experimenting with Quantitative Evaluation Tools for Monitoring Operational Security”, IEEE Transactions on Software Engineering, 25 (5), pp.633-650, 1999

� V. Gupta, V. V. Lam, H. V. Ramasamy, W. H. Sanders and S. Singh, “Dependability and Performance Evaluation of Intrusion-Tolerant Server Architectures”, in First Latin-American Symposium on Dependable Computing (LADC 2003), (Sao Paulo, Brazil), pp.81-101, IEEE Computer Society, 2003

� F. Pouget, M. Dacier, J. Zimmerman, A. Clark, G. Mohay, “Internet attack knowledge discovery via clusters and cliques of attack traces”, Journal of Information Assurance and Security, Volume 1, Issue 1, March 2006, pp.21-32

� E. Alata, V. Nicomette, M. Kaâniche, M. Dacier and M. Herrb, “Lessons Learned from the Deployment of a High-Interaction Honeypot”, in Sixth European Dependable Computing Conference (EDCC-6), (Coimbra, Portugal), pp.39-44, IEEE Computer Society, 2006