Assessing the Impact of Imperfect Diagnosis on...

Assessing the Impact of Imperfect Diagnosis on Service Reliability: A Parsimonious Model Approach

Networking and Security Group Aalborg University, Denmark ljg@es.aau.dk

European Dependable Computing Conference 2010 – Valencia, Spain April 28, 2010

Jesper Grønbæk (presenter), Hans-Peter Schwefel, Jens Kristian Kjærgård, Thomas S. Toftegaard

Tieto IP Solutions, Denmark

Aarhus School of Engineering, University of Aarhus, Denmark

Forschungszentrum Telekommunikation Wien, Austria

Imperfect Diagnosis – Background and Motivation

• Network fault diagnosis
  - Dependable end-user service provisioning in Next Generation Network architectures
  - Dominated by wireless networks, mobility and varying traffic conditions
  - Challenged by unreliable observations and hidden network states
  - Imperfect diagnosis
• Modelling imperfect diagnosis
  - Goals of modelling
    A. Determine best remediation actions
    B. Determine best trade-off of imperfections
  - Assess properties of a given diagnosis component (function-level modelling [1], system-level simulation [2])
  - Light-weight models desirable for frequent model re-evaluations

Imperfect Diagnosis – Example: Decentralized Fault Management Framework

• ODDR decentralized fault management framework [3][4] (Observation, Diagnosis, Decision and Remediation)
  - End-node driven fault management
  - Joint view on imperfect diagnosis and decisions (remediation, observation collection)
  - Operation in dynamic environment -> frequent model re-evaluations
• Subsequent focus on trade-off of imperfections (best diagnosis settings)

Background on Diagnosis Approaches – Definitions of Diagnosis Outcomes

• Diagnosis atomic view
  - Single observation
  - Two network states (Normal/Fault)
  - Discrete diagnosis steps (period T)
• Generic diagnosis (state estimation) definitions
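The four outcome labels used on the later slides (TN, FP, FN, TP) can be illustrated with a minimal sketch. It assumes the usual confusion-matrix reading, in which a "positive" is a diagnosed fault; the function names and the string labels "Normal"/"Fault" are hypothetical and chosen only for illustration.

```python
from collections import Counter


def diagnosis_outcome(network_state: str, diagnosed_state: str) -> str:
    """Classify one diagnosis step as TN, FP, FN or TP.

    Both arguments are "Normal" or "Fault" (labels assumed for this sketch).
    """
    positive = diagnosed_state == "Fault"       # the diagnosis reports a fault
    correct = diagnosed_state == network_state  # the estimate matches the hidden state
    if positive:
        return "TP" if correct else "FP"
    return "TN" if correct else "FN"


def rates(pairs):
    """Estimate (TPR, TNR) from (network_state, diagnosed_state) pairs.

    The pairs must contain at least one Normal and one Fault step,
    otherwise a division by zero occurs.
    """
    counts = Counter(diagnosis_outcome(n, d) for n, d in pairs)
    tpr = counts["TP"] / (counts["TP"] + counts["FN"])
    tnr = counts["TN"] / (counts["TN"] + counts["FP"])
    return tpr, tnr
```

For example, `diagnosis_outcome("Normal", "Fault")` is a false alarm and yields `"FP"`.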

Background on Diagnosis Approaches – Diagnosis Classes

• Two levels of complexity of diagnosis behaviour
  - One-shot¹: diagnosis estimate based on a single set of observations in time
    · No correlation of diagnosis estimates from diagnosis
    · Simple model representation proposed in [3]
  - Over-time¹: diagnosis estimate based on new and old observations
    · Means to improve diagnosis estimates
    · Strong correlation added by diagnosis component
• Comparison (2000 repetitions)
  - One-shot: threshold on round-trip time (RTT)
  - Over-time: α-count heuristic (Bondavalli et al. [1]) on one-shot estimates
  - Transient effects from network neglected
• Over-time has highly transient phase; yet significant improvement
  - Identify best trade-off: reaction time and false alarms
  - Simple parameterization from steady-state behaviour is difficult

¹ Terminology adapted from [5]

Parsimonious Diagnosis Model – Definition and Parameters

• Four-state Markov model presented in [3]
  - Controlled by geometric ON-OFF network state process (fault/repair occurrence) {pf, pr}
  - 2 free parameters {P(TN | Ns = Normal) = TNR = (1 - FPR), P(TP | Ns = Fault) = TPR = (1 - FNR)}
• Explore model capabilities
  - Remediation assumption: fail-over on network fault state diagnosis
  - 6 free parameters
  - Fixed {pf, pr} -> 4 free parameters

[Figure: system equations]
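As a rough illustration of the two-free-parameter variant, the sketch below builds a four-state transition matrix over (TN, FP, FN, TP). It assumes that the network state first moves according to the geometric ON-OFF process {pf, pr} and that the diagnosis outcome is then drawn independently from TNR or TPR; this is one plausible reading of the uncorrelated one-shot case, not necessarily the exact construction in [3]. The function and constant names are hypothetical.

```python
STATES = ("TN", "FP", "FN", "TP")


def one_shot_transition_matrix(pf, pr, tnr, tpr):
    """Return a row-stochastic transition matrix over STATES as nested dicts.

    pf, pr: per-step fault and repair probabilities of the ON-OFF process.
    tnr, tpr: P(TN | Ns = Normal) and P(TP | Ns = Fault).
    """
    matrix = {}
    for src in STATES:
        network_is_normal = src in ("TN", "FP")  # TN and FP occur in the Normal state
        # Normal -> Fault with probability pf, Fault -> Normal with probability pr.
        p_next_normal = (1 - pf) if network_is_normal else pr
        p_next_fault = 1 - p_next_normal
        # Assumed: the outcome depends only on the next network state (no memory).
        matrix[src] = {
            "TN": p_next_normal * tnr,
            "FP": p_next_normal * (1 - tnr),
            "FN": p_next_fault * (1 - tpr),
            "TP": p_next_fault * tpr,
        }
    return matrix
```

The six-parameter model on the slide relaxes this independence; with {pf, pr} fixed, four transition probabilities remain free to capture correlation added by an over-time diagnosis component.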

Parsimonious Diagnosis Model – Diagnosis Metrics Definitions

• Diagnosis metrics
  - Proposed metrics (steady state)
    · Probability of remediation on false alarm (pRFA)
    · Mean remediation reaction time (µRRT)
  - Note: two parameters and four free
• Diagnosis trace
  - Start diagnosis in normal network state for a given set {pf, pr}
  - Observe until alarm is diagnosed
  - Perform M repetitions and derive O = #FA
    · pRFA = O/M
    · µRRT: mean time to remediation over all M
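The trace procedure above can be sketched as a small Monte Carlo experiment. This is only an illustration under stated assumptions: the diagnosis is an uncorrelated one-shot estimator with fixed TNR and TPR, the first alarm triggers remediation, and the reaction time is counted from the start of the trace in steps of length T. The function name and argument names are hypothetical.

```python
import random


def simulate_traces(pf, pr, tnr, tpr, step_s, repetitions, seed=0):
    """Estimate (pRFA, muRRT) from `repetitions` diagnosis traces.

    Each trace starts in the normal network state and runs until the
    first alarm. pRFA = O / M, where O counts traces ending in a false alarm.
    """
    rng = random.Random(seed)
    false_alarms = 0
    total_steps = 0
    for _ in range(repetitions):
        network_normal = True
        steps = 0
        while True:
            steps += 1
            # Geometric ON-OFF network state process.
            if network_normal:
                if rng.random() < pf:
                    network_normal = False
            elif rng.random() < pr:
                network_normal = True
            # One-shot diagnosis of the current (hidden) network state.
            if network_normal:
                alarm = rng.random() >= tnr  # false positive with probability 1 - TNR
            else:
                alarm = rng.random() < tpr   # true positive with probability TPR
            if alarm:
                if network_normal:
                    false_alarms += 1
                break
        total_steps += steps
    p_rfa = false_alarms / repetitions
    mu_rrt = step_s * total_steps / repetitions
    return p_rfa, mu_rrt
```

A call such as `simulate_traces(0.0322, 0.0267, 0.9, 0.95, 0.4, 2000)` mirrors the 2000-repetition setting mentioned on the slides; the TNR and TPR values in that call are placeholders, not results from the presentation.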

Parsimonious Diagnosis Model – Diagnosis Metrics Equations

• Closed-form equations derived by linear algebraic approaches [6]
  - Probability of remediation on false alarm (pRFA): probability of absorption
  - Mean remediation reaction time (µRRT): mean time to absorption
• Solving yields two linear equations:

[Figure: Markov model with the initial state and the absorbing states marked]
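The absorption view can be made concrete with a short sketch. It assumes that the alarm states FP and TP are absorbing (fail-over on any fault diagnosis), that TN is the initial state, and that TN and FN are the only transient states; each metric then follows from a 2x2 linear system, solved here with Cramer's rule. The function names and the nested-dict matrix layout are hypothetical, and the slide's own closed-form expressions are not reproduced.

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve [[a11, a12], [a21, a22]] @ [x1, x2] = [b1, b2] by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    return (b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det


def absorption_metrics(P, step_s):
    """Return (pRFA, muRRT) for a chain over TN, FP, FN, TP started in TN.

    P[src][dst] holds one-step transition probabilities. Only the rows of
    the transient states TN and FN are used; FP and TP are absorbing.
    """
    # Coefficients of (I - Q), where Q is the transient block over (TN, FN).
    a11 = 1 - P["TN"]["TN"]
    a12 = -P["TN"]["FN"]
    a21 = -P["FN"]["TN"]
    a22 = 1 - P["FN"]["FN"]
    # Probability of absorption in FP: (I - Q) x = one-step jumps into FP.
    p_rfa, _ = solve_2x2(a11, a12, a21, a22, P["TN"]["FP"], P["FN"]["FP"])
    # Mean number of steps to absorption: (I - Q) t = 1.
    steps, _ = solve_2x2(a11, a12, a21, a22, 1.0, 1.0)
    return p_rfa, step_s * steps
```

With a matrix in which TN loops with probability 0.5 and jumps to FP or TP with probability 0.25 each, the sketch returns pRFA = 0.5 and a mean of two steps, that is 0.8 s for T = 0.4 s.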

Parameterization by Diagnosis Metrics

• Underdetermined problem solved by heuristics (MI)
  - Minimize pFP,TN and pTP,FN
  - Minimize direct transitions TN-FP and FN-TP
• Behaviour in transient analysis:
  - Initial study parameters: T = 0.4 s, mean normal period = 12.42 s, mean fault period = 15 s
  - Captures an initial higher probability of pRTA over all alarms (pRTA + pRFA)

[Figure: pRFA, pRTA and the ratio pRTA / (pRFA + pRTA); model transitions marked "minimize"]
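The study parameters can be translated into the per-step probabilities {pf, pr} of the ON-OFF process. The sketch below assumes geometrically distributed sojourn times with a mean of 1/p steps, so that p = T / (mean period); the slides do not state this conversion explicitly, and the function name is hypothetical.

```python
def per_step_probability(mean_period_s, step_s):
    """Per-step transition probability of a geometric sojourn time.

    Assumes the mean sojourn is 1/p steps of length step_s seconds.
    """
    return step_s / mean_period_s


T = 0.4                                 # diagnosis period in seconds
pf = per_step_probability(12.42, T)     # fault occurrence, roughly 0.0322 per step
pr = per_step_probability(15.0, T)      # repair, roughly 0.0267 per step
```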

Case: Time Constrained Data Transfer – Background

• QoS requirement: complete SCTP-based file transfer within tdeadline seconds with the probability Ω
• Fault: congestion in operator infrastructure (occurrence and repair, ON-OFF model)
• Remediation: single fail-over from network A to network B
• Diagnosis: simple threshold based on RTT and α-count
• Decision: fail-over on network fault state diagnosis

Case: Time Constrained Data Transfer – Policy Evaluation Model

• Policy Evaluation Discrete Time Markov Model (PEDTMC) [3]
  - State space: SPE = {Active network, Time progress, File progress, Network state, Diagnosis state}
  - Ωmodel = Σ SPEss(r, n)

[Figure: File Transfer Completion Time CDF; annotations: r = 1, m]

Model Sensitivity Analysis

• Model-based sensitivity analysis on Ω
  - Vary µRRT and pRFA, tdeadline = 30 s and file size = 10 MByte
  - Compare to perfect diagnosis and no-fail-over policy
• Both metrics have a clear impact on Ω: µRRT -> promptness and pRFA -> correctness
  - Most sensitive to high pRFA: a wrong fail-over cannot be remediated
  - Can deliver significantly worse performance than no fail-over

[Figure: Ω under varying µRRT and pRFA; reference levels: Perfect Diagnosis, No fail-over]

Reliability Evaluation Results – Background & Trade-off Results

• Study properties of α-count diagnosis component
  - α-count controlled by two parameters: k (forgetting factor), αT (threshold)
  - PEDTMC model-based analysis
  - Simulation-based analysis
    · System-level simulation based on ns-2
    · Provide evaluation of Ω and traces of diagnosis performance
• Consider two settings of one-shot diagnosis:
  γ0 = (TPR, TNR) = (0.983, 0.097)
  γ1 = (TPR, TNR) = (0.953, 0.225)
• Trade-off options of α-count (obtained from single trace set, 2000 runs)
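For orientation, the α-count heuristic of Bondavalli et al. [1] can be sketched as follows. The sketch assumes the common single-threshold form: the score grows by 1 on each one-shot fault estimate, is multiplied by the forgetting factor k otherwise, and an alarm is raised once the score reaches αT. The function name and the 0/1 encoding of one-shot estimates are hypothetical.

```python
def alpha_count_first_alarm(estimates, k, alpha_t):
    """Return the index of the first step at which alpha >= alpha_t, or None.

    estimates: iterable of one-shot estimates (1 = fault, 0 = normal).
    k: forgetting factor in [0, 1]; alpha_t: alarm threshold.
    """
    alpha = 0.0
    for index, estimate in enumerate(estimates):
        if estimate:
            alpha += 1.0   # a fault estimate increases the score
        else:
            alpha *= k     # a normal estimate lets the score decay
        if alpha >= alpha_t:
            return index
    return None
```

With k = 0.92 and αT = 2.5, three consecutive fault estimates raise an alarm on the third step, whereas isolated fault estimates decay before reaching the threshold; this is how the filter trades reaction time against false alarms.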

Reliability Evaluation Results – Background & Trade-off Results

• PEDTMC model-based analysis
  - Simple threshold
    · γ0 performs better than γ1 (as shown in [3])
  - α-count
    · Overall leads to improvement: filtering out false alarms
    · Optimal settings exist
    · γ1: k = 0.92, αT = 2.5 leads to best results
    · Obtainable reduction of pRFA without similar increase in µRRT
• Simulation-based analysis
  - Consistent conclusions to model
  - Qualitative differences
    · Stochastic time model
    · Simplified data-transfer model

[Figure: Ωsimulation and Ωmodel versus threshold αT, with the simple threshold as reference]

Conclusion & Outlook

• Conclusions
  - Proposed parsimonious imperfect diagnosis model for light-weight assessment of best diagnosis component settings; also considering complex class of over-time diagnosis components
  - Defined representative imperfect diagnosis performance metrics and derived their closed-form equations in the model
  - Presented service reliability case and performed model-based sensitivity analysis of reliability on imperfect diagnosis performance metrics
  - Used model to assess diagnosis performance properties of over-time diagnosis heuristic from literature and define best setting
  - Shown by system-level simulation analysis that diagnosis model can capture essential imperfect diagnosis performance characteristics
• Outlook
  - Introduce more complex decision policies
    · Application state information -> minimize remediation
    · Multiple fault diagnosis
    · Decisions to collect more information
  - Need to study diagnosis model behaviour after positive diagnosis and potentially extend

Questions & Discussion

References

[1] Threshold-based mechanisms to discriminate transient from intermittent faults. A. Bondavalli, S. Chiaradonna, F. Di Giandomenico, and F. Grandoni, IEEE Transactions on Computers, vol. 49, no. 3, pp. 230–245, 2000.

[2] Probabilistic Fault-Diagnosis in Mobile Networks Using Cross-Layer Observations. A. Nickelsen, J. Grønbæk, T. Renier, and H.-P. Schwefel, in Proceedings of AINA 09, pp. 225–232, 2009.

[3] Model based evaluation of policies for end-node driven fault recovery. J. Grønbæk, H.-P. Schwefel, and T. Toftegaard, Proc. DRCN 09, 2009.

[4] Towards self-adaptive reliable network services in highly-uncertain environments. A. Ceccarelli, J. Grønbæk, L. Montecchi, A. Bondavalli, and H. P. Schwefel, To appear in proceedings of WORNUS 10, May, 2010.

[5] Hidden Markov Models as a Support for Diagnosis: Formalization of the Problem and Synthesis of the Solution. A. Daidone, F. Di Giandomenico, S. Chiaradonna, and A. Bondavalli, in 25th IEEE Symposium on Reliable Distributed Systems, 2006. SRDS’06, 2006, pp. 245–256.

[6] Queueing Theory – A Linear Algebraic Approach. L. Lipsky, 2nd ed. Springer, 2009.
