ENABLING AND (IN MY HUMBLE OPINION) ESSENTIAL TECHNOLOGIES FOR DEPENDABILITY BENCHMARKING
Project funded in part by the NSF Next Generation Software Program, the NSF Information Technology Research Program, and the Motorola Center for High-Availability System Validation
http://www.crhc.uiuc.edu/PERFORM
39th IFIP Working Group Meeting, March 1, 2001
William H. Sanders
University of Illinois at Urbana-Champaign
Dependability Benchmarking (from Henrique Madeira)
Dependability benchmark (a working definition)
A test (or set of tests) to assess measures related to the behavior of a computer system in the presence of faults (e.g., failure modes, error detection coverage, error latency, diagnosis efficiency, recovery time, recovery losses, etc.), supporting the evaluation of dependability attributes (reliability, availability, safety).
[Figure labels: Workload; Faultload; System under test; Measurements; Direct Dependability Benchmark Measures; Other parameters; Models; Dependability Attributes]
Another View and Justification for a Combined Approach (adapted from slide by J. Arlat)
[Figure labels: Specification of Proper Service; Specification of Faults/Errors to Inject; Specification of Activity Demanded of System; System Under Test; Fault Injection of Prototype; Modeling]
• Conditional Fault-Tolerance Measures. Examples: Coverage, Time to Recovery
• Unconditional Dependability Measures. Examples: Availability, Reliability, Performability
What Technologies are Applicable?
Possible Benchmarking Technologies
• Modeling
  – Simulation (Fault Injection on Simulated System)
    • Continuous State
    • Discrete Event (state): Sequential, Parallel
  – Analysis/Numerical
    • Deterministic
    • Non-Deterministic
      – Probabilistic: State-space-based, Non-State-space-based (Combinatorial)
      – Non-Probabilistic
• Measurement
  – Passive (no fault injection)
  – Active (Fault Injection on Prototype)
    • Hardware-Implemented: Without Contact, With Contact
    • Software-Implemented: Stand-alone Systems, Networks/Distributed Systems
Comments (Claims)
• If reliability, availability, and safety are to be evaluated, combined measurement (including fault injection) and modeling techniques must be employed.
• Many measurement- and modeling-based techniques exist, but work still remains to make them applicable to large-scale, distributed systems.
• Effective techniques themselves are not sufficient -- common and agreed-upon ways to apply these technologies (which I call test cases) must be defined
– System-neutral representations of workloads and faultloads are needed
– Methods must be developed to inject these workloads and faultloads that can be applied across multiple systems
– System-neutral model generation methods must be developed to translate conditional into unconditional measures
• Development of dependability benchmarks is extremely difficult!
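As a rough illustration of translating a conditional measure into an unconditional one, the sketch below plugs a coverage value (as might be estimated by fault injection) into a small Markov model and returns steady-state availability. The three-state duplex model, the rates, and the function name are assumptions made for this example; none of them come from the talk.

```python
def duplex_availability(failure_rate, repair_rate, coverage):
    """Steady-state availability of a hypothetical duplex system.

    States: 2 units up, 1 unit up, system down.
    - From "2 up": a unit fails at rate 2*failure_rate; with probability
      `coverage` the failure is handled (go to "1 up"), otherwise the
      whole system goes down.
    - From "1 up": failure leads to "down", repair leads back to "2 up".
    - From "down": repair leads back to "2 up".
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be in [0, 1]")
    if repair_rate <= 0.0 or failure_rate < 0.0:
        raise ValueError("need repair_rate > 0 and failure_rate >= 0")
    lam, mu, c = failure_rate, repair_rate, coverage
    # Unnormalized steady-state probabilities from the balance equations,
    # expressed relative to the "2 up" state.
    p2 = 1.0
    p1 = p2 * 2.0 * lam * c / (lam + mu)
    p0 = (p2 * 2.0 * lam * (1.0 - c) + p1 * lam) / mu
    return (p2 + p1) / (p2 + p1 + p0)


# Illustrative (made-up) parameters: same rates, two coverage values.
HIGH = duplex_availability(1e-3, 1.0, 0.99)
LOW = duplex_availability(1e-3, 1.0, 0.90)
```

With the same failure and repair rates, the higher coverage value yields the higher availability, which is the sense in which the measured conditional quantity feeds the modeled unconditional one.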
Current Projects on Fault Injection and Modeling (www.crhc.uiuc.edu/PERFORM)
Loki - Fault injection based on a measure-driven partial global view of system state, time, and previously injected faults
Distributed system fault injector that permits fault injections and measure collections based on a partial global view of system state, yielding statistically sound estimates of distributed system dependability
Möbius - Multi-faceted performance/dependability validation framework
Infrastructure for building domain independent/specific performance dependability analysis tools which support multiple model specification, composition, connection, and solution methods
Loki: Experimental Validation of Distributed Systems
[Figure: nodes N1 to N4 on LAN1 and LAN2; each node runs a Loki Runtime that communicates with its System Under Test via probes]
• Targeted injection of faults in particular system states, defined by the state of multiple system components
• Statistically significant interpretation of experiment results, yielding coverage, recovery times, and performance-related measures
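One common way to attach statistical meaning to a coverage estimate is a confidence interval on a binomial proportion. The sketch below uses the Wilson score interval; it is a generic technique, not necessarily the analysis Loki performs, and the function name and the counts in the usage line are hypothetical.

```python
import math


def coverage_interval(detected, injected, z=1.96):
    """Point estimate and Wilson score interval for a coverage proportion.

    detected: number of injected faults that were handled correctly
    injected: total number of faults injected
    z:        normal quantile (1.96 gives roughly a 95% interval)
    """
    if injected <= 0 or not 0 <= detected <= injected:
        raise ValueError("need 0 <= detected <= injected and injected > 0")
    p = detected / injected
    denom = 1.0 + z * z / injected
    center = (p + z * z / (2.0 * injected)) / denom
    half = z * math.sqrt(p * (1.0 - p) / injected
                         + z * z / (4.0 * injected ** 2)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)


# Hypothetical experiment counts, for illustration only.
estimate, lower, upper = coverage_interval(95, 100)
```

The interval narrows as more injections are performed, which gives a simple stopping criterion for an experiment campaign.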
Loki Concepts, Architecture and Data Flow
[Figure: each node runs a Loki Runtime attached to its System Under Test; the runtimes are connected over LAN1 and LAN2. Labels inside a runtime: Probe, State Machine, State Machine Transport, Fault Parser, Recorder. Arrow labels: Notifications, Inject Fault.]
Data flow: Local Timelines -> Offline Clock Synchronization (uses timestamp data collected before and after the experiment) -> Single Global Timeline -> Measure Analysis -> Measures
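The offline clock synchronization step can be pictured with a very simple model. The sketch below assumes each local clock is a linear function of a reference clock and fits that line through synchronization samples taken before and after the experiment. The linear model and the function names are assumptions made for illustration; the slides do not describe Loki's actual algorithm.

```python
from statistics import mean


def fit_clock(before, after):
    """Fit local = drift * reference + offset.

    before, after: lists of (reference_time, local_time) samples collected
    before and after the experiment (hypothetical input format).
    """
    ref0 = mean(r for r, _ in before)
    loc0 = mean(l for _, l in before)
    ref1 = mean(r for r, _ in after)
    loc1 = mean(l for _, l in after)
    if ref1 == ref0:
        raise ValueError("before and after samples must be separated in time")
    drift = (loc1 - loc0) / (ref1 - ref0)
    offset = loc0 - drift * ref0
    return drift, offset


def to_reference(local_time, drift, offset):
    """Map a local timestamp onto the reference (global) timeline."""
    return (local_time - offset) / drift


# Illustrative samples: a local clock running twice as fast, offset by 10.
DRIFT, OFFSET = fit_clock([(0.0, 10.0)], [(10.0, 30.0)])
```

Applying `to_reference` to every event in a node's local timeline places all nodes' events on one timeline, which is the role of the "Single Global Timeline" in the data flow.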
Loki State Machine Specification
[Figure: State Machine for Each Node. States: INIT, ELECT, FOLLOW, LEAD, EXIT, CRASH. Transition labels: INIT_DONE, LEADER, FOLLOWER, ERROR, CRASH, EXIT.]
Notify Lists for each of the State Machines

STATE    sylvester             persian                heathcliff
INIT     heathcliff            heathcliff             sylvester
ELECT    persian, heathcliff   sylvester, heathcliff  persian, sylvester
FOLLOW   -                     -                      -
LEAD     -                     -                      -
CRASH    -                     -                      -
EXIT     persian, heathcliff   sylvester, heathcliff  persian, sylvester
Loki Fault Specification
Node name    Fault name   Boolean expression                                         Freq.
sylvester    sfault1      ((sylvester:INIT)&(heathcliff:INIT))                       once
sylvester    sfault2      ((sylvester:ELECT)&(heathcliff:ELECT)&(persian:ELECT))     once
heathcliff   hfault1      ((sylvester:INIT)&(heathcliff:INIT)&(persian:INIT))        once
heathcliff   hfault2      ((sylvester:ELECT)&(heathcliff:ELECT))                     once
persian      pfault1      ((sylvester:ELECT)&(heathcliff:ELECT))                     once
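A fault specification of this kind can be read as a trigger evaluated against a node's partial view of global state. The sketch below parses the `(node:STATE)` terms joined by `&` that appear in the specification and fires a fault the first time its expression holds. The parser, the class name, and the reading of "once" as "fire on the first match only" are assumptions for illustration; this is not Loki's fault parser.

```python
import re

# Tokens: parentheses, '&', and node:STATE terms as they appear in the spec.
TOKEN = re.compile(r"\s*(\(|\)|&|[A-Za-z_]\w*:[A-Za-z_]\w*)")


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"unexpected text: {expr[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expr, view):
    """Evaluate a conjunction of (node:STATE) terms.

    view maps node name -> last known state; nodes missing from the
    (partial) view never satisfy a term.
    """
    tokens = tokenize(expr)
    pos = 0

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            value = parse_and()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return value
        if ":" in tok:
            node, state = tok.split(":")
            return view.get(node) == state
        raise ValueError(f"unexpected token {tok!r}")

    result = parse_and()
    if pos != len(tokens):
        raise ValueError("trailing tokens in expression")
    return result


class FaultTrigger:
    """Fires at most once: the first time its expression holds."""

    def __init__(self, name, expr):
        self.name, self.expr, self.fired = name, expr, False

    def update(self, view):
        if not self.fired and evaluate(self.expr, view):
            self.fired = True
            return True
        return False


SFAULT2 = "((sylvester:ELECT)&(heathcliff:ELECT)&(persian:ELECT))"
```

Each node only needs the states named in its own expressions, which is why the notify lists differ per node and per state.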
Output from Global Event Timeline Calculation
persian 0 l INIT_LEAF 48186103765 0.000000
persian 0 h INIT_LEAF 48186103842 0.000077
sylvester 0 l INIT_LEAF 48186246754 0.142989
sylvester 0 h INIT_LEAF 48186246782 0.143017
heathcliff 0 l INIT_LEAF 48186795431 0.691666
heathcliff 0 h INIT_LEAF 48186795431 0.691666
heathcliff 0 l ELECT 48421412646 235.308881
heathcliff 0 h ELECT 48421412646 235.308881
persian 0 l ELECT 48656277005 470.173240
persian 0 h ELECT 48656277500 470.173735
sylvester 0 l ELECT 48656278777 470.175012
sylvester 0 h ELECT 48656279316 470.175551
sylvester 1 l sfault2 48658583337 472.479572
sylvester 1 h sfault2 48658583878 472.480113
heathcliff 0 l FOLLOW 48659144990 473.041225
heathcliff 0 h FOLLOW 48659144990 473.041225
heathcliff 0 l EXIT 48659155725 473.051960
heathcliff 0 h EXIT 48659155725 473.051960
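A global timeline like this one can be checked mechanically to decide whether a fault was injected while its trigger condition truly held. The sketch below assumes the columns are node, record kind (0 for a state entry, 1 for a fault injection), a bound tag (`l` and `h` taken as lower and upper bounds on the global time), event name, raw timestamp, and time relative to the first event. That column reading is an inference from the data shown, not something the slides state.

```python
TIMELINE = """\
persian 0 l INIT_LEAF 48186103765 0.000000
persian 0 h INIT_LEAF 48186103842 0.000077
sylvester 0 l INIT_LEAF 48186246754 0.142989
sylvester 0 h INIT_LEAF 48186246782 0.143017
heathcliff 0 l INIT_LEAF 48186795431 0.691666
heathcliff 0 h INIT_LEAF 48186795431 0.691666
heathcliff 0 l ELECT 48421412646 235.308881
heathcliff 0 h ELECT 48421412646 235.308881
persian 0 l ELECT 48656277005 470.173240
persian 0 h ELECT 48656277500 470.173735
sylvester 0 l ELECT 48656278777 470.175012
sylvester 0 h ELECT 48656279316 470.175551
sylvester 1 l sfault2 48658583337 472.479572
sylvester 1 h sfault2 48658583878 472.480113
heathcliff 0 l FOLLOW 48659144990 473.041225
heathcliff 0 h FOLLOW 48659144990 473.041225
heathcliff 0 l EXIT 48659155725 473.051960
heathcliff 0 h EXIT 48659155725 473.051960
"""


def parse_timeline(text):
    """Return (node, kind, name, low, high) tuples from consecutive l/h pairs."""
    rows = [line.split() for line in text.strip().splitlines()]
    events = []
    for lo, hi in zip(rows[0::2], rows[1::2]):
        same_event = lo[:2] == hi[:2] and lo[3] == hi[3]
        if not same_event or (lo[2], hi[2]) != ("l", "h"):
            raise ValueError(f"expected an l/h pair, got {lo} and {hi}")
        events.append((lo[0], int(lo[1]), lo[3], float(lo[5]), float(hi[5])))
    return events


def state_during(events, node, t_low, t_high):
    """State that `node` certainly occupies over [t_low, t_high], else None."""
    current = None
    for n, kind, name, low, high in events:
        if n != node or kind != 0:
            continue
        if high <= t_low:
            current = name      # entered for certain before the interval
        elif low <= t_high:
            return None         # a state change may overlap the interval
    return current


def injection_correct(events, fault, required):
    """True if every node in `required` was provably in its required state."""
    for _node, kind, name, low, high in events:
        if kind == 1 and name == fault:
            return all(state_during(events, node, low, high) == state
                       for node, state in required.items())
    raise KeyError(fault)


EVENTS = parse_timeline(TIMELINE)
ALL_ELECT = {"sylvester": "ELECT", "heathcliff": "ELECT", "persian": "ELECT"}
```

Under these assumptions, `injection_correct(EVENTS, "sfault2", ALL_ELECT)` reports that sfault2 was injected while all three nodes were in ELECT, since heathcliff does not move to FOLLOW until after the injection interval.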
Results
• Correct Fault Injection Probability as a Function of the Time Spent in the ELECT State (Standard Linux Kernel - 10 ms Timeslice)
[Plots: frequency of injections (axis 0 to 0.14) and probability of proper injection (axis 0 to 1) versus time in ELECT state (10 to 30 ms)]
• Correct Fault Injection Probability as a Function of the Time Spent in the ELECT State (Linux Kernel - 1 ms Timeslice)
[Plots: frequency of injections (axis 0 to 0.07) and probability of proper injection (axis 0 to 1) versus time in ELECT state (0 to 20 ms)]
Loki Graphical Interface
Möbius Project Research Goal
• Development of tools to predict the performance, dependability, and performability of distributed computing/communication systems
– Such systems are complex combinations of:
• Computing hardware
• Networks
• Operating systems
• Software
(Note: goal is not to prove logical system properties, although this may be possible within framework)
• We believe such tools can be realized by:
– Developing a framework/tool that supports multiple modeling formalisms, at multiple levels of detail and abstraction, and multiple model solution methods
– Developing new model representation and solution methods (within the framework, and implemented in the tool) that scale well with increasing system complexity
Integrated Modeling Frameworks are Needed!
• No single formalism is best for representing all parts of a distributed computing/communication system
– Computer hardware, networks, protocols, and applications each call for a different representation
– Even within a “class” of application, different industry segments use very different ways of representing a particular design
• No single solution method is adequate to solve all models
– Discrete-event simulation is efficient in many cases, but is extremely slow in others (e.g., significant but rare events, like faults and buffer overflows, or extreme system complexity)
• Research in new modeling methods and tools is significantly hampered by the close link between model specification and model solution methods, and the closed nature of existing tools
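The rare-event remark can be made concrete with a standard sample-size calculation. If a probability p is estimated from n independent replications, the normal-approximation confidence interval has relative half-width z * sqrt((1 - p) / (n * p)), so the number of replications needed grows roughly as 1/p. The function name below is an illustrative choice, and the calculation covers plain Monte Carlo estimation only.

```python
import math


def required_replications(p, rel_half_width, z=1.96):
    """Replications needed so the confidence interval half-width is
    rel_half_width * p, using the normal approximation to the binomial."""
    if not 0.0 < p < 1.0 or rel_half_width <= 0.0:
        raise ValueError("need 0 < p < 1 and rel_half_width > 0")
    return math.ceil(z * z * (1.0 - p) / (p * rel_half_width ** 2))


# Same 10% relative precision, a common event versus a rare one.
COMMON = required_replications(0.5, 0.1)
RARE = required_replications(1e-6, 0.1)
```

For an event with probability around one in a million, the same relative precision calls for several hundred million replications, which is the kind of case where analytic or numerical solution methods become attractive.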
Modelers Need Heterogeneous Models
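[Figure: a Computer System made up of Hardware, Network, Application, and OS. Aspects to be modeled: Fault Description, Components, Protocol, Traffic, Control/Data Flow, Resource Contention. Candidate formalisms: VHDL, Fault Tree, LOTOS, Estelle, Queuing Model, Block Diagram Language, Stochastic Petri Nets, SANs. How are they combined (?)]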
Möbius Framework
• Model: An abstract representation of some system
• Formalism: A modeling language
• Framework: A “language” in which modeling languages may be expressed
[Figure: several Formalisms, each with a Model, connect through the Möbius Framework to several Solvers]
The Möbius Framework ...
• Expresses most existing modeling languages (except some simulation languages)
• Retains the ability for efficient solution
• Facilitates heterogeneous modeling
• Is a vehicle for researching new model composition, connection, reward specification, and solution methods
Möbius Framework Components
[Figure labels: Atomic Model; Composed Model; Solvable Model; Connected Model; Solved Model; State Variables; Actions; Properties; Reward Variables; Model Composition; Model Connection; Möbius Execution Policy; Flexible Execution Policy; Well-Specified Checker; Solver; Results]
• The abstract functional interface allows models to affect each other and be acted on by solvers without understanding model semantics
• A project manager maintains consistency when constructing new models, performance/dependability variables, and studies from existing models
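The idea of a solver acting on a model purely through an abstract interface can be sketched as follows. The interface here (state, enabled actions with rates, fire) and all class and function names are hypothetical stand-ins chosen for illustration; they are not the Möbius abstract functional interface itself.

```python
import random
from abc import ABC, abstractmethod


class Model(ABC):
    """Hypothetical minimal interface between models and solvers."""

    @abstractmethod
    def state(self):
        """Return the current values of the state variables."""

    @abstractmethod
    def enabled_actions(self):
        """Return a list of (action_name, rate) pairs enabled now."""

    @abstractmethod
    def fire(self, action_name):
        """Apply the state change of the named action."""


class RepairableUnit(Model):
    """One formalism-specific model: a unit that fails and is repaired."""

    def __init__(self, fail_rate, repair_rate):
        self.fail_rate, self.repair_rate, self.up = fail_rate, repair_rate, True

    def state(self):
        return {"up": self.up}

    def enabled_actions(self):
        if self.up:
            return [("fail", self.fail_rate)]
        return [("repair", self.repair_rate)]

    def fire(self, action_name):
        self.up = action_name == "repair"


def time_fraction(model, predicate, horizon, rng):
    """Generic simulation solver: fraction of [0, horizon] in which
    predicate(state) holds. It uses only the Model interface."""
    t, acc = 0.0, 0.0
    while True:
        actions = model.enabled_actions()
        total = sum(rate for _, rate in actions)
        remaining = horizon - t
        dwell = rng.expovariate(total) if total > 0 else remaining
        finished = dwell >= remaining
        if finished:
            dwell = remaining
        if predicate(model.state()):
            acc += dwell
        if finished:
            return acc / horizon
        t += dwell
        names = [name for name, _ in actions]
        rates = [rate for _, rate in actions]
        model.fire(rng.choices(names, weights=rates)[0])


AVAILABILITY = time_fraction(RepairableUnit(1.0, 9.0),
                             lambda s: s["up"], 10000.0, random.Random(1))
```

Because `time_fraction` never inspects how `RepairableUnit` is built, a model written in a different formalism could be passed to the same solver as long as it implements the three methods.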
Abstract Functional Interface Facilitates Interaction of Models and Solution Engines
[Figure labels: Model Specification, Composition, Reward Definition, and Connection Formalisms; Simulation, State-Space Generation, and Analytic/Numerical Solvers; Abstract Functional Interface (AFI) Interactions; Model and Solver Editor Interactions; Project Manager]
Möbius Tool Architecture
[Figure labels: editors (Model Editor, Composer, PV Editor, Study Editor, Model Connector); generated object code (Submodel Object Code, Composed Model Object Code, PV Object Code, Study Editor Object Code, Model Conn. Object Code); Formalism Libraries; Solver Libraries; Linker; Model Solution]
Model Editor Interaction
Graphical User Interfaces
For More Information
www.crhc.uiuc.edu/PERFORM
Made possible by the dedicated work of:
Amy Christenson, Ramesh Chandra, Graham Clark, Tod Courtney, Michel Cukier, Salam Derisavi, David Daly, Dan Deavours, Jay Doyle, Kaustaubh Joshi, G. P. Kavanaugh, Ryan Lefever, John Sowder, Aaron Stillman, Patrick Webster, and Alex Williamson