Upload
marshall-benson
View
213
Download
0
Tags:
Embed Size (px)
Citation preview
Runtime Verification and Validation
for
Multi-Core Based On-Board Computing
Hans P. Zima and Mark L. James
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA{zima,mjames}@jpl.nasa.gov
High Performance Embedded Computing (HPEC)Workshop
MIT Lincoln Laboratory, September 23rd, 2009
1.1. IntroductionIntroduction
2.2. An Introspection Framework for Fault ToleranceAn Introspection Framework for Fault Tolerance
3.3. V&V versus IntrospectionV&V versus Introspection
4.4. Automatic Generation of Fault-Tolerant CodeAutomatic Generation of Fault-Tolerant Code
5.5. Concluding RemarksConcluding Remarks
1.1. IntroductionIntroduction
2.2. An Introspection Framework for Fault ToleranceAn Introspection Framework for Fault Tolerance
3.3. V&V versus IntrospectionV&V versus Introspection
4.4. Automatic Generation of Fault-Tolerant CodeAutomatic Generation of Fault-Tolerant Code
5.5. Concluding RemarksConcluding Remarks
Contents
The work described in this presentation has been funded by the JPL Research and Technology (R&TD) project 01STCR R.08.023.015 “Introspection Framework for Fault Tolerance in Support of Autonomous Space Systems”
Neptune Triton Explorer
Europa Astrobiology Laboratory
Titan Explorer Europa
Mars Sample Return
Explorer
NASA/JPL: Potential Future MissionsArtist ConceptArtist Concept
New Requirements
Future missions and the limited downlink toFuture missions and the limited downlink to
Earth lead to two major new requirements:Earth lead to two major new requirements:
1. Autonomy
2. High-Capability On-Board Computing
1. Autonomy
2. High-Capability On-Board Computing Missions will require on-board computational power ranging from tens of Gigaflops to Teraflops.
Emerging multi-core technology is expected to provide this capability
Missions will require on-board computational power ranging from tens of Gigaflops to Teraflops.
Emerging multi-core technology is expected to provide this capability
Future Multi-Core Architectures:From 10s to 100s of Processors on a Chip
Tile64 (Tilera Corporation, 2007)Tile64 (Tilera Corporation, 2007) 64 identical cores, arranged in an 8X8 grid
iMesh on-chip network, 27 Tb/sec bandwidth
170-300mW per core; 600 MHz – 1 GHz
192 GOPS (32 bit)—about 10 GOPs/Watt
Maestro (2010/11)Maestro (2010/11) RHBD 7x7 grid SW-compatible version of Tile64 with FP
270mW per core; 480 MHz; 70 GOPs; max 28W
Kilocore 1025Kilocore 1025 (Rapport Inc. and IBM, 2008) (Rapport Inc. and IBM, 2008) Power PC and 1024 8-bit processing elements (125MHz)
32X32 “stripes” dedicated to different tasks
80-core research chip from Intel (2011)80-core research chip from Intel (2011) 2D on-chip mesh network for message passing
1.01 TF (3.16 GHz); 62W power—16 GOPS/Watt
Tile64 (Tilera Corporation, 2007)Tile64 (Tilera Corporation, 2007) 64 identical cores, arranged in an 8X8 grid
iMesh on-chip network, 27 Tb/sec bandwidth
170-300mW per core; 600 MHz – 1 GHz
192 GOPS (32 bit)—about 10 GOPs/Watt
Maestro (2010/11)Maestro (2010/11) RHBD 7x7 grid SW-compatible version of Tile64 with FP
270mW per core; 480 MHz; 70 GOPs; max 28W
Kilocore 1025Kilocore 1025 (Rapport Inc. and IBM, 2008) (Rapport Inc. and IBM, 2008) Power PC and 1024 8-bit processing elements (125MHz)
32X32 “stripes” dedicated to different tasks
80-core research chip from Intel (2011)80-core research chip from Intel (2011) 2D on-chip mesh network for message passing
1.01 TF (3.16 GHz); 62W power—16 GOPS/Watt
The issue of dependability, and in particular fault tolerance, has to be addressed in this new contextThe issue of dependability, and in particular fault tolerance, has to be addressed in this new context
Dependability*
*A. Avizienis, J.-C.Laprie, B.Randell: Fundamental Concepts of Dependability. UCLA CSD Report 010028, 2000
Means
Dependability Attributes
Threats
Faults
Errors
Failures
Availability
Reliability
Safety
Integrity
Maintainability
Fault Prevention
Fault Removal
Fault Tolerance
The ability of a computing system to deliver service that can be justifiably trusted
(Fault Protection)
Threats: the Fault-Error-Failure Chain
Fault
defect in a system
Error Error
invalid system state
Failure
violation of system specification
activation
propagation propagation to service boundary
external fault(caused by external failure)
System Boundary (Service Interface)
Means (Fault Protection)
Fault Prevention: Fault Prevention: via via quality control quality control during design and manufacturing during design and manufacturing of hardware and softwareof hardware and software structured programming, modularization, information hiding; firewalls
shielding and radiation hardening
…
Fault Removal: Fault Removal: Verification and Validation (V&V),model checking, etc.Verification and Validation (V&V),model checking, etc.
Fault Tolerance: the ability to preserve the delivery of Fault Tolerance: the ability to preserve the delivery of correct service (system specification) in the presence of correct service (system specification) in the presence of active faultsactive faults error detection
recovery: error handling and fault handling
fault masking: redundancy-based recovery without explicit error detection (e.g., TMR)
Fault Prevention: Fault Prevention: via via quality control quality control during design and manufacturing during design and manufacturing of hardware and softwareof hardware and software structured programming, modularization, information hiding; firewalls
shielding and radiation hardening
…
Fault Removal: Fault Removal: Verification and Validation (V&V),model checking, etc.Verification and Validation (V&V),model checking, etc.
Fault Tolerance: the ability to preserve the delivery of Fault Tolerance: the ability to preserve the delivery of correct service (system specification) in the presence of correct service (system specification) in the presence of active faultsactive faults error detection
recovery: error handling and fault handling
fault masking: redundancy-based recovery without explicit error detection (e.g., TMR)
Fault Tolerance
Fault Error Failureactivation
propagation to service boundary
fault tolerance
well-definedsystem state
Xelimination of detected errors
prevention of fault activations
1.1. IntroductionIntroduction
2.2. An Introspection Framework for Fault ToleranceAn Introspection Framework for Fault Tolerance
3.3. V&V versus IntrospectionV&V versus Introspection
4.4. Automatic Generation of Fault-Tolerant CodeAutomatic Generation of Fault-Tolerant Code
5.5. Concluding RemarksConcluding Remarks
1.1. IntroductionIntroduction
2.2. An Introspection Framework for Fault ToleranceAn Introspection Framework for Fault Tolerance
3.3. V&V versus IntrospectionV&V versus Introspection
4.4. Automatic Generation of Fault-Tolerant CodeAutomatic Generation of Fault-Tolerant Code
5.5. Concluding RemarksConcluding Remarks
Contents
Introspection…Introspection… provides provides dynamicdynamic monitoring, analysis, and feedback, monitoring, analysis, and feedback,
enabling system to become self-aware and context-aware: enabling system to become self-aware and context-aware: monitoring execution behavior
reasoning about its internal state
changing the system or system state when necessary
exploits adaptively the threads available in a multi-core exploits adaptively the threads available in a multi-core systemsystem
can be applied to a range of different scenarios, including:can be applied to a range of different scenarios, including: fault tolerance
performance tuning
power management
behavior analysis
Introspection…Introspection… provides provides dynamicdynamic monitoring, analysis, and feedback, monitoring, analysis, and feedback,
enabling system to become self-aware and context-aware: enabling system to become self-aware and context-aware: monitoring execution behavior
reasoning about its internal state
changing the system or system state when necessary
exploits adaptively the threads available in a multi-core exploits adaptively the threads available in a multi-core systemsystem
can be applied to a range of different scenarios, including:can be applied to a range of different scenarios, including: fault tolerance
performance tuning
power management
behavior analysis
A Framework for Introspection
An Introspection Module
Application
Introspection System sensors
actuators
.
.
.
.
.
.
Inference Engine(SHINE)
Monitoring
Analysis
Recovery
Prognostics
KnowledgeBase
System Knowledge
Application Knowledge
Domain Knowledge
…
Example: SHINE Diagnostics and Prognostics Architecture for DSN Health Management
BEAM Frame Generator
Time Variant Slot Filler
FDIR Front-End, DSN Processor
Stability State Estimate
Stability Limit Estimate
Inter-Channel Relationships
Model-based State Diagnosis
Fault Identification
Prognostics
State Summarizer
KB KB
KB
Stability Limit Checker
GUI GUI
Eng
lish-
Bas
ed
Dia
gnos
is
Blo
ck
Dia
gram
s
Planner GUIF
ailu
re
Pre
dict
ions
KB
GUI
Cha
nnel
Dia
gnos
is
GUI
BEAM Specific Displays
System-Wide Health Assessment(SHINE)
Diagnostics(SHINE)
Planning
Reports &
LogsFiles
Complex, Station and Subsystem Level Health Summaries
Station Monitor Data
FDIR Data Handler
Channel Level Diagnostics
DSSC Health Assessor (Heartbeats)
Failing Channel Contexts
Prognostics
GUI
1.1. IntroductionIntroduction
2.2. An Introspection Framework for Fault ToleranceAn Introspection Framework for Fault Tolerance
3.3. V&V versus IntrospectionV&V versus Introspection
4.4. Automatic Generation of Fault-Tolerant CodeAutomatic Generation of Fault-Tolerant Code
5.5. Concluding RemarksConcluding Remarks
1.1. IntroductionIntroduction
2.2. An Introspection Framework for Fault ToleranceAn Introspection Framework for Fault Tolerance
3.3. V&V versus IntrospectionV&V versus Introspection
4.4. Automatic Generation of Fault-Tolerant CodeAutomatic Generation of Fault-Tolerant Code
5.5. Concluding RemarksConcluding Remarks
Contents
Traditional V&V versus Introspection Key Differences
Verification and Validation (V&V) Verification and Validation (V&V) is applied before actual program execution
to static source program in the context of test executions for particular input set(s)
focuses on design errors
IntrospectionIntrospection performs performs execution time execution time monitoring, analysis, recoverymonitoring, analysis, recovery
current approach focuses on transient errorscurrent approach focuses on transient errors
can, in principle, also address design errors can, in principle, also address design errors
Verification and Validation (V&V) Verification and Validation (V&V) is applied before actual program execution
to static source program in the context of test executions for particular input set(s)
focuses on design errors
IntrospectionIntrospection performs performs execution time execution time monitoring, analysis, recoverymonitoring, analysis, recovery
current approach focuses on transient errorscurrent approach focuses on transient errors
can, in principle, also address design errors can, in principle, also address design errors
Limitations of V&V
Verification: theoretical limitsVerification: theoretical limits undecidability : many problems are inherently unsolvable
halting problem constant propagation
NP-completeness: many theoretically solvable problems are intractable due to exponential complexity
SAT problem many graph problems
Model checking: subject to scalability challenge Model checking: subject to scalability challenge exponential growth of state space
TestTest tests can prove presence or absence of faults for specific input sets
but they cannot prove their absence for all inputs (Edsger Dijkstra)
V&V is V&V is inherently inherently unableunable to deal with transient errors or to deal with transient errors or execution anomaliesexecution anomalies
Verification: theoretical limitsVerification: theoretical limits undecidability : many problems are inherently unsolvable
halting problem constant propagation
NP-completeness: many theoretically solvable problems are intractable due to exponential complexity
SAT problem many graph problems
Model checking: subject to scalability challenge Model checking: subject to scalability challenge exponential growth of state space
TestTest tests can prove presence or absence of faults for specific input sets
but they cannot prove their absence for all inputs (Edsger Dijkstra)
V&V is V&V is inherently inherently unableunable to deal with transient errors or to deal with transient errors or execution anomaliesexecution anomalies
Introspection Can Complement V&V
Introspection performs Introspection performs execution time execution time monitoring, monitoring, analysis, recoveryanalysis, recovery
Introspection can deal with transient errors, execution Introspection can deal with transient errors, execution anomalies, performance problemsanomalies, performance problems this capability is inherently beyond the scope of V&V technologythis capability is inherently beyond the scope of V&V technology
and it can be extended to deal with design errorsand it can be extended to deal with design errors
Future Goal: integration of introspection with current V&V Future Goal: integration of introspection with current V&V technology into a comprehensive V&V schemetechnology into a comprehensive V&V scheme
Introspection performs Introspection performs execution time execution time monitoring, monitoring, analysis, recoveryanalysis, recovery
Introspection can deal with transient errors, execution Introspection can deal with transient errors, execution anomalies, performance problemsanomalies, performance problems this capability is inherently beyond the scope of V&V technologythis capability is inherently beyond the scope of V&V technology
and it can be extended to deal with design errorsand it can be extended to deal with design errors
Future Goal: integration of introspection with current V&V Future Goal: integration of introspection with current V&V technology into a comprehensive V&V schemetechnology into a comprehensive V&V scheme
1.1. IntroductionIntroduction
2.2. An Introspection Framework for Fault ToleranceAn Introspection Framework for Fault Tolerance
3.3. V&V versus IntrospectionV&V versus Introspection
4.4. Automatic Generation of Fault-Tolerant CodeAutomatic Generation of Fault-Tolerant Code
5.5. Concluding RemarksConcluding Remarks
1.1. IntroductionIntroduction
2.2. An Introspection Framework for Fault ToleranceAn Introspection Framework for Fault Tolerance
3.3. V&V versus IntrospectionV&V versus Introspection
4.4. Automatic Generation of Fault-Tolerant CodeAutomatic Generation of Fault-Tolerant Code
5.5. Concluding RemarksConcluding Remarks
Contents
Effects of Single Event Upsets (SEUs)
SEUs and MBUs are radiation-induced transient hardware SEUs and MBUs are radiation-induced transient hardware errors, which may corrupt software in multiple ways:errors, which may corrupt software in multiple ways: instruction codes and addresses
user data structures
synchronization objects
protected OS data structures
synchronization and communication
Potential effects include:Potential effects include: wrong or illegal instruction codes and addresses
wrong user data in registers, cache, or DRAM
control flow errors
unwarranted exceptions
hangs and crashes
synchronization and communication faults
SEUs and MBUs are radiation-induced transient hardware SEUs and MBUs are radiation-induced transient hardware errors, which may corrupt software in multiple ways:errors, which may corrupt software in multiple ways: instruction codes and addresses
user data structures
synchronization objects
protected OS data structures
synchronization and communication
Potential effects include:Potential effects include: wrong or illegal instruction codes and addresses
wrong user data in registers, cache, or DRAM
control flow errors
unwarranted exceptions
hangs and crashes
synchronization and communication faults
Design and Interaction Faults …
…can result in errors similar to those that can be caused by transient faults
…can result in errors similar to those that can be caused by transient faults
Fault Examples: Sequential threads
S1: a= exp …
S2: x=f(a)
fault fault
S1: a= exp …
S2: x=f(a)
X undefined assignment to x
for i in 1:n { S(i)) }
corruption ofassignment to variable a
fault fault
corruption ofvariable n
destruction of loop rangefor i in ?:? { S(i)) }
if cond then S1 else S2end if
fault fault
corruption ofcondition cond
if ??? then ? else ?end if
branch to wrong orundefined target
X
X
Example: Parallel Algorithms Safety Violation
repeat forever …wait(mutex)
signal(mutex) …end repeat
repeat forever …wait(mutex)
signal(mutex) …end repeat
Resource R
Critical Section 1: Exclusive access of
P1 to R
Critical Section 2: Exclusive access of
P2 to R
P1 P2mutex mutex
Corruption of mutex destroys safety property of program
mutex
fau
ltfa
ult
XX
P1: P2:
Example: Parallel Algorithms Data Race
forall i in D, independent , A(f(i)) = exp(i) end
A(1)…
A(j)…A(n)
data race
A(f(i1))A(f(i1))
A(f(i2))A(f(i2))
f(i1)=f(i2)=jf(i1)=f(i2)=j
Fault resulting in theexistence of indicesi1, i2, such that:
Fault resulting in theexistence of indicesi1, i2, such that:
Example: Parallel Algorithms Deadlock
P0P0
P1P1 Pn-1Pn-1
R1R1
R1R1
RnRn
……
Cyclic resource allocation graph results in deadlock(as a result of programming or runtime error)
Static approach/analysis: deadlock preventionRuntime analysis: deadlock avoidance and detection
High-Performance Computing has produced a wealth of High-Performance Computing has produced a wealth of methods for parallel program analysis since the 1990smethods for parallel program analysis since the 1990s source program static analysis and restructuring dynamic performance and behavior analysis of program executions
Many of these methods can be adapted to or directly Many of these methods can be adapted to or directly applied to fault toleranceapplied to fault tolerance static program analysis (variable use, dependences, call chains) dynamic analysis of program and data flow
This provides a basis for the generation of error-This provides a basis for the generation of error-checking assertions and error correction and recovery checking assertions and error correction and recovery
High-Performance Computing has produced a wealth of High-Performance Computing has produced a wealth of methods for parallel program analysis since the 1990smethods for parallel program analysis since the 1990s source program static analysis and restructuring dynamic performance and behavior analysis of program executions
Many of these methods can be adapted to or directly Many of these methods can be adapted to or directly applied to fault toleranceapplied to fault tolerance static program analysis (variable use, dependences, call chains) dynamic analysis of program and data flow
This provides a basis for the generation of error-This provides a basis for the generation of error-checking assertions and error correction and recovery checking assertions and error correction and recovery
Leveraging Results from HPC ProgramAnalysis Technology for Fault Tolerance
Analysis Static analysis and profiling determine properties of Static analysis and profiling determine properties of
dynamic program behavior dynamic program behavior beforebefore actual execution actual execution
Analysis of Sequential ThreadsAnalysis of Sequential Threads control flow graph: represents the control structure in a program unit
data flow analysis: solves data flow problems over a program graph
dependence analysis: determines relationships between assignments of values to (possibly subscripted, or pointer) variables and their uses
program slice: the set of all statements that can affect a variable’s value
call graph: representing method/function/procedure calling relationships
Analysis of Parallel ConstructsAnalysis of Parallel Constructs data parallel loops: analysis of “independence” property
locality and communication analysis
race condition analysis
safety and liveness analysis
deadlock analysis
Static analysis and profiling determine properties of Static analysis and profiling determine properties of dynamic program behavior dynamic program behavior beforebefore actual execution actual execution
Analysis of Sequential ThreadsAnalysis of Sequential Threads control flow graph: represents the control structure in a program unit
data flow analysis: solves data flow problems over a program graph
dependence analysis: determines relationships between assignments of values to (possibly subscripted, or pointer) variables and their uses
program slice: the set of all statements that can affect a variable’s value
call graph: representing method/function/procedure calling relationships
Analysis of Parallel ConstructsAnalysis of Parallel Constructs data parallel loops: analysis of “independence” property
locality and communication analysis
race condition analysis
safety and liveness analysis
deadlock analysis
Control Flow and Data Flow Analysis
Control Flow Graph Control Flow Graph G=(N, E, nG=(N, E, n00)) models the control structure in a
program unit (control flow analysis)
N : set of basic blocks
E: control transfers between basic blocks
n0 : initial node
Monotone Data Flow Systems Monotone Data Flow Systems M=(L, F, G, g)M=(L, F, G, g) L is a bounded semilattice representing
the objects of interest, e.g., “definitions” reaching a basic block, variable-value associations, available expressions
F: monotone function space over L
G=(N,E, n0) flow graph; g: N F
under appropriate conditions, an optimal “meet over all paths (MOP)” solution can be determined by a general algorithm
Control Flow Graph Control Flow Graph G=(N, E, nG=(N, E, n00)) models the control structure in a
program unit (control flow analysis)
N : set of basic blocks
E: control transfers between basic blocks
n0 : initial node
Monotone Data Flow Systems Monotone Data Flow Systems M=(L, F, G, g)M=(L, F, G, g) L is a bounded semilattice representing
the objects of interest, e.g., “definitions” reaching a basic block, variable-value associations, available expressions
F: monotone function space over L
G=(N,E, n0) flow graph; g: N F
under appropriate conditions, an optimal “meet over all paths (MOP)” solution can be determined by a general algorithm
1
2
3 4
5
6
n0
N={1,2,3,4,5,6}E={(1,2),(2,3), (2,4),(3,5),(4,5),(5,6),(6,2)}L: P(DEFS) powerset of all “definitions”g(n) transformation function for n N
g(1)
g(2)
g(4)g(3)
g(5)
g(6)
For example, the transformation along path 12356 can be computed asg(6) o g(5) o g(3) o g(2) o g(1)(L0)
Dependence Dependence is a relation in a set of statement executions is a relation in a set of statement executions that characterizes their access (R/W) to common variablesthat characterizes their access (R/W) to common variables
In general, dependence analysis may In general, dependence analysis may not not be statically be statically feasible—for example, in a 2D Euler solver performing a feasible—for example, in a 2D Euler solver performing a sweep over an irregular gridsweep over an irregular grid
“ “Optimistic parallelization” (or the correctness of an Optimistic parallelization” (or the correctness of an independentindependent assertion) may have to be dynamically verified assertion) may have to be dynamically verified
Dependence Dependence is a relation in a set of statement executions is a relation in a set of statement executions that characterizes their access (R/W) to common variablesthat characterizes their access (R/W) to common variables
In general, dependence analysis may In general, dependence analysis may not not be statically be statically feasible—for example, in a 2D Euler solver performing a feasible—for example, in a 2D Euler solver performing a sweep over an irregular gridsweep over an irregular grid
“ “Optimistic parallelization” (or the correctness of an Optimistic parallelization” (or the correctness of an independentindependent assertion) may have to be dynamically verified assertion) may have to be dynamically verified
Dependence Analysis
for e in all_edges do [ independent ] … delta=f(GRID(vx1(e)).V1, GRID(vx2(e)).V1) … GRID(vx1(e)).V2 -= delta GRID(vx2(e)).V2 += delta …end
…
…
…… e
GRID(vx1(e))
GRID(vx2(e))
Generation of Assertion-Based Fault Tolerance
P
source program
instrumented source programwith error detectors
P*
instrumentationerror detector generation
program analysis profiling
Assertions
User
Knowledge Base
assert (A(i)>B(i)) and (D<Epsilon) at L; error (FT1,i,A(i),B(i),D,Epsilon)
assert (x < y+1000) invariant in (Loop1); error (FT2,…)
assert independence in (ParLoop); error (FT3,…)
Examples:
Future deep-space missions will require on-board high-Future deep-space missions will require on-board high-capability computing for support of autonomy and sciencecapability computing for support of autonomy and science
IntrospectionIntrospection provides a generic framework for dynamic monitoring and analysis of provides a generic framework for dynamic monitoring and analysis of
program executionprogram execution
can exploit multi-core technologycan exploit multi-core technology
a prototype framework for introspection supporting fault tolerance has a prototype framework for introspection supporting fault tolerance has been implemented for the Tile64 architecturebeen implemented for the Tile64 architecture
Future work will address automatic analysis and generation of Future work will address automatic analysis and generation of application-adaptive fault-tolerant code application-adaptive fault-tolerant code
Integration of introspection with conventional V&V technology adds a new dimension to fault tolerance
Future deep-space missions will require on-board high-Future deep-space missions will require on-board high-capability computing for support of autonomy and sciencecapability computing for support of autonomy and science
IntrospectionIntrospection provides a generic framework for dynamic monitoring and analysis of provides a generic framework for dynamic monitoring and analysis of
program executionprogram execution
can exploit multi-core technologycan exploit multi-core technology
a prototype framework for introspection supporting fault tolerance has a prototype framework for introspection supporting fault tolerance has been implemented for the Tile64 architecturebeen implemented for the Tile64 architecture
Future work will address automatic analysis and generation of Future work will address automatic analysis and generation of application-adaptive fault-tolerant code application-adaptive fault-tolerant code
Integration of introspection with conventional V&V technology adds a new dimension to fault tolerance
Concluding Remarks
Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise,does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, CaliforniaInstitute of Technology
Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise,does not constitute or imply its endorsement by the United States Government or the Jet Propulsion Laboratory, CaliforniaInstitute of Technology