Exploration Systems Engineering: Risk Module
Risk Module: Risk Management, Fault Trees and Failure Mode Effects Analysis
Exploration Systems Engineering, version 1.0
Module Purpose: Risk
♦ Understand risk, risk management, fault tree analysis and failure mode effects analysis in the context of project development
♦ Acknowledge that risks are inevitable and recognize that through systematic management and analytic techniques they can be reduced
♦ Review three techniques that are used to discover, assess, rank and mitigate risk - risk management, fault tree analysis and failure mode effects analysis
What are Risks and Risk Management?
♦ Risks are potential events that have negative impacts on safety or project technical performance, cost or schedule
♦ Risks are an inevitable fact of life – risks can be reduced but never eliminated
♦ Risk management comprises purposeful thought about the sources, magnitude, and mitigation of risk, and actions directed toward its balanced reduction
♦ The same tools and perspectives that are used to discover, manage and reduce risks can be used to discover, manage and increase project opportunities - opportunity management
What is Risk Management?
Risk management is a continuous and iterative decision-making technique designed to improve the probability of success. It is a proactive approach that:
♦ Seeks or identifies risks
♦ Assesses the likelihood and impact of these risks
♦ Develops mitigation options for all identified risks
♦ Identifies the most significant risks and chooses which mitigation options to implement
♦ Tracks progress to confirm that cumulative project risk is indeed declining
♦ Communicates and documents the project risk status
♦ Repeats this process throughout the project life
Risk Management Considers the Entire Development and Operations Life of a Project
Risk Type: Example
♦ Technical Performance Risk: Failure to meet a spacecraft technical requirement or specification during verification
♦ Cost Risk: Failure to stay within a cost cap for the project
♦ Programmatic Risk: Failure to secure long-term political support
♦ Schedule Risk: Failure to meet a critical launch window
♦ Liability Risk: Spacecraft deorbits prematurely, causing damage over the debris footprint
♦ Regulatory Risk: Failure to secure proper approvals for launch of nuclear materials
♦ Operational Risk: Failure of spacecraft during mission
♦ Safety Risk: Hazardous material release while fueling during ground operations
♦ Supportability Risk: Failure to resupply sufficient material to support human presence as planned
Every NASA Space Flight Project Begins with a Plan for Risk Management
♦ This plan reflects the project’s risk management philosophy:
• Priority (criticality to long-term strategic plans)
• National significance
• Mission lifetime (primary baseline mission)
• Estimated project life cycle cost
• Launch constraints
• In-flight maintenance feasibility
• Alternative research opportunities or re-flight opportunities
♦ The risk management philosophy is reflected in a number of ways:
• Whether single point failures are allowed
• Whether the system is monitored continuously during operations
• How much slack is in the development schedule
• How technical resource margins (e.g., mass, power, MIPS) are allocated throughout the development
Other Factors to Consider in Assessing Risk (but not limited to)…
♦ Complexity of management and technical interfaces
♦ Design and test margins
♦ Mission criticality
♦ Availability and allocation of resources such as mass, power, volume, data volume, data rates, and computing resources
♦ Scheduling and manpower limitations
♦ Ability to adjust to cost and funding profile constraints
♦ Mission operations
♦ Data handling, i.e., acquisition, archiving, distribution and analysis
♦ Launch system characteristics
♦ Available facilities
Risk Identification
♦ Risks are identified by the development team, peer reviews, lessons from past projects and expert review
♦ Lessons from past projects are captured via ‘trigger questions’, or questions that challenge a development strategy or design solution
♦ The project risk status and top ten risk list are reviewed periodically - usually monthly - and at the project milestone reviews
Example Risk Trigger Questions
♦ Have requirements been implemented such that a small change in requirements has the potential to cause large cost, performance or schedule system ramifications?
♦ Do designs or requirements push the current state-of-the-art?
♦ Has the concept for operating, maintaining, decommissioning or disposal of the system been adequately defined to ensure the identification of all requirements?
♦ Has an independent cost estimate (ICE) been performed?
♦ Is the schedule adequate to handle the level of requirements or objectives changes that are occurring or are likely to occur?
♦ Have the necessary facilities for environmental test been identified and availability problems been resolved?
More Considerations for Risk Discovery
While each space project has its unique risks, a list of the underlying sources of risks would include the following:
♦ Technical complexity - many design constraints or many dependent operational sequences having to occur in the right sequence and at the right time
♦ Organizational complexity - many independent organizations having to perform with limited coordination
♦ Inadequate margins or reserves
♦ Inadequate implementation plans
♦ Unrealistic schedules
♦ Total and year-by-year budgets mismatched to the actual implementation risks
♦ Over-optimistic designs pressured by mission expectations
♦ Limited engineering analysis and understanding due to inadequate engineering tools and models
♦ Limited understanding of the mission’s space environments
♦ Inadequately trained or inexperienced project personnel
♦ Inadequate processes or inadequate adherence to proven processes
Pause and Learn Opportunity
Engage the class in identifying risks for a familiar project.
• What kinds of risks are identified?
• What is the basis for their search for risks?
After the class has thought for a while, the instructor could present some trigger questions, which may help discover new risks and show the value of the trigger questions.
Cartoon: Dilbert Identifies Risks
© United Features Syndicate, Inc.
The Benefits of Preparing for the Unexpected
Mars Spirit Rover Flash Memory Problem
“The thing that strikes me most about all this is how critical it was to have that INIT_CRIPPLED command in the system. It’s not the kind of command that you’d ever expect to use under normal conditions on Mars. But back during the earliest days of the project Glenn realized that someday we might need the flexibility to deal with a broken flash file system, and he put INIT_CRIPPLED in the system and left it there. And when the anomaly hit, it saved the mission.”
– From “Roving Mars” by Steve Squyres, Hyperion 2005
Be prepared for the low-probability event with a huge consequence.
Background: "On January 21, 2004 (Sol 18), Spirit abruptly ceased communicating with mission control. The next day the rover radioed a 7.8 bit/s beep, confirming that it had received a transmission from Earth but indicating that the spacecraft believed it was in a fault mode."
After Identification Risks are Assessed
♦ Risks are assessed by characterizing the probability that a project will experience an undesired event and the consequences, impact or severity of the undesired event, were it to occur
♦ Risks can be compared on iso-curves consisting of a likelihood measure and a consequence measure
♦ Since the assessment of the likelihood and consequence of a risk is both subjective and significantly uncertain, the characterization of risk is either qualitative (low, medium or high) or semi-quantitative (risks are captured on a 5x5 matrix)
[Figure: likelihood (probability, 0.0 to 1.0) plotted against severity of consequence, with iso-risk curves separating the low-, medium- and high-risk regions]
An Example of Some Semi-Quantitative Definitions to Enable a Project to Compare and Rank Risks
Impact of Consequences
Class I, Catastrophic (Scale 5)
• Technical: A condition that may cause death or permanently disabling injury, facility destruction on the ground, or loss of crew, major systems, or vehicle during the mission
• Schedule: Launch window to be missed
• Cost: Cost overrun > 50 % of planned cost
Class II, Critical (Scale 4)
• Technical: A condition that may cause severe injury or occupational illness, or major property damage to facilities, systems, equipment, or flight hardware
• Schedule: Schedule slippage causing launch date to be missed
• Cost: Cost overrun 15 % to 50 % of planned cost
Class III, Moderate (Scale 3)
• Technical: A condition that may cause minor injury or occupational illness, or minor property damage to facilities, systems, equipment, or flight hardware
• Schedule: Internal schedule slip that does not impact launch date
• Cost: Cost overrun 2 % to 15 % of planned cost
Class IV, Negligible (Scale 2)
• Technical: A condition that could cause the need for minor first aid treatment but would not adversely affect personal safety or health; damage to facilities, equipment, or flight hardware more than normal wear and tear level
• Schedule: Internal schedule slip that does not impact internal development milestones
• Cost: Cost overrun < 2 % of planned cost
Probability of Occurrence
Scale: Measure
5: Near certain to occur (80-100%)
4: Highly likely to occur (60-80%)
3: Likely to occur (40-60%)
2: Unlikely to occur (20-40%)
1: Not likely; improbable (0-20%)
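The semi-quantitative definitions above can be turned into simple lookup functions. The following is a minimal Python sketch, not part of the original module. The function names are hypothetical, and because the published bands share their end points (for example, 80% appears in two bands), the boundary handling shown in the comments is an assumption a project would need to settle for itself.

```python
def probability_scale(p: float) -> int:
    """Map a probability in [0, 1] to the 1-5 probability-of-occurrence scale.

    Assumption: a value exactly on a shared boundary (0.2, 0.4, 0.6, 0.8)
    is placed in the lower band.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Upper edges of bands 1 to 4; anything above 0.8 is band 5.
    for scale, upper in enumerate((0.2, 0.4, 0.6, 0.8), start=1):
        if p <= upper:
            return scale
    return 5


def cost_consequence_scale(overrun_fraction: float) -> int:
    """Map a non-negative cost overrun (fraction of planned cost) to the scale.

    Assumption: 2% and 15% fall in the higher class; exactly 50% stays in
    Class II because Class I is defined as "> 50 %".
    """
    if overrun_fraction < 0.02:
        return 2  # Class IV, Negligible
    if overrun_fraction < 0.15:
        return 3  # Class III, Moderate
    if overrun_fraction <= 0.50:
        return 4  # Class II, Critical
    return 5      # Class I, Catastrophic


print(probability_scale(0.5), cost_consequence_scale(0.30))
```

Only the cost column is coded here because it is the one column with numeric thresholds; the technical and schedule columns are qualitative and would be scored by judgment.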
A 5x5 Risk Matrix Provides a Quick Visual Comparison of All Project Risks
High risk – mission success jeopardized – immediate action required
Medium risk – review regularly – contingent action if it does not improve
Low risk – watch and review periodically
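A minimal Python sketch of how a 5x5 matrix can assign each risk to a high, medium or low region. The module does not give the boundaries between regions, so the two thresholds below, and the use of the product likelihood x consequence as the score, are assumptions chosen only for illustration; the action text comes from the slide.

```python
HIGH_THRESHOLD = 15    # assumed boundary, not taken from the module
MEDIUM_THRESHOLD = 6   # assumed boundary, not taken from the module

ACTIONS = {
    "High": "immediate action required",
    "Medium": "review regularly",
    "Low": "watch and review periodically",
}


def risk_level(likelihood: int, consequence: int) -> str:
    """Classify a risk from its 1-5 likelihood and 1-5 consequence scores."""
    for value in (likelihood, consequence):
        if value not in range(1, 6):
            raise ValueError("scores must be integers from 1 to 5")
    score = likelihood * consequence
    if score >= HIGH_THRESHOLD:
        return "High"
    if score >= MEDIUM_THRESHOLD:
        return "Medium"
    return "Low"


def render_matrix() -> str:
    """Draw the matrix as text: likelihood on rows, consequence on columns."""
    rows = []
    for likelihood in range(5, 0, -1):
        cells = [risk_level(likelihood, c)[0] for c in range(1, 6)]
        rows.append(f"{likelihood} " + " ".join(cells))
    rows.append("  1 2 3 4 5")
    return "\n".join(rows)


print(render_matrix())
level = risk_level(4, 4)
print(level, "-", ACTIONS[level])
```

Real projects often draw the region boundaries asymmetrically, for example treating any consequence-5 risk as at least medium, so the thresholds should be read as placeholders.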
Top Risks and their Trends are Periodically Reviewed for the SOFIA Project
[Figure: SOFIA Risk Matrix, a 5x5 grid of likelihood (1-5) versus consequences (1-5) with risks 1 to 8 plotted by rank]
Legend
• Approach: M - Mitigate, W - Watch, A - Accept, R - Research
• Criticality (L x C): High, Med, Low
• Trend: Decreasing (Improving), Increasing (Worsening), Unchanged, New Since Last Period
Rank  Risk ID  Approach  Risk Title
1     DFRC-34  R         Landing Gear Door System Failure
2     DFRC-12  M         Sched Integration problems structure vs. avionics
3     DFRC-07  W         Cost growth for engine components
4     DFRC-24  A         Quality Control Resources insufficient
5     DFRC-01  W         Avionics software behind schedule
6     DFRC-11  R         Payload Capacity & Volume Trade-offs design issues
7     DFRC-04  R         Limited Flight Envelope, due to technical issues
8     DFRC-02  R         More flight testing may be required for Soft V&V
Top Risks and their Trends are Periodically Reviewed for the Constellation SE&I
[Figure: 5x5 risk matrix of LIKELIHOOD (1-5) versus CONSEQUENCE (1-5) with the ranked risks plotted by number]
SE&I Top Risk List
Columns: Rank, Trend, Title, Owning Team, LIKE, Consequence (SAFE, PERF, SCHED, COST)
1. 1677 - Ares I/Orion Ascent Aeroacoustic Environments; Trend: N; Owning Team: FP_SIG; ratings: 44, 555
2. 1676 - Structural loads on CEV and LSAM during TLI; Trend: N; Owning Team: FP_SIG; ratings: 44554
3. 1122 - Requirements Maturation; Owning Team: SE&I-PRIMO; ratings: 22202
4. 1135 - Program Visibility for Closing the Architecture; Owning Team: SE&I-AT&A; ratings: 40403
5. 1603 - (SRR) Abort Site Sea State Limits Launch Availability; Trend: N; Owning Team: SE&I_SOA; ratings: 44435
6. 1125 - Software Development and Assurance; Owning Team: CSI_SIG; ratings: 33334
7. 1195 - CxP Lifecycle cost; Owning Team: SE&I_SOA; ratings: 40004
8. 1046 - Tailoring of Human-Rating requirements; Owning Team: SE&I_PTI_HR; ratings: 33003
Legend
• Risk level: Top Directorate Risk (TDR), Top Program Risk (TPR), Top Project Risk (TProjR)
• Trend: Decreasing (Improving), Increasing (Worsening), Unchanged
The Status of the Most Significant Risks and Their Mitigation Options are Reviewed Periodically
♦ Title of risk
♦ Description or root cause
♦ Possible categorizations
• System or subsystem
• Cause category (technology, programmatic, cost, schedule, etc.)
• Resources affected (budget, schedule slack, technical margins, etc.)
♦ Owner
♦ Assessment of implementation risk or mission risk
• Likelihood - estimate of the probability of the risk event
• Consequences - estimate of the performance, cost, safety and schedule effects
♦ Mitigation
• Description, including costs of mitigation options
• Mitigation option leverage or reduction in the assessed risk
• Current mitigation activities
• Current trends in risk significance - likelihood and impact
♦ Significant milestones
• Opening and closing of the window of occurrence
• Decision points for mitigation implementation effectiveness
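The review items above map naturally onto a record in a risk register. The following Python sketch is illustrative only: the class name, field names, sample entries and the choice to rank by likelihood x consequence are all assumptions, not a format prescribed by the module.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RiskRecord:
    """One risk-list entry, loosely following the items reviewed above."""
    title: str
    root_cause: str
    owner: str
    likelihood: int    # 1 (improbable) to 5 (near certain)
    consequence: int   # 1 to 5, worst of performance, cost, safety, schedule
    cause_category: str = "technical"
    mitigation: str = ""
    milestones: List[str] = field(default_factory=list)

    @property
    def criticality(self) -> int:
        # Assumption: rank risks by likelihood x consequence.
        return self.likelihood * self.consequence


def top_risks(register: List[RiskRecord], n: int = 10) -> List[RiskRecord]:
    """Return the n most critical risks, highest criticality first."""
    return sorted(register, key=lambda r: r.criticality, reverse=True)[:n]


# Hypothetical entries, for illustration only.
register = [
    RiskRecord("Late avionics software", "Understaffed team", "SW lead", 4, 3),
    RiskRecord("Test facility unavailable", "Shared chamber", "I&T lead", 2, 4),
    RiskRecord("Engine cost growth", "Single supplier", "Propulsion lead", 3, 5),
]
for rank, risk in enumerate(top_risks(register), start=1):
    print(rank, risk.title, risk.criticality)
```

Keeping likelihood and consequence as separate fields, rather than storing only their product, lets each periodic review record a trend in either one.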
Part 2 of Risk Module: Fault Tree Analysis and Event Tree Analysis
Fault Tree Analysis Supports Design Decisions and Failure Investigations
♦ Fault Tree Analysis (FTA) uses a top-down symbolic logic model and estimates of the failure probabilities of ‘initiators’ to estimate the probability of occurrence (failure) of the pre-determined, undesirable ‘top’ event
♦ An initiator is a credible undesirable event that is a contributing cause of top event failure
♦ ‘Cut sets’ are groups of initiators that, when taken together, cause top event failure
♦ ‘Path sets’ are groups of initiators such that, if none of them occurs, the top event does not fail
♦ FTA is both a design and a diagnostic tool
♦ As a design tool, FTA is used to compare alternative design solutions and the resulting top event probability
♦ As a diagnostic tool, FTA is used to investigate scenarios that may have led to the top event failure, leading to an estimate of the most likely cut sets
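A small worked example may make the terms concrete. The Python sketch below evaluates a made-up three-initiator tree, TOP = A OR (B AND C). The tree, the initiator names and their probabilities are hypothetical, and the gate formulas assume the initiators are statistically independent.

```python
from itertools import combinations
from math import prod

# Hypothetical initiator failure probabilities.
P = {"A": 0.01, "B": 0.10, "C": 0.20}


def top_event(failed: frozenset) -> bool:
    """Logic of the hypothetical tree: TOP = A OR (B AND C)."""
    return "A" in failed or ("B" in failed and "C" in failed)


def and_gate(*probs: float) -> float:
    """All inputs must fail (independent inputs assumed)."""
    return prod(probs)


def or_gate(*probs: float) -> float:
    """At least one input fails (independent inputs assumed)."""
    return 1.0 - prod(1.0 - p for p in probs)


def top_probability() -> float:
    return or_gate(P["A"], and_gate(P["B"], P["C"]))


def minimal_cut_sets() -> list:
    """Smallest groups of initiators that together cause the top event."""
    names = sorted(P)
    found = []
    # Visiting smaller groups first guarantees each kept set is minimal.
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            group = frozenset(combo)
            if top_event(group) and not any(cut <= group for cut in found):
                found.append(group)
    return found


print(round(top_probability(), 4))                 # 0.0298
print([sorted(cut) for cut in minimal_cut_sets()])  # [['A'], ['B', 'C']]
```

Here the single-initiator cut set {A} is a single point failure, while {B, C} needs two failures at once. The brute-force search is only practical for small trees; larger analyses use dedicated reduction algorithms.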
Fault Tree Analysis
Fault tree analysis is a graphical representation of the combination of faults that will result in the occurrence of some (undesired) top event. In the construction of a fault tree, successive subordinate failure events are identified and logically linked to the top event. The linked events form a tree structure connected by symbols called gates.
Refer to NASA Reference Publication 1358: System Engineering “Toolbox” for Design-Oriented Engineers, Section 3.6: Fault Tree Analysis (Handout)
Particular points:
• And/Or Gates explanation
• Example Fault Tree (Fig 3-20)
Event Trees
♦ Event trees can be viewed as a special case of fault trees, where the branches are all ORs weighted by their probabilities.
♦ Event trees are generated both in the success and failure domains.
♦ This technique explores system responses to an initiating “challenge” and enables assessment of the probability of an unfavorable or favorable outcome. The system challenge may be a failure or fault, an undesirable event, or a normal system operating command.
♦ In constructing the event tree, one traces each path to eventual success or failure.
♦ This technique is typically performed in phase C but may also be performed in phase B.
♦ See NASA Reference Publication 1358: System Engineering “Toolbox” for Design-Oriented Engineers section 3.8 for additional discussion.
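Tracing each path to success or failure amounts to multiplying the branch probabilities along the path. The Python sketch below does this for a made-up challenge followed by two system responses. The branch names and probabilities are hypothetical, the branches are assumed independent, and overall success is assumed to require every response to succeed.

```python
from itertools import product
from math import prod

# Hypothetical responses to an initiating challenge: (name, P(success)).
BRANCHES = [("fault detected", 0.9), ("backup engages", 0.8)]


def paths():
    """Yield (outcomes, probability) for every path through the tree."""
    for outcomes in product([True, False], repeat=len(BRANCHES)):
        probability = prod(
            p_ok if ok else 1.0 - p_ok
            for (_, p_ok), ok in zip(BRANCHES, outcomes)
        )
        yield outcomes, probability


def success_probability() -> float:
    """Assumption: the system succeeds only if every response succeeds."""
    return sum(p for outcomes, p in paths() if all(outcomes))


for outcomes, p in paths():
    label = ", ".join(
        f"{name}: {'yes' if ok else 'no'}"
        for (name, _), ok in zip(BRANCHES, outcomes)
    )
    print(f"{label} -> {p:.2f}")
print(f"P(success) = {success_probability():.2f}")
```

The path probabilities always sum to 1, which is a useful check on a hand-drawn tree. A real event tree would usually prune branches that no longer matter once an earlier response has failed; enumerating them all gives the same totals.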
Will the Stage Make it from Hangman’s Hill to Placer Gulch?
Probability of no horses, by station:
• Stations 1, 2, 3: 0.2
• Station 4: 0.1
Placer Gulch event tree example from a Safety & Mission Assurance training course by Pat Clemens of Sverdrup.
Fault Tree Analysis of the Placer Gulch Stage
Part 3 of Risk Module: Failure Mode Effects Analysis
Failure Mode Effects Analysis
• Objective
– To ensure all failure modes have been identified and evaluated
• Technique
– Select a method to rank project failure modes
– Identify failure modes including all single point failure modes
– Analyze failure modes and their mission effect
– Determine those failure modes that might benefit from corrective action, e.g., alternative designs, redundancy, increased reliability
– Determine which, if any, corrective actions will be implemented
Failure Mode Effects Analysis
♦ FMEA is a design tool for identifying risk in the system or mission design, with the intent of mitigating those risks with design changes. The FMEA risk mitigation:
1. Recognizes and evaluates the potential failure of a system and its effects;
2. Identifies actions which could eliminate or reduce the chance of a potential failure occurring.
♦ FMEA is initiated in Phase B (Preliminary Design) and used to support design decisions in Phase C (Final Design).
Failure Mode and Effects Analysis
FMEA worksheet columns and the question each one answers:
• Item / Function: What are the functions or requirements?
• Potential Failure Mode: What can go wrong? (No function, partially degraded function, intermittent function, unintended function)
• Potential Effects of Failure: What are the effects?
• Sev: How bad is it?
• Class
• Potential Causes/Mechanism(s) of Failure: What are the cause(s)?
• Occur: How often does it happen?
• Current Controls (Prevention/Detection): How can this be prevented and detected?
• Detec: How good is this method at detecting it?
• RPN
• Recommended Action(s): What can be done? (Design changes; process changes; special controls; changes to standards, procedures, or guides)
• Responsibility & Target Completion Date: Who is going to do it and when?
• Action Results (Actions Taken, Sev, Occ, Det, RPN): What did they do and what are the outcomes?
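The worksheet does not say how the RPN column is filled in. A common convention, assumed here, is that the risk priority number is the product of the severity, occurrence and detection ratings, each scored from 1 to 10, with a higher detection score meaning the failure is harder to detect. The Python sketch below applies that convention to made-up rows; the class name and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FmeaRow:
    """One worksheet row, reduced to the three rated columns."""
    item: str
    failure_mode: str
    severity: int     # Sev: how bad is it?
    occurrence: int   # Occur: how often does it happen?
    detection: int    # Detec: higher means harder to detect (assumed)

    def __post_init__(self):
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be in 1..10, got {value}")

    @property
    def rpn(self) -> int:
        # Assumed convention: RPN = Sev x Occ x Det.
        return self.severity * self.occurrence * self.detection


# Hypothetical rows, for illustration only.
rows = [
    FmeaRow("Fuel valve", "No function (stuck closed)", 9, 2, 4),
    FmeaRow("Star tracker", "Intermittent function", 6, 5, 3),
    FmeaRow("Heater", "Unintended function (stuck on)", 5, 3, 2),
]
ranked = sorted(rows, key=lambda r: r.rpn, reverse=True)
for row in ranked:
    print(f"{row.rpn:4d}  {row.item}: {row.failure_mode}")
```

Note that the highest-severity row is not the highest-ranked one here, which is why many teams review high-severity failure modes regardless of their RPN. After a corrective action is taken, the same row can be re-scored to fill in the Action Results columns.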
Module Summary: Risk
♦ Risks are inevitable: they can be reduced but not eliminated.
♦ Risk management is a proactive systematic approach to assessing risks, generating alternatives and reducing cumulative project risk.
♦ Fault Tree Analysis is both a design and a diagnostic tool that uses estimated failure probabilities of initiators to estimate the probability of the pre-determined, undesirable ‘top’ event.
♦ Failure Mode Effects Analysis is a design tool for identifying risk in the system design, with the intent of mitigating those risks with design changes.
Backup Slides for Risk Module
Uncertainties that Plague Projects
Mission Objectives
Uncertainties:
♦ Will the baseline system satisfy the needs & objectives?
♦ Are they the best ones?
Offsets:
♦ Thorough study
♦ Analyses
♦ Cost & schedule credibility
Technical Factors
Uncertainties:
♦ Can baseline technology achieve the objectives?
♦ Can the specified technology be attained?
♦ Are all the requirements known?
Offsets:
♦ Technology development plan
♦ Paper studies
♦ Design reviews
♦ Establish performance margins
♦ Engineering model test and prototyping
♦ Test & evaluation
Internal Factors
Uncertainties:
♦ Can the plan and strategy meet the objectives?
♦ Resources
• Manpower skills
• Time
• Facilities
Offsets:
♦ Program strategy
♦ Budget allocations
♦ Contingency planning
External Factors
Uncertainties:
♦ Will outside influences jeopardize the project?
Offsets:
♦ Contingency
♦ Robust design
Project Risk Categories
Typical Technical Risk Sources
• Physical properties
• Material properties
• Radiation properties
• Testing/Modeling
• Integration/Interface
• Software design
• Safety
• Requirement changes
• Fault detection
• Operating environment
• Proven/Unproven technology
• System complexity
• Unique/Special resources
• COTS performance
• Embedded training
Typical Programmatic Risk Sources
• Material availability
• Personnel availability
• Personnel skills
• Safety
• Security
• Environmental impact
• Communication problems
• Labor strikes
• Requirement changes
• Stakeholder advocacy
• Contractor stability
• Funding continuity and profile
• Regulatory changes
Typical Supportability Risk Sources
• Reliability and maintainability
• Training
• Operations and support
• Manpower considerations
• Facility considerations
• Interoperability considerations
• System safety
• Technical data
Typical Cost Risk Sources
• Sensitivity to technical risk
• Sensitivity to programmatic risk
• Sensitivity to supportability risk
• Sensitivity to schedule risk
• Labor rates
• Estimating error
Typical Schedule Risk Sources
• Sensitivity to technical risk
• Sensitivity to programmatic risk
• Sensitivity to supportability risk
• Sensitivity to cost risk
• Degree of concurrency
• Number of critical path items
• Estimating error