2002 Annual RELIABILITY and MAINTAINABILITY Symposium i
Fault Tree Analysis in Product Reliability Improvement
Milena Krasich, P.E.
Milena Krasich, PE; Bose Corporation;
MS 450; The Mountain, Framingham MA 01701-7330 USA e-mail: [email protected].
SUMMARY & PURPOSE
This tutorial introduces the use of the well-known technique of Fault Tree Analysis as a tool in reliability modeling and analysis of an electronic or mechanical design (including software), identification of potential failure modes that are high contributors to unreliability, and tradeoffs and mitigation of those failure modes. Applied early in the product design phase, this activity allows for relatively inexpensive and easy design and manufacturing process improvements and, in that manner, achieves considerable improvement of the product reliability before the design is completed or the product is manufactured. A real example of this analysis as applied to audio products is discussed along with the achieved reliability improvement.
Milena Krasich
Milena Krasich is the Senior Technical Lead of Reliability Engineering in Design Assurance Engineering of Bose Corporation. Before joining Bose, she was a Member of Technical Staff in the Reliability Engineering Group of General Dynamics Advanced Technology Systems, formerly Lucent Technologies, and prior to that, she worked for the Jet Propulsion Laboratory in Pasadena, California. While in California, she was a part-time professor at the California State University Dominguez Hills, where she taught graduate courses in System Reliability, Advanced Reliability and Maintainability, and Statistical Process Control. At that time, she was also a part-time professor at the California State Polytechnic University, Pomona, teaching undergraduate courses in Engineering Statistics, Reliability, Environmental Testing, Production Systems Design, Measurements, and Materials Procurement. She holds a BS and MS in Electrical Engineering from the University of Belgrade, Yugoslavia, and is a California registered professional electrical engineer. She is also a member of the IEEE and ASQC Reliability Society, a Fellow and the past president of the Institute of Environmental Sciences and Technology, and a member of the College of Fellows of the Institute for Advancement of Engineering. Currently, she is a US Delegate to the International Electrotechnical Commission (IEC), working on dependability/reliability standards, and is a project leader for revision of international standards for reliability growth.
Table of Contents
1. INTRODUCTION
1.1 Notation and Acronyms
2. Reliability Improvement
2.1 Reliability Definitions Related to This Tutorial
3. Fault Tree Analysis and Its Use
3.1 Fault Tree – Introduction
3.2 System Analysis Methodology
3.3 Building of a Fault Tree
3.4 Contribution of Manufacturing Defects
3.5 Origin of Values for the Basic Events
4. Failure Mode Detection and Mitigation
5. Summary and Conclusions
6. References and Bibliography
7. Attachment - Tutorial Visuals
1. INTRODUCTION
Multiple methods have been used for the estimation of product reliability over the many decades that reliability has been applied as a science. Many reasons, such as product criticality (medical devices, defense systems, transportation) or the need for competitiveness in consumer industry, dictate the need for products with remarkably high reliability.
Design alone, regardless of its features and technology, does not guarantee product reliability. A design team, conscious of good and reliable design methods such as proper component derating and ESD and EMI protection, may not be completely aware of all of the aspects of reliability modeling and potential reliability shortfalls. This is especially the case when a product must be designed to operate in multiple environments, or when the specifics of component reliability (such as the dependency of component reliability on applied stresses) are not well understood. Therefore, the reliability of a completed design may not be as required or as expected.
In the past, attempts to improve product reliability were concentrated on various types of the Failure Mode and Effects Analyses (FMEA), and/or on the dedicated Reliability Growth test programs. Both of those methods applied individually or in conjunction, even though useful, may not be cost effective or applicable.
The first method, FMEA, is a valuable but very comprehensive attempt to identify the potential failure modes and to assure their mitigation. Starting from the bottom and going up, the analysis addresses each component (electrical or mechanical), the modes in which it might fail, and the effects that those failure modes might have on higher level assemblies and the system. The process is very tedious and is often completed well after the design is finished and the production period has begun. This might be too late to accomplish any measurable improvements without major expenses for redesign, new PC board layouts, and new tooling. In addition, any type of FMEA normally does not produce a measure of overall product reliability, thus any achieved reliability improvement is also not measurable. One type of FMEA has a Risk Priority Number (RPN) associated with it; however, this number is a product of three numbers (from 1 to 10), one assigned to each of Severity, Occurrence, and Detection. Regardless of the strict rules applied in estimation of these numbers, they are still only estimates, and thus might be subjective. Another FMEA type that includes criticality computation (FMECA) requires knowledge of failure rates; therefore, it cannot be applied for analysis of systems with components where failure probabilities, not failure rates, are a far better attribute. These analyses also do not provide reliability estimates.
Test methods for reliability improvement are even more costly, keeping in mind that they are performed on pre-production or production runs, meaning that the design is mature. In addition, the test units might be complex and expensive so that only a limited number might be available for testing.
Fault Tree Analysis combines many favorable aspects:
• It is timely and, therefore, low cost
• It is fast and easy to use
• It provides realistic reliability estimates at the same time with the failure mode analysis
• It measures achieved reliability improvement and the final reliability of a product.
1.1 NOTATION AND ACRONYMS
λ(t) - Component failure rate, instantaneous failure rate
λ - Component failure rate if assumed to be constant
ESD - Electrostatic Discharge
EMI - Electromagnetic Interference
FTA - Fault Tree Analysis
FMEA - Failure Mode and Effects Analysis
FMECA - Failure Mode Effects and Criticality Analysis
RPN - Risk Priority Number
MTTF - Mean Time to Failure
MTBF - Mean Time Between Failures
IEC - International Electrotechnical Commission
Q(t) - Unreliability as a function of time
Q - Unreliability assumed constant or calculated for a predetermined time
Pr - Probability
Pr(c) - Probability of occurrence of a cut set
FET - Field Effect Transistor
IC - Integrated Circuit
R - Reliability
F - Probability of failure (unreliability)
CODEC - Coder/Decoder
PRF - Part Random Failure
PCB - Printed circuit board
IEV - International Electrotechnical Vocabulary
2. RELIABILITY IMPROVEMENT
Reliability improvement can be undertaken and achieved in different phases of the product life:
• Design phase
• Product validation phase – test reliability growth
• During its fielded life
The first option, design phase, offers the most cost effective opportunities for product reliability improvement. Before design is finalized, even considerably involved changes do not pose a great expense, other than design time. If design improvements are not excessively extensive, necessary changes can often be painlessly done. Then the rest of product preparation (such as layout of printed circuit boards, tooling, component procurement) can be done without interruption or modifications. In the design phase, reliability improvements are achieved by identification of potential design deficiencies or potential manufacturing problems/defects that may compromise reliability of a design. Some potential design flaws that are likely to be identified are as follows:
• Electrical or mechanical overstress of components
• Components inadequate to be used in that design (unreliable or improperly used)
• Potential relationship between failures, that is, secondary failures caused by occurrence of another failure or by the presence of an environmental stress
• Parts of inferior quality (reliability) as built by their respective manufacturers.
2.1 RELIABILITY DEFINITIONS RELATED TO THIS TUTORIAL
To assure proper understanding of the terms as they are used in this tutorial, some reliability definitions are included. These are as follows:
Reliability – probability that an item can perform a required function under given conditions for a given time interval (IEV 191-12-01). Here, the required function is defined by expected performance that may vary depending on the use of the item and of the expectations. For a high-fidelity stereo audio/video product, the expectations are, for example, no audible noise or distortion. For a mechanical device, a pipe or an underwater connector housing, the expected performance would be that there is no bending greater than a predefined angle under some expected force. The measures for reliability or its complement, unreliability, would be probability of survival past the end of a predetermined period, or probability of failure before the end of a predetermined period, respectively. The measurement that is best understood by management is the percent of items surviving a time period (life or warranty).
Failure – the termination of the ability of an item to perform a required function (IEV: 191-04-01). A failure can be classified as a failure of the hardware to operate properly due to:
• Design failure – a failure due to the inadequate design of an item to withstand operational and/or environmental stresses, or due to the use of an improper part
• Manufacturing defect causing time-related failures that compromise design reliability
• Software interactions with hardware
A failure of an item can also be attributed to a fault in the software code – a failure of the software design.
Failure Cause – the circumstances during design, manufacture, or use which have led to a failure (IEV:191-04-01)
Failure Mechanism – the physical, chemical, or other process which led to a failure. An example would be crack propagation through the dielectric of a ceramic capacitor causing the capacitor to develop a small resistance and ultimately a short circuit.
Failure Mode – manner or state in which an item or a component might fail. Examples of failure modes are:
• Low or no output from an IC
• Separation of the IC packaging material
• Capacitor fails short due to crack propagation
• Resistor fails open due to the poor welding of the connections
• FET saturates and overheats
• Seal leaks, etc.
One failure mode can have multiple causes. Examples of those are:
IC enclosure fails due to one or more of the following:
• high humidity
• high temperature
• thermal cycling
• IC manufacturing process
Capacitor short:
• electrical overstress
• high temperature, use or soldering
• vehicle vibration
A seal in an underwater cable connector may leak due to:
• water pressure causing dilatation of the material
• cold temperature
• wearout from mating and de-mating of the connector
• defect in manufacturing – undersize
3. FAULT TREE ANALYSIS AND ITS USE
A fault tree is used as a Boolean representation of a product design: a system, its assemblies and functions, failure modes, and their respective causes. Fault tree analysis serves several purposes in the analysis of a design. One of its applications is modeling of the product’s architecture and functionality in a top-down manner, searching for potential failure modes and their causes that might produce an unfavorable outcome defined as a product failure. It also quantitatively estimates the reliability of an item and its assemblies. Based on this information, one can identify those failure modes that are the highest contributors to the product’s unreliability and follow the investigation down to identify their respective causes. This allows for tradeoff and mitigation of those potential failure modes, and finally, evaluation of the achieved reliability improvement.
3.1 FAULT TREE – INTRODUCTION
A fault tree is a logic diagram that represents functional dependencies of parts of a system. The top gate represents the unfavorable outcome of the system, and all other unfavorable outcomes that contribute to the system failure are represented as gates, logically connected to the top gate.
Components of a fault tree are:
• Gates, which are outcomes of one or a combination of input events or other gates
• Cut sets, which are groups of outcomes or events that, if they occurred, would cause a system failure. A minimal cut set contains the minimum number of events that are required for a failure outcome. The removal of one of them would result in the system surviving.
Types of events and gates along with their definitions and graphical representations are shown in Table 3.1.

Table 3.1. Graphical Representation and Definitions of Gates and Events

| FTA Symbol | Symbol Name | Description | Reliability Model | Inputs |
|---|---|---|---|---|
| [graphic] | BASIC EVENT | Basic event for which reliability information is available | Component failure mode, or a failure mode cause | 0 |
| [graphic] | CONDITIONAL EVENT | Event that is a condition of occurrence of another event when both must occur for the output to occur | Occurrence of event that must occur for another event to occur; conditional probability | 0 |
| [graphic] | DORMANT EVENT | A basic event that represents a dormant failure | Dormant component failure mode or dormant failure cause | 0 |
| [graphic] | UNDEVELOPED EVENT | A part of the system that has yet to be developed (defined) | A contributor to the probability of failure. Structure of that system part is not yet defined | 0 |
| [graphic] | TRANSFER GATE | Gate indicating that this part of the system is developed in another part or page of the diagram | A partial reliability block diagram that is shown in another location of the overall system | 0 |
| [graphic] | OR GATE | The output event occurs if any of its input events occur | Failure occurs if any of the parts of that system fails (series system) | ≥ 2 |
| [graphic] | MAJORITY VOTE GATE | The output occurs if m of the inputs occur | Redundancy k out of n, where m = n-k+1 | ≥ 3 |
| [graphic] | EXCLUSIVE OR | The output event takes place if one, but not the other, input occurs | A failure of the system occurring only if one, not both, of the two possible failures happens | 2 |
| [graphic] | AND GATE | The output event takes place if all of the input events occur | Parallel redundancy, one out of n equal or different branches | ≥ 2 |
| [graphic] | PRIORITY AND | The output event (failure) occurs only if the input events occur in sequence from left to right | Good for representation of secondary failures or for an enabling sequence of events | ≥ 2 |
| [graphic] | INHIBIT GATE | The output occurs only if both of the input events take place, one of them conditional | Conditional probability of occurrence of the final event | 2 |
| [graphic] | NOT GATE | The outcome is present only if the input event does not occur | Exclusive events, or a preventive measure does not take place | 1 |
3.2 SYSTEM ANALYSIS METHODOLOGY
3.2.1 Classical System Reliability Analysis
When a system is “complex” with regard to its modeling, that is, if it contains many interlocked or common branches, standard modeling can become extremely cumbersome, lengthy, and subject to mathematical (computational) errors. An example of a simple, yet “complex” bridge circuit is shown in Figure 3.2-1.
Figure 3.2-1. Bridge Circuit
In the bridge circuit above, the signal must flow from input A to output B. It can flow through block 3 in both directions. The analytical solution is to model the system under two conditions. If block 3 is good, the signal flows through blocks 1 or 2 and then through blocks 4 or 5, as if each pair were parallel blocks. If block 3 is bad (the condition that 3 has failed), blocks 1 and 4 are in series, in parallel with blocks 2 and 5, which are also in series. This would be represented with the following equation:
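$$R_s = \left[(R_1 + R_2 - R_1 \cdot R_2) \cdot (R_4 + R_5 - R_4 \cdot R_5)\right] \cdot R_3 + \left(R_1 \cdot R_4 + R_2 \cdot R_5 - R_1 \cdot R_4 \cdot R_2 \cdot R_5\right) \cdot (1 - R_3)$$
$$R_s = 0.991$$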
When a system contains a multitude of “complex” systems of different kinds, the algebraic representation becomes rapidly too involved and cumbersome to solve. In addition, these complex equations need to contain a multitude of conditional probabilities to account for environmental effects and secondary failures. This only adds to already extensive complexity of the calculations.
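The conditional decomposition on block 3 described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's tool; the function name and the block reliabilities (0.98, 0.95, 0.92, 0.975, 0.70) are assumptions chosen for the example, and block failures are assumed independent.

```python
def bridge_reliability(r1, r2, r3, r4, r5):
    """Bridge reliability by conditioning on block 3 (independent blocks assumed)."""
    # Block 3 good: (1 parallel 2) in series with (4 parallel 5).
    block3_good = (r1 + r2 - r1 * r2) * (r4 + r5 - r4 * r5)
    # Block 3 failed: (1 in series with 4) parallel to (2 in series with 5).
    block3_bad = r1 * r4 + r2 * r5 - r1 * r4 * r2 * r5
    return block3_good * r3 + block3_bad * (1.0 - r3)


# Assumed block reliabilities, for illustration only.
rs = bridge_reliability(0.98, 0.95, 0.92, 0.975, 0.70)
print(round(rs, 3))
```

With these assumed inputs the result rounds to 0.991. Even for five blocks the expression already needs careful bookkeeping, which is the point made in the paragraph above.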
3.2.2 System Reliability Analysis Using a Fault Tree The “complex” system shown in Figure 3.2-1 can be
easily modeled using Boolean algebra with fault tree or success tree representation.
Cut sets in this system would be made of the following combinations:
• Blocks 1 and 2 (c1 = 1,2)
• Blocks 4 and 5 (c2 = 4,5)
• Blocks 1, 3, and 5 (c3 = 1,3,5)
• Blocks 2, 3, and 4 (c4 = 2,3,4)
Should all blocks in any of the above combinations fail, the signal flow from A to B will be interrupted.
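The four cut sets listed above can be cross-checked by brute force. The sketch below is an illustration only: it assumes the four signal paths implied by the bridge description (1-4, 2-5, 1-3-5, and 2-3-4) and searches for the smallest groups of blocks whose failure breaks every path. The names PATHS, BLOCKS, system_fails, and minimal_cut_sets are hypothetical.

```python
from itertools import combinations

# Assumed signal paths from A to B; block 3 can carry the signal in both directions.
PATHS = [{1, 4}, {2, 5}, {1, 3, 5}, {2, 3, 4}]
BLOCKS = [1, 2, 3, 4, 5]


def system_fails(failed):
    """The system fails when every path contains at least one failed block."""
    return all(path & failed for path in PATHS)


def minimal_cut_sets():
    """Enumerate block combinations by increasing size and keep the minimal ones."""
    cuts = []
    for size in range(1, len(BLOCKS) + 1):
        for combo in combinations(BLOCKS, size):
            candidate = set(combo)
            # A superset of a known cut set is a cut set, but not a minimal one.
            if any(cut <= candidate for cut in cuts):
                continue
            if system_fails(candidate):
                cuts.append(candidate)
    return cuts


print(minimal_cut_sets())
```

Exhaustive search is only practical for small systems; it is used here because it makes the definition of a minimal cut set explicit.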
With Boolean algebra, the probability of the system failure would be:
$$F_S = \Pr(c_1 \cup c_2 \cup c_3 \cup c_4)$$
Probability of the cut set 1 is:
$$\Pr(c_1) = F_1 \cdot F_2 = (1 - R_1) \cdot (1 - R_2)$$
The correct calculation (Esary-Proschan) is then:
$$F_S = \Pr(c_1 \cup c_2 \cup c_3 \cup c_4) = 1 - [1 - \Pr(c_1)] \cdot [1 - \Pr(c_2)] \cdot [1 - \Pr(c_3)] \cdot [1 - \Pr(c_4)]$$
With the RARE event approximation, this calculation would be:
$$F_S = \Pr(c_1) + \Pr(c_2) + \Pr(c_3) + \Pr(c_4)$$
$$F_S = F_1 \cdot F_2 + F_4 \cdot F_5 + F_1 \cdot F_3 \cdot F_5 + F_2 \cdot F_3 \cdot F_4$$
While easy to implement, the RARE approximation may introduce sizeable errors into calculations when the failure probabilities are large. A failure probability larger than a multiple of $10^{-2}$ will produce an unwanted error. This is shown in the example below:
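$$F_1 = 2 \cdot 10^{-2}, \quad F_2 = 5 \cdot 10^{-2}, \quad F_3 = 8 \cdot 10^{-2}, \quad F_4 = 2.5 \cdot 10^{-2}, \quad F_5 = 3 \cdot 10^{-1}$$
Esary-Proschan:
$$F_S = 1 - (1 - F_1 \cdot F_2) \cdot (1 - F_4 \cdot F_5) \cdot (1 - F_1 \cdot F_3 \cdot F_5) \cdot (1 - F_2 \cdot F_3 \cdot F_4)$$
$$F_S = 9.068 \cdot 10^{-3}$$
RARE:
$$F_{SR} = F_1 \cdot F_2 + F_4 \cdot F_5 + F_1 \cdot F_3 \cdot F_5 + F_2 \cdot F_3 \cdot F_4$$
$$F_{SR} = 9.08 \cdot 10^{-3}$$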
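The numerical comparison above can be reproduced with a short script. This is a sketch under the assumption of independent block failures, using the five failure probabilities of the example. The exhaustive state enumeration in exact() is added here only as a cross-check and is not part of the original example; because the four cut sets share blocks, both the Esary-Proschan and the RARE figures come out slightly above it.

```python
from itertools import product
from math import prod

# Block failure probabilities from the example above.
F = {1: 2e-2, 2: 5e-2, 3: 8e-2, 4: 2.5e-2, 5: 3e-1}
CUT_SETS = [(1, 2), (4, 5), (1, 3, 5), (2, 3, 4)]


def cut_set_probability(cut):
    """Probability that every block in the cut set has failed."""
    return prod(F[block] for block in cut)


def esary_proschan():
    """1 minus the product of (1 - Pr(c_i)) over all cut sets."""
    return 1.0 - prod(1.0 - cut_set_probability(c) for c in CUT_SETS)


def rare_event():
    """Sum of the cut set probabilities."""
    return sum(cut_set_probability(c) for c in CUT_SETS)


def exact():
    """Cross-check: add up every block state in which some cut set has fully failed."""
    blocks = sorted(F)
    total = 0.0
    for state in product([False, True], repeat=len(blocks)):  # True means failed
        failed = {b for b, is_failed in zip(blocks, state) if is_failed}
        if any(set(c) <= failed for c in CUT_SETS):
            total += prod(F[b] if b in failed else 1.0 - F[b] for b in blocks)
    return total


print(f"Esary-Proschan: {esary_proschan():.4e}")
print(f"RARE:           {rare_event():.4e}")
print(f"Enumeration:    {exact():.4e}")
```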
Software packages commercially available for FTA are based on Boolean algebra, and most of them contain the constant failure rate model for unavailability:
$$Q(t) = \frac{\lambda}{\lambda + \mu} \cdot \left(1 - e^{-(\lambda + \mu) \cdot t}\right)$$
If the time to repair (MTTR) is considered infinite (non-repairable items), then µ = 0, and:
$$Q(t) = F(t)$$
Other information that can be obtained with FTA software is:
• Failure frequency (hazard rate) of all gates
• Number of expected failures during the predetermined time
• Unavailability or probability of failure of the system at any gate
• Gate summaries in various forms
• Confidence intervals
• Sensitivity analysis
• Calculations using distributions other than exponential
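The constant failure rate unavailability model given earlier in this subsection can be written as a small function. This is a hedged sketch rather than the behavior of any particular FTA package; the function name and the numbers in the usage lines (a failure rate of 1e-6 per hour, a repair rate of 0.1 per hour, 10,000 hours) are assumptions for illustration.

```python
import math


def unavailability(lam, mu, t):
    """Q(t) = lam / (lam + mu) * (1 - exp(-(lam + mu) * t)) for constant rates."""
    total_rate = lam + mu
    if total_rate == 0.0:
        return 0.0  # no failures and no repairs: the item never becomes unavailable
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for very small x.
    return lam / total_rate * -math.expm1(-total_rate * t)


# Non-repairable item (mu = 0): Q(t) reduces to F(t) = 1 - exp(-lam * t).
q_non_repairable = unavailability(1e-6, 0.0, 10_000.0)

# Repairable item: Q(t) settles near lam / (lam + mu).
q_repairable = unavailability(1e-6, 0.1, 10_000.0)

print(q_non_repairable, q_repairable)
```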
The circuit from Figure 3.2-1 represented by FTA is shown in Figure 3.2-2.
Figure 3.2–2. FTA Diagram of the Bridge Circuit
Different gates of a fault tree are used to represent different circuit models as shown in the following examples:
Example 1: Combination of series and redundant blocks (events)
The reliability block diagram of this combination is shown in Figure 3.2-3.
[Figure 3.2-3 block labels: F1, F2, three blocks F3 (2 out of 3), Gate 1, Gate 2, Top Gate]
Figure 3.2-3. Series-Parallel Circuit Configuration
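The corresponding equations are as follows:
F1 = 0.002, F2 = 0.0005, F3 = 0.0032, n = 3, m = 2
$$F_{Gate1} = 1 - (1 - F_1) \cdot (1 - F_2)$$
$$F_{Gate2} = \sum_{i=2}^{3} \frac{n!}{i! \cdot (n-i)!} \cdot F_3^{\,i} \cdot (1 - F_3)^{(n-i)}$$
$$F_{TopGate} = 1 - \left[(1 - F_{Gate1}) \cdot (1 - F_{Gate2})\right]$$
$$F_{TopGate} = 2.53 \cdot 10^{-3}$$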
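The arithmetic of Example 1 can be checked with the following sketch. The helper names or_gate and k_of_n_fail are hypothetical; the inputs are the values given above, and independence of the events is assumed.

```python
from math import comb


def or_gate(*probs):
    """Probability that at least one independent input event occurs."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


def k_of_n_fail(f, n, m):
    """Binomial probability that at least m of n identical blocks fail."""
    return sum(comb(n, i) * f**i * (1.0 - f) ** (n - i) for i in range(m, n + 1))


f1, f2, f3 = 0.002, 0.0005, 0.0032
gate1 = or_gate(f1, f2)            # series blocks F1 and F2
gate2 = k_of_n_fail(f3, n=3, m=2)  # three redundant F3 blocks, failure if 2 fail
top = or_gate(gate1, gate2)
print(f"{top:.3e}")
```

The result agrees with the 2.53e-3 quoted for the top gate.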
The FTA representation of the reliability block diagram in Figure 3.2-3 is shown in Figure 3.2-4
Figure 3.2-4. FTA Representation of a Series-Parallel Reliability Block Diagram
With different redundant blocks (compare Figures 3.2-3 and 3.2-4), the inputs to the redundant gate are different events, F3, F4, and F5, instead of the repeated F3, and the calculations are done in a similar way (binomial). The three different redundant blocks are shown with Example 2 of the conditional probability, where Gate 2 has three different inputs representing the three redundant blocks (Figure 3.2-5).
Example 2: Use of a priority gate. The event F2 will occur only if the event F1 has occurred (conditional probability). The equivalent fault tree is shown in Figure 3.2–5.
Figure 3.2– 5 Example of a Priority Gate (Gate 1)
The associated mathematics is as follows: n = 3, m = 2
F1 = 0.002, F2 = 0.0005, F3 = 0.00045, F4 = 0.00053, F5 = 0.0032
$$F_{Gate1} = F_1 \cdot F_2$$
$$F_{Gate2} = F_3 \cdot F_4 + F_3 \cdot F_5 + F_4 \cdot F_5$$
$$F_{TopGate} = 1 - [1 - F_{Gate1}] \cdot [1 - F_{Gate2}]$$
$$F_{TopGate} = 4.374 \cdot 10^{-6}$$
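A short script confirms the arithmetic of Example 2. It follows the equations above, in which the priority gate is evaluated as a simple product and Gate 2 as the sum of the pairwise products. The variable gate2_exact is an addition for comparison only: it counts the states in which at least two of the three different blocks fail, assuming independence, to show how little the pairwise sum overstates the result at these probabilities.

```python
f1, f2, f3, f4, f5 = 0.002, 0.0005, 0.00045, 0.00053, 0.0032

# Gate 1: F2 occurs only if F1 has occurred, evaluated as a product.
gate1 = f1 * f2

# Gate 2 as written in the text: sum of the pairwise products.
gate2 = f3 * f4 + f3 * f5 + f4 * f5

# Comparison only: at least two of three different blocks fail.
gate2_exact = (
    f3 * f4 * (1 - f5)
    + f3 * f5 * (1 - f4)
    + f4 * f5 * (1 - f3)
    + f3 * f4 * f5
)

top = 1 - (1 - gate1) * (1 - gate2)
print(f"gate2 = {gate2:.4e}, gate2_exact = {gate2_exact:.4e}, top = {top:.4e}")
```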
Example 3: A real-life example of a priority gate is the analysis of a switching amplifier, where on all four outputs (+1, -1, +2, and -2) there are four switching FETs, followed by noise and EMI filtering. For the FETs to operate properly (in the switching mode), the Logic Ground (LGround) must be maintained at a certain voltage. This voltage is 5 V, maintained by a voltage regulator filtered by two ceramic capacitors. Should the LGround voltage decrease below 2 V, the FETs will start operating in linear mode and will then saturate. This condition not only constitutes a failure, but could eventually cause the FET to overheat. This voltage would decrease in the event that one of the voltage filtering capacitors developed a small resistance, close to a short. Here, LGround below 2 V is the condition for the FETs to saturate and overheat.
In the old design, the voltage-filtering capacitors had a dielectric with Y5V characteristics, which has a higher concentration of voids and could develop and propagate a crack more easily than other ceramics (especially in harsher environments such as the one that this analysis was performed for). This characteristic, along with the less than adequate voltage rating, contributed to a relatively high projected probability of failure for the specified lifetime. With replacement of both of the voltage filtering capacitors with those having a dielectric with X7R characteristics and a higher voltage rating, the 10-year probability of occurrence of FET overheat was reduced from 2.0969E-3 (per FET) to 1.0009E-4, which was an improvement by a factor of approximately 21.
The original circuit, as modeled with the fault tree is shown in Figure 3.2-6.
[Figure 3.2-6 content: events and gates with their probabilities Q]

| Event / gate | Q | Description |
|---|---|---|
| FET4 OVERHEAT | 2.0969e-3 | Overheat of FET due to LGND < 2V |
| FET 4 SATURATION | 3.5000e-1 | FET saturates due to LGND < 2V |
| LGROUND | 5.9911e-3 | LGND shorted to ground causing improper FET bias and overheat |
| Short C905 | 3.0001e-3 | Ceramic capacitor shorts to ground |
| Short_C906 | 3.0001e-3 | Capacitor short brings Lground to the ground |
| MFG_SHORT_C905 | 7.0000e-8 | Manufacturing defects cause a short |
| MFG_Short_C906 | 7.0000e-8 | Manufacturing defects cause a short |
| PRF_SHORT_C6905 | 3.0000e-3 | Capacitor shorts due to part random failure |
| PRF_C6906 | 3.0000e-3 | Capacitor fails due to part random failure |
| SOLDER SHORT_C6905 | 5.0000e-8 | Excessive solder causing a short between the pins or pads |
| SOLDER SHORT_C6906 | 5.0000e-8 | Excessive solder causing a short between the pins or pads |
| DEBRIS_C6905 | 2.0000e-8 | Debris on the PCB causing a short |
| DEBRIS_C6906 | 2.0000e-8 | Debris on the PCB causing a short |
| DANDREIC SHORT | 4.6875e-13 | Electrolyte mixed with debris causing a short |
| EL. CAP LEAK | 3.1250e-6 | Short caused by leaking of the nearby capacitor |
| AGEING | 1.0000e-6 | Capacitor leaking electrolyte due to ageing |
| HI-TEMP | 1.2500e-7 | Electrolyte leak due to high temperature |
| HI_HUMIDITY | 2.0000e-6 | Electrolyte leak due to high humidity |
| DEBRIS | 1.5000e-7 | Presence of debris on the board |
Figure 3.2-6. Practical Example of a Priority Gate
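The Q values printed in Figure 3.2-6 can be reproduced from the basic events with a few gate functions. The gate structure used below is an assumption inferred from the numbers in the figure, not a reading of the original drawing, and the placement of the dendritic short under the C906 branch is likewise a guess that has no visible effect at the printed precision. Inputs are assumed independent.

```python
def or_gate(*probs):
    """Probability that at least one independent input event occurs."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


def and_gate(*probs):
    """Probability that all independent input events occur."""
    result = 1.0
    for p in probs:
        result *= p
    return result


# Basic event values as printed in Figure 3.2-6.
el_cap_leak = or_gate(1.0e-6, 1.25e-7, 2.0e-6)   # AGEING, HI-TEMP, HI_HUMIDITY
dendritic = and_gate(el_cap_leak, 1.5e-7)        # EL. CAP LEAK and DEBRIS

# Assumed structure: capacitor short = manufacturing short OR part random failure.
mfg_c905 = or_gate(5.0e-8, 2.0e-8)               # solder short, debris on PCB
mfg_c906 = or_gate(5.0e-8, 2.0e-8, dendritic)
short_c905 = or_gate(mfg_c905, 3.0e-3)
short_c906 = or_gate(mfg_c906, 3.0e-3)

lground = or_gate(short_c905, short_c906)        # either capacitor shorting
fet_overheat = and_gate(lground, 0.35)           # conditional on FET saturation

# Improvement quoted in the text for the X7R replacement.
improvement = 2.0969e-3 / 1.0009e-4

print(f"LGROUND = {lground:.4e}, FET4 OVERHEAT = {fet_overheat:.4e}")
print(f"improvement factor = {improvement:.2f}")
```

Under these assumptions the script returns the LGROUND and FET4 OVERHEAT values shown in the figure, and the ratio of the old to the new probability of FET overheat comes out close to 21.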
Example 4. Use of an inhibit gate is shown in Figure 3.2-7. With the inhibit gate, for the outcome to constitute a failure, all of the input events (in our case three) must take place. A practical example of this modeling would be the connection of three EMI filtering capacitors. If a failure mode is defined as no filtering, all of the three would have to fail.
Figure 3.2-7. Example of an Inhibit Gate
3.3 BUILDING OF A FAULT TREE

Building a fault tree of a product (a system made of subsystems, assemblies, and components) is a top-down process where, as a first step, one must define what constitutes the failure of that product. For a high quality audio amplifier, anything that the end user might hear and qualify as degraded performance constitutes the system failure.
The next step is to outline the system architecture and the major functions such as:
• Power supply
• Video amplifier
• Audio amplifier
The further analysis going down determines what phenomena preclude proper operability of those parts or functions, i.e.:
• Shorted line voltage or no VCC supplied
• No video processing
• One or more audio channels not operational
More detailed analyses further determine the causes of those phenomena, contributing factors, down to the causes of failure modes such as:
• Electrical overstress
• High humidity
• High temperature
A detailed example of how a fault tree analysis is done is shown in another real-life example, an analog input to an analog-to-digital converter of an audio amplifier. The partial circuit of this amplifier is shown in Figure 3.3-1. This part of the amplifier is normally known as the CODEC, as analog input signals are converted into a digital signal, and then again into a linear output. The signals are directed into an IC that is an analog-to-digital converter.
For the amplifier to be operational, all signals have to be processed by the CODEC, meaning that they have to be coded and decoded. The input signal 1+ into the left channel of IC U20 is interrupted if:
• R200, R209, or C171 fail open
• C179 shorts to ground, shorting the signal to ground
The input signal 2+ into the right channel of U20 is interrupted if:
• R201, R205, or C172 fail open
• C177 shorts the signal to ground
The entire circuit will not work if no voltage is supplied to the analog input (pin 8):
• R206 or R208 fail open, interrupting the supply of 2.3 V
Figure 3.3-1 Input into CODEC of an Audio Amplifier
The signal will be too noisy if C183 fails open (low frequency noise), or C181 fails open (high frequency noise). Other contributors to the failure are the lack of data inputs, which will not be considered in this example.
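The failure logic described for the CODEC input can be written down as plain Boolean checks before any probabilities are assigned. The sketch below uses only the component failure modes named in the text. Treating the top event as the OR of the four effects is an assumption made for illustration (the data inputs are left out, as in the text), and the names of the sets and of the function are hypothetical.

```python
# Failure modes named in the text, written as (component, mode).
INPUT_1_LOST = {("R200", "open"), ("R209", "open"), ("C171", "open"), ("C179", "short")}
INPUT_2_LOST = {("R201", "open"), ("R205", "open"), ("C172", "open"), ("C177", "short")}
NO_ANALOG_VOLTAGE = {("R206", "open"), ("R208", "open")}
NOISY_SIGNAL = {("C183", "open"), ("C181", "open")}


def codec_input_effects(failed):
    """Map a collection of (component, mode) failures to the effects they trigger."""
    failed = set(failed)
    effects = {
        "input 1+ interrupted": bool(failed & INPUT_1_LOST),
        "input 2+ interrupted": bool(failed & INPUT_2_LOST),
        "no 2.3 V analog voltage": bool(failed & NO_ANALOG_VOLTAGE),
        "signal too noisy": bool(failed & NOISY_SIGNAL),
    }
    # Assumed top event: any one of the effects counts as a CODEC input failure.
    effects["top event"] = any(effects.values())
    return effects


print(codec_input_effects({("C179", "short")}))
```

Each set corresponds to an OR gate in the fault tree; probabilities can be attached to the individual failure modes once the logic has been reviewed.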
The top level of the FTA representation of this analysis is shown in Figure 3.3–2.
Circled in Figure 3.3-2 is the gate that needs to be developed for the analog inputs 1 and 2 described earlier. Figure 3.3-3 shows further development of that gate.
Figure 3.3-2 Top Level FTA of CODEC
Figure 3.3-3 Development of the FTA for inputs 1 and 2
Inputs 1 and 2 are then separately analyzed, and so are the noisy or no analog voltages. Development of Input 1 is shown in Figure 3.3-4. The circle points out the open components that are to be further developed. The fault tree part in Figure 3.3-4 also contains a gate that points to the possible lack of the 2.3V voltage. Capacitor C179, if failed short, would short the signal to the ground. There are two possible reasons for this capacitor to fail short. One is so called “part random failure”. This term takes into
consideration the environment that the capacitor is supposed to be exposed to (temperature, vibration) as well as the operational stresses that the capacitor will see, such as its operating voltage. Thus, the term “random failure” actually is not just a failure that will occur at random, but it describes the likelihood that a part will fail, if having an intrinsic defect, under given environmental and operational stresses.
Figure 3.3-4. Development of the FTA Down to Components and Their Failure Cause
3.4 CONTRIBUTION OF MANUFACTURING DEFECTS
Manufacturing defects causing time dependent failures are a vital contributor to product unreliability.
Some contributors to components failing open are:
• Cold or insufficient solder, which after a period of time, due to relaxation and fatigue, causes connections to open. Vibration of a vehicle will cause the cold soldered joint to open as well.
• Missing components
• Components cracked during insertion
• Broken or bent pins or leads
Contributors of manufacturing flaws to components failing short are:
• Debris (at times un-cleaned flux) left on the board
• Excessive solder
• Bent pins (mostly ICs and connectors) shorting to another pin.
Another reason for the capacitor failure (Figure 3.3-4) would be a failure (a short) caused by manufacturing defects. Normally during production, if a PC board is not properly cleaned, debris left on it will produce so-called dendritic growth, which, in turn, might cause a short between terminals. A second manufacturing defect causing an electrical short is a result of inadequate soldering technique, where excessive solder develops a bridge between the terminals and causes a short. Further development of the fault tree will point to other components failing open or short, causing failure of the analog power supply or interruption of the second signal.
3.5 ORIGIN OF VALUES FOR THE BASIC EVENTS
To be able to estimate the final (top gate) product reliability, each of the events must have information on its reliability assigned to it. This information may be attached in the form of a failure rate, MTTF, or probability of failure. For mixtures of hardware, mechanical and electrical, perhaps the most straightforward way would be to represent all the information in the form of a probability of failure calculated for a predetermined time and a predetermined operational profile.
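When the available information is a failure rate or an MTTF, it can be converted to a probability of failure for the predetermined time. The sketch below assumes a constant failure rate (exponential distribution), which is a modeling assumption and not a requirement of the method; the function names and the numbers in the usage lines (50 failures per billion hours, 87,600 hours) are hypothetical.

```python
import math


def failure_probability_from_rate(lam, hours):
    """F(t) = 1 - exp(-lam * t), assuming a constant failure rate lam (per hour)."""
    return -math.expm1(-lam * hours)


def failure_probability_from_mttf(mttf_hours, hours):
    """Same conversion, using lam = 1 / MTTF for a constant failure rate."""
    return failure_probability_from_rate(1.0 / mttf_hours, hours)


# Hypothetical component: 50e-9 failures per hour, 87,600 operating hours.
f_rate = failure_probability_from_rate(50e-9, 87_600.0)
f_mttf = failure_probability_from_mttf(1.0 / 50e-9, 87_600.0)
print(f"{f_rate:.4e} {f_mttf:.4e}")
```

Expressing every basic event this way lets electrical and mechanical contributors be combined in the same tree.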
For electrical components, data for event and failure mode probabilities comes from:
• Information from the manufacturers’ life testing, which needs to be recalculated for the proper environmental and electrical stresses
• Software databases (commercially available)
• Field use – field failure data information, which would be the very last resort because of many inconsistencies of data reporting and recording.
For mechanical components, probability of failure needs to be calculated based on:
• Stresses – loads, and their geometry and distribution
• Materials
• Construction (design) of parts, such as shape and size
• Attachment of parts to other structures (adhesives, fasteners)
Based on all the information, the safety margin needs to be calculated, which in turn will produce a reliability value.
For determination of a probability of occurrence of manufacturing defects, the approach may be two-fold. The probability associated with the manufacturing defects can come from factory or service data (field failure data). On the other hand, it is sometimes advisable to enter the required values into the reliability analysis, and then adjust the manufacturing process control to achieve this goal.
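For the mechanical case, one common way to turn a safety margin into a reliability value is stress-strength interference. The sketch below assumes independent, normally distributed strength and load; this is an assumption of the example, not something the text specifies. All names and numbers are hypothetical.

```python
import math


def safety_margin(strength_mean: float, strength_sd: float,
                  load_mean: float, load_sd: float) -> float:
    """Margin in standard deviations of (strength - load), assuming independence."""
    return (strength_mean - load_mean) / math.sqrt(strength_sd ** 2 + load_sd ** 2)


def reliability_from_margin(margin: float) -> float:
    """Standard normal CDF of the margin: probability that strength exceeds load."""
    return 0.5 * (1.0 + math.erf(margin / math.sqrt(2.0)))


# Hypothetical part: strength 500 +/- 40, applied load 350 +/- 30 (same units).
margin = safety_margin(500.0, 40.0, 350.0, 30.0)
reliability = reliability_from_margin(margin)
failure_probability = 1.0 - reliability
print(f"margin = {margin:.2f}, R = {reliability:.5f}, F = {failure_probability:.2e}")
```

The resulting failure probability can then be entered as the value of the corresponding basic event.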
4. FAILURE MODE DETECTION AND MITIGATION
In a completed or partially completed fault tree analysis of a system, when the probability of failure of the top-level gate has been calculated and reliability improvement is found to be necessary, the next step is to identify the highest contributor to unreliability (a failure mode or a cause) and improve the design. The process then continues with the search for the next highest contributor. An example of such reliability improvement is shown for a complex audio/video amplifier system. The top level of the system (the console) is shown in Figure 4-1. The Tuner is shown as an event because reference designator numbers are repeated between the bill of material of the system and that of the tuner. For that reason, the Tuner was analyzed separately, and its top-level unreliability is entered as an event.
Figure 4–1 Top Level Fault Tree of Console and its Major Subsystems
For the given warranty period, the original unreliability value is not acceptable, as 7,365 systems out of 100,000 made would need service before the end of their respective warranty periods. The highest contributor to unreliability is the block marked SPDIF. This gate was developed on page 13, as shown in Figure 4-2.
Figure 4–2 SPDIF Top Level Fault Tree
The search for the highest contributor to the SPDIF circuit unreliability leads to the part of the circuit that is an input to or output from the multiplexer. Further investigation leads to the SPDIF multiplexer, where the highest contributor is the IC U501 (Figure 4-3).
The high failure probability of this IC is related to its packaging (TSSOP). In another package, SOIC, this IC is a reliable part. There were 3 of these units in the console. It also was apparent that the probability of failure of capacitors C513 and C517 was too high for ceramic capacitors. This is because they used the Y5V dielectric. There were about 116 capacitors of this type in the console.
Figure 4-3 The Components which were the Highest Contributors to the Console Unreliability.
Once the design improvements were made, the console reliability was improved to the point of almost meeting its aggressive goal. The resultant improvement is shown in Figure 4-4.
Figure 4-4 Console Reliability Goal, Planned Growth Curve, and the Actual Reliability
[Figure 4-4 plot. Vertical axis: Console Reliability, 0.91 to 1. Horizontal axis: Duration of the design period (days), 0 to 300. Curves: Planned Reliability Growth and Achieved Reliability Growth. Console goal: R(1 year) = 0.992. Annotations on the achieved curve: Initially calculated; Transistors and FETs from a more reliable vendor; TSSOPs replaced by SOICs; Y5V caps replaced by X7R (116).]
5. SUMMARY AND CONCLUSIONS
The Fault Tree Analysis can be successfully used for identification and mitigation of potential failure modes that contribute to unreliability of a product.
The FTA allows pictorial representation of the system, its architecture and functionality, along with using Boolean algebra and the multitude of modeling schemes to best represent the system operation and interdependency of its failure modes. The FTA is used here to evaluate the individual failure mode contributions to the system unreliability and to arrive at the most viable solution for its reliability improvement. The methodology can be summarized as follows:
• Define what constitutes the system failure
• Start with the top level of the system with an unfavorable outcome that defines the system failure
• Construct the fault tree down, using logic to express reliability modeling techniques
• Follow the analysis down the fault tree to determine what assembly, signal, part, or manufacturing defect will cause a particular failure
• Develop the fault tree all the way down to the causes of pertinent failure modes
• Determine the respective probability of occurrence of individual causes. The software, when used for analysis, will roll up all information, producing the failure probability of the system, subsystems, and assemblies
• Identify those failure modes that are the highest contributors to unreliability and mitigate them
• Update the analysis, and monitor the resultant reliability improvement
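The roll-up and ranking steps can be sketched as follows. The event values are those printed in the tutorial visual "High or Low Frequency Noise into the CODEC". The OR-gate structure is an assumption of this sketch (it reproduces the printed gate value), and the ranking measure, the drop in top-gate probability when one event is set to zero, is an illustrative choice rather than the method of any particular FTA software.

```python
from math import prod


def or_gate(probabilities):
    """Output occurs if any independent input occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Basic-event probabilities as printed in the tutorial visual.
events = {
    "PRF_Open_El_C183": 6.18377e-5,
    "Cold solder_El_C183": 1.2e-8,
    "Missing_El_C183": 1e-9,
    "PRF_Open_C181": 1.75069e-4,
    "Cold solder_C181": 1.2e-8,
    "Missing_C181": 1e-9,
}


def top_gate(ev):
    """Assumed structure: each open capacitor = OR(part failure, OR(mfg defects))."""
    open_c183 = or_gate([ev["PRF_Open_El_C183"],
                         or_gate([ev["Cold solder_El_C183"], ev["Missing_El_C183"]])])
    open_c181 = or_gate([ev["PRF_Open_C181"],
                         or_gate([ev["Cold solder_C181"], ev["Missing_C181"]])])
    return or_gate([open_c183, open_c181])


baseline = top_gate(events)
# Contribution of an event: how much the top probability drops if it never occurs.
contribution = {name: baseline - top_gate({**events, name: 0.0}) for name in events}
ranking = sorted(contribution, key=contribution.get, reverse=True)

print(f"top gate Q = {baseline:.4e}")
for name in ranking:
    print(f"{name:22s} {contribution[name]:.3e}")
```

Mitigating the first entry of `ranking` and re-running the roll-up corresponds to the "identify, mitigate, update" loop in the list above.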
Failure mode analysis with fault trees can be started with the start of a project, and updated as more detailed information becomes available.
There is no need to derive failure rates as the reliability measure for all components, electrical, mechanical, and software. Fault tree modeling allows a mixture of various kinds of information (failure probability, different failure distributions), and, unlike classical reliability predictions, does not require that every input be expressed as a failure rate.
Modeling and reliability assessment of a product or system with the fault tree analysis allows for timely design improvements while design changes are still possible, feasible and inexpensive. This methodology is also described in the draft IEC standards IEC 60300-3-1, Dependability management, Part 3: Application guide, Section 1: Analysis techniques for dependability; Guide on methodology, and IEC 61014, Reliability growth methods. The first standard is in its last draft for comments and vote. The second is also a draft, in circulation for comments.
6. REFERENCES AND BIBLIOGRAPHY
1. Joanne Bechta Dugan, “Fault-Tree Analysis of Computer-Based Systems”, 1999 Tutorial Notes, Reliability and Maintainability Symposium, Washington, DC.
2. Kiran Kumar Vemuri and Joanne Bechta Dugan, “Reliability Analysis of Complex Hardware-Software Systems”, Proceedings, Annual Reliability and Maintainability Symposium, January 1999, Washington, DC.
3. Géza Szabó and Péter Gáspár, “Practical treatment Methods for Adaptive Components in the Fault-Tree Analysis”, Proceedings, Annual Reliability and Maintainability Symposium, January 1999, Washington, DC.
4. Alfredo H-S. Ang and Wilson H. Tang “Probability Concepts in Engineering Planning and Design, Volume II, Decision Risk and Reliability”, 1990.
5. Milena Krasich, “Use of fault Tree Analysis for Evaluation of System Reliability Improvements in Design Phase.” Proceedings, Annual Reliability and Maintainability Symposium, January 2000, Los Angeles, California
7. ATTACHMENT - TUTORIAL VISUALS
Fault Tree Analysis for Product – Reliability Improvement
Milena Krasich, Bose Corporation, January 23, 2002
1-23-2002 M. Krasich 2
Tutorial Content
• General reliability definitions in accordance with IEC 60050 (IEV 191) (1990), International Electrotechnical Vocabulary, Chapter 191: Dependability and quality of service
• Description of Fault Tree Analysis methodology
• Mathematics (statistics) associated with the Fault Tree Analysis
• Reliability modeling of a complex system using Fault Tree Analysis (FTA), in accordance with IEC 60300-3-1, Dependability Analysis Methods
• Examples of how the FTA is used for reliability improvement of electronics
• Methods for determination of failure probabilities for basic events
• Failure mode mitigation and reliability growth/improvement – a real life example
1-23-2002 M. Krasich 3
Reliability Growth - Improvement
Reliability improvement of a product can be achieved in various phases of its life:
• Design phase
• Test, product validation phase – test reliability growth
• Fielded life – by upgrades, derivatives, recalls, etc.
The most cost-effective reliability improvement is done during the product design.
Product reliability improvement is achieved by:
• Identification of potential design flaws:
  • Component electrical overstress
  • Potential mechanical overstress and failure
  • Inadequate components or parts used
  • Failure of one part caused by the failure of another part
  • Use of parts that are of inferior quality/reliability
• Identification of manufacturing problems
1-23-2002 M. Krasich 4
Reliability Definition and Considerations
Reliability Definition (IEV 191-12-01)
• Probability that an item can perform a required function under given conditions for a given time interval
• Required function: defined by the expected performance, i.e.
  • No audible noise
  • No distortion
  • No bending past the predetermined angle
Measures
• Reliability: Probability of survival to the end of a predetermined period
• Unreliability: Probability of failure before the end of the period
Measure as management sees it:
• Percent of items surviving a predetermined time period – normally warranty period, mission period or other time period requiring proper product operation
1-23-2002 M. Krasich 5
Definition of Failure – IEV 191-04-01
The termination of the ability of an item to perform a required function
Failure of hardware to operate properly due to:
• Design failure: A failure due to inadequate design of an item (to withstand operational or environmental stresses) – improper part or improper use of part in design
• Manufacturing defect causing time-related failures: A fault due to non-conformity during manufacture to the design of an item or to specified manufacturing processes
• Software failures
Failure of software
Failure Cause
• The circumstances during design, manufacture, or use which have led to failure
Failure Mechanism
• The physical, chemical, or other process which led to a failure
1-23-2002 M. Krasich 6
Definition of Failure Mode
Failure mode:
• Manner or state in which an item or a component might fail
Examples:
• Low output of an IC
• Separation of the IC packaging material
• Capacitor fails short due to crack propagation in the dielectric (failure mechanism)
• Resistor fails open, failure cause – poor lead welding
• FET saturation and overheat
• Gain change
• Seal leakage
1-23-2002 M. Krasich 7
Cause of a Failure Mode
Failure or failure mode cause – one failure mode can have multiple causes
Examples:
• Causes of a capacitor short: electrical overstress, high temperature, vehicle vibration, high soldering temperature
• Causes of an IC enclosure failure: moisture, high temperature, IC manufacturing process
• Causes of a component open: poor soldering, manufacturing – breakage in insertion
• Causes of a seal leak in a communication application (under water – ocean bottom): water pressure causing dilatation, cold temperature, wearout during mating and de-mating, material degradation, manufacturing defect (under-size)
1-23-2002 M. Krasich 8
Use of a Fault Tree
Fault Tree Analysis (FTA) is a Boolean representation of a system and its assemblies and functions, along with failure modes and their respective causes.
FTA is used for multiple purposes:
• To model the item/system architecture and functionality with a top-down fault tree logic diagram, in search of potential failure modes, and their respective causes, that might cause an unfavorable outcome defined as a failure of the system
• To quantitatively estimate the item reliability
• To identify those failure modes and causes that are the highest contributors to the item probability of failure
• To evaluate necessary and possible improvements – trade-off
• To assess the item reliability improvement as the potential failure modes are mitigated
1-23-2002 M. Krasich 9
Fault Tree - Introduction
Fault tree
• A logic diagram representing functional dependencies of parts of a system, and the arrangement of events causing unfavorable outcomes – system failure that corresponds to the predetermined failure definition.
Fault tree components
• Gates: Outcomes of one or a combination of input events
• Cut sets: Groups of events that, if all occur, would cause a system failure. A minimal cut set contains the minimum number of events that are required for failure. Removal of any one of them would result in the system not failing.
• Events – Basic events: Usually a failure cause. Gets an assigned value: failure rate, MTBF, or failure probability
1-23-2002 M. Krasich 10
Event
Basic event
• Basic event for which reliability information is available
• Reliability model: Component failure mode, or a failure mode cause
Conditional event
• Event that is a condition of occurrence of another event when both must occur for the output to occur
• Reliability model: Occurrence of an event that must occur for another event to occur
1-23-2002 M. Krasich 11
Events – cont.
Dormant event
• A basic event that represents a dormant failure
• Reliability model: Dormant component failure mode or dormant failure cause
Undeveloped event
• A part of a system not yet developed
1-23-2002 M. Krasich 12
Gates
OR gate
• The output event occurs if any of its input events occurs
• Reliability model: Failure occurs if any of the parts of that system fails – series system
AND gate
• The output event takes place if all of the input events occur
• Reliability model: Parallel redundancy, one out of n equal or different branches
Majority vote gate
• The output occurs if m of the inputs occur
• Reliability model: Redundancy k out of n, where m = n - k + 1
Priority AND gate
• The output event (failure) occurs only if the input events occur in sequence from left to right
• Reliability model: Secondary failures or enabling events
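A minimal sketch of how the OR, AND, and majority vote gates translate into probabilities, assuming statistically independent input events. The function names are hypothetical. The priority AND gate is left out because its result depends on the order of events, which plain probabilities do not carry. The numeric inputs are taken from the gate examples in these visuals.

```python
from itertools import combinations
from math import prod


def or_gate(q):
    """Series model: output occurs if any input occurs."""
    return 1.0 - prod(1.0 - x for x in q)


def and_gate(q):
    """Parallel redundancy: output occurs only if all inputs occur."""
    return prod(q)


def vote_gate(q, m):
    """Exact probability that at least m of the independent inputs occur."""
    n = len(q)
    total = 0.0
    for count in range(m, n + 1):
        for chosen in combinations(range(n), count):
            occurred = set(chosen)
            total += prod(q[i] if i in occurred else 1.0 - q[i] for i in range(n))
    return total


print(or_gate([0.002, 0.0005]))               # two events in series
print(and_gate([0.00045, 0.00053, 0.0032]))   # all three must occur
print(vote_gate([0.0032, 0.0032, 0.0032], 2)) # at least 2 of 3 occur
```

With m = 1 the vote gate reduces to the OR gate, and with m = n it reduces to the AND gate, which is a convenient consistency check.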
1-23-2002 M. Krasich 17
FTA Model with Esary-Proschan Calculation
[Fault tree diagram of the bridge RBD between A and B: blocks 1 and 4 in the upper path, blocks 2 and 5 in the lower path, block 3 linking the two paths.]
Top gate:
• Failure (Q = 9.068e-3): No signal at the output
Intermediate gates and their inputs:
• Top (Q = 1.000e-3): Signal not passing through the top branch; inputs Block 1, Block 2
• Bottom (Q = 7.500e-3): Signal not passing through the bottom branch; inputs Block 4, Block 5
• Cross 1 (Q = 4.800e-4): Signal not going through the top first; inputs Block 1, Block 3, Block 5
• Cross 2 (Q = 1.000e-4): Signal not passing through the bottom block first; inputs Block 2, Block 3, Block 4
Basic events:
• Block 1 fails, Q = 2.000e-2
• Block 2 failure, Q = 5.000e-2
• Block 3 fails, Q = 8.000e-2
• Block 4 fails, Q = 2.500e-2
• Block 5 fails, Q = 3.000e-1
1-23-2002 M. Krasich 16
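Comparison of the FTA Calculation Methods
Esary-Proschan (correct calculations):
$$F_s = 1 - (1 - F_1 F_2)\,(1 - F_4 F_5)\,(1 - F_1 F_3 F_5)\,(1 - F_2 F_3 F_4)$$
Rare approximation:
$$F_{sr} = F_1 F_2 + F_4 F_5 + F_1 F_3 F_5 + F_2 F_3 F_4$$
With $F_1 = 2 \cdot 10^{-2}$, $F_2 = 5 \cdot 10^{-2}$, $F_3 = 8 \cdot 10^{-2}$, $F_4 = 2.5 \cdot 10^{-2}$, $F_5 = 3 \cdot 10^{-1}$:
Esary-Proschan:
$$F_s = 1 - (1 - F_1 F_2)\,(1 - F_4 F_5)\,(1 - F_1 F_3 F_5)\,(1 - F_2 F_3 F_4) = 9.068 \cdot 10^{-3}$$
Rare approximation:
$$F_{sr} = F_1 F_2 + F_4 F_5 + F_1 F_3 F_5 + F_2 F_3 F_4$$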
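The two calculations can be checked numerically with the block values from the fault tree visuals. As a point of comparison, the sketch also evaluates the exact bridge result by conditioning on block 3 (the algebraic decomposition). The variable names are this example's own. For cut sets that share events, the Esary-Proschan product is an upper bound on the exact value, and the rare event sum is a slightly looser one.

```python
# Block failure probabilities from the fault tree visuals.
F1, F2, F3, F4, F5 = 2.0e-2, 5.0e-2, 8.0e-2, 2.5e-2, 3.0e-1

# Minimal cut set probabilities: {1,2}, {4,5}, {1,3,5}, {2,3,4}.
cut_sets = [F1 * F2, F4 * F5, F1 * F3 * F5, F2 * F3 * F4]

# Rare event approximation: plain sum of the cut set probabilities.
f_rare = sum(cut_sets)

# Esary-Proschan: treat the cut sets as if they were independent.
survive_all = 1.0
for c in cut_sets:
    survive_all *= 1.0 - c
f_esary_proschan = 1.0 - survive_all

# Exact value by conditioning on block 3 (good with R3, failed with 1 - R3).
R1, R2, R3, R4, R5 = (1.0 - f for f in (F1, F2, F3, F4, F5))
r_exact = (R3 * (R1 + R2 - R1 * R2) * (R4 + R5 - R4 * R5)
           + (1.0 - R3) * (R1 * R4 + R2 * R5 - R1 * R2 * R4 * R5))
f_exact = 1.0 - r_exact

print(f"rare approximation: {f_rare:.4e}")
print(f"Esary-Proschan:     {f_esary_proschan:.4e}")
print(f"exact (pivot on 3): {f_exact:.4e}")
```

All three values agree to within about one percent here, because the cut set probabilities are small.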
1-23-2002 M. Krasich 15
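Modeling with a Fault Tree – Boolean Algebra
Basis for the Fault Tree: Boolean algebra, used to produce minimal cut sets (or path sets)
Cut sets – the system fails if any one of the cut sets happens: c1 = {1,2}, c2 = {4,5}, c3 = {1,3,5}, c4 = {2,3,4}
$$F_S = \Pr(c_1 \cup c_2 \cup c_3 \cup c_4), \qquad R_S = 1 - F_S$$
$$\Pr(c_1) = F_1 \cdot F_2 = (1 - R_1) \cdot (1 - R_2)$$
Correct calculation (Esary-Proschan):
$$\Pr(c_1 \cup c_2 \cup c_3 \cup c_4) = 1 - [1 - \Pr(c_1)] \cdot [1 - \Pr(c_2)] \cdot [1 - \Pr(c_3)] \cdot [1 - \Pr(c_4)]$$
Rare event approximation:
$$F_S = \Pr(c_1) + \Pr(c_2) + \Pr(c_3) + \Pr(c_4) = F_1 \cdot F_2 + F_4 \cdot F_5 + F_1 \cdot F_3 \cdot F_5 + F_2 \cdot F_3 \cdot F_4$$
[RBD between A and B: blocks 1 and 4 in the upper path, blocks 2 and 5 in the lower path, block 3 linking the two paths.]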
1-23-2002 M. Krasich 14
System Analysis Methods
A “complex” System Reliability Block Diagram (RBD)
Example: Failure: No signal flow from A to B
[RBD between A and B: blocks 1 and 4 in the upper path, blocks 2 and 5 in the lower path, block 3 linking the two paths.]
$$R_S = (R_1 + R_2 - R_1 \cdot R_2) \cdot (R_4 + R_5 - R_4 \cdot R_5) \cdot R_3 + [R_1 \cdot R_4 + R_2 \cdot R_5 - R_1 \cdot R_2 \cdot R_4 \cdot R_5] \cdot (1 - R_3)$$
Algebraic solution meaning: Reliability of the system provided that R3 is good, plus reliability of the system provided R3 is bad.
• When a system is really complex, with a multitude of interrelationships between the assemblies, the algebraic solutions rapidly become too involved.
• Environmental factors and manufacturing errors are left out.
1-23-2002 M. Krasich 13
Gates – cont.
Exclusive OR gate
• The output event takes place if one, but not the other, input occurs
• Reliability model: A failure of the system occurring only if one, not both, of the two possible failures happens
Inhibit gate
• The output occurs only if both (or all) of the input events take place, one of them conditional
• Reliability model: Conditional probability of the final event
Transfer gate
• Gate indicating that this part of the system is developed in another part or page of the diagram
• Reliability reference: A partial reliability block diagram that is shown in another location of the overall system block diagram
1-23-2002 M. Krasich 22
Example – Partial Schematic of a Switching Amplifier
1-23-2002 M. Krasich 21
Priority Gate - Example
Gates:
• TOP1 (Q = 4.374e-6): Fails if Gate 1 OR Gate 2 fails
• GATE1 (Q = 1.000e-6), priority gate: Fails only if EVENT1 occurs first
• GATE2 (Q = 3.374e-6), vote 2: Fails if any two of the events take place
Basic events:
• EVENT1, F1, Q = 0.002
• EVENT2, F2, Q = 0.0005
• EVENT3, F3, Q = 0.00045
• EVENT4, F4, Q = 0.00053
• EVENT5, F5, Q = 0.0032
Gate 1, conditional probability:
• Probability of occurrence of EVENT1 = F1
• Probability of occurrence of EVENT2 if EVENT1 occurred = F2
$$F_{Gate1} = F(\mathrm{EVENT1}) \cdot F(\mathrm{EVENT2} \mid \mathrm{EVENT1})$$
$$F_{Gate2} = F_3 F_4 + F_3 F_5 + F_4 F_5$$
$$F_{TopGate} = 1 - (1 - F_{Gate1})\,(1 - F_{Gate2})$$
1-23-2002 M. Krasich 20
Example: The Redundant Gates are Different
[Block diagram: F1 and F2 in series (Gate 1), followed by F3, F4 and F5 in a 2 out of 3 arrangement (Gate 2); together they form the Top Gate.]
Gates:
• TOP1 (Q = 2.502e-3): Fails if Gate 1 OR Gate 2 fails
• GATE1 (Q = 2.499e-3): Fails if event 1 OR event 2 takes place
• GATE2 (Q = 3.374e-6), vote 2: Fails if any two of the events take place
Basic events:
• EVENT1, F1, Q = 0.002
• EVENT2, F2, Q = 0.0005
• EVENT3, F3, Q = 0.00045
• EVENT4, F4, Q = 0.00053
• EVENT5, F5, Q = 0.0032
With n = 3, m = 2:
$$F_{Gate1} = 1 - (1 - F_1)\,(1 - F_2)$$
$$F_{Gate2} = F_3 F_4 + F_3 F_5 + F_4 F_5$$
$$F_{TopGate} = 1 - (1 - F_{Gate1})\,(1 - F_{Gate2}) = 2.502 \cdot 10^{-3}$$
1-23-2002 M. Krasich 19
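Example: Combination of Series and Redundant Events
[Block diagram: F1 and F2 in series (Gate 1), followed by three identical F3 blocks in a 2 out of 3 arrangement (Gate 2); together they form the Top Gate.]
With n = 3, m = 2, F1 = 0.002, F2 = 0.0005, F3 = 0.0032:
$$F_{Gate1} = 1 - (1 - F_1)\,(1 - F_2)$$
$$F_{Gate2} = \sum_{i=0}^{m-1} \frac{n!}{i!\,(n-i)!}\,(1 - F_3)^{i}\,F_3^{\,n-i}$$
$$F_{TopGate} = 1 - (1 - F_{Gate1})\,(1 - F_{Gate2}) = 2.53 \cdot 10^{-3}$$
Gates:
• TOP1 (Q = 2.530e-3): Fails if Gate 1 OR Gate 2 fails
• GATE1 (Q = 2.499e-3): Fails if event 1 OR event 2 occurs
• GATE2 (Q = 3.072e-5), vote 2: Fails if 2 of the three events take place
Basic events:
• EVENT1, F1, Q = 0.002
• EVENT2, F2, Q = 0.0005
• EVENT3, F3, Q = 0.0032
• EVENT4, F3, Q = 0.0032
• EVENT5, F3, Q = 0.0032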
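The two 2-out-of-3 examples can be reproduced with a few lines. This sketch assumes independent events and uses the binomial sum as reconstructed from the slide for the identical blocks, and the sum of pair products for the different blocks. The variable names are illustrative.

```python
from math import comb

F1, F2 = 0.002, 0.0005
n, m = 3, 2

# Gate 1: series combination of EVENT1 and EVENT2 (OR gate).
f_gate1 = 1.0 - (1.0 - F1) * (1.0 - F2)

# Gate 2 with three identical blocks: sum over i = 0 .. m-1 surviving blocks.
F3_same = 0.0032
f_gate2_same = sum(comb(n, i) * (1.0 - F3_same) ** i * F3_same ** (n - i)
                   for i in range(m))

# Gate 2 with three different blocks: sum of the pair products (rare approximation).
F3, F4, F5 = 0.00045, 0.00053, 0.0032
f_gate2_diff = F3 * F4 + F3 * F5 + F4 * F5

# Top gate: OR of Gate 1 and Gate 2.
f_top_same = 1.0 - (1.0 - f_gate1) * (1.0 - f_gate2_same)
f_top_diff = 1.0 - (1.0 - f_gate1) * (1.0 - f_gate2_diff)

print(f"identical blocks: Gate 2 = {f_gate2_same:.4e}, top = {f_top_same:.4e}")
print(f"different blocks: Gate 2 = {f_gate2_diff:.4e}, top = {f_top_diff:.4e}")
```

The binomial sum gives a Gate 2 value slightly below the 3.072e-5 printed in the tree, which equals 3·F3². The difference does not affect the top gate at the printed precision.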
1-23-2002 M. Krasich 18
FTA Representation of the RBD – Rare Approximation
[Fault tree diagram of the bridge RBD between A and B: blocks 1 and 4 in the upper path, blocks 2 and 5 in the lower path, block 3 linking the two paths.]
Top gate:
• Failure (Q = 9.080e-3): No signal at the output
Intermediate gates and their inputs:
• Top (Q = 1.000e-3): Signal not passing through the top branch; inputs Block 1, Block 2
• Bottom (Q = 7.500e-3): Signal not passing through the bottom branch; inputs Block 4, Block 5
• Cross 1 (Q = 4.800e-4): Signal not going through the top first; inputs Block 1, Block 3, Block 5
• Cross 2 (Q = 1.000e-4): Signal not passing through the bottom block first; inputs Block 2, Block 3, Block 4
Basic events:
• Block 1 fails, Q = 2.000e-2
• Block 2 failure, Q = 5.000e-2
• Block 3 fails, Q = 8.000e-2
• Block 4 fails, Q = 2.500e-2
• Block 5 fails, Q = 3.000e-1
1-23-2002 M. Krasich 27
Building a Fault Tree
• Define the system
• Define its major parts or functions, i.e.:
  • Power supply
  • Video
  • Audio channels
• Determine what phenomenon precludes proper operability of those parts or functions, i.e.:
  • Shorted line voltage or no VCC supplied
  • No video
  • One or more audio channels not operational
• Determine the causes of those phenomena
• Determine the contributing factors to the causes, i.e.:
  • High temperature
  • High humidity
  • Electrical overstress
1-23-2002 M. Krasich 26
Other Important Information from an FTA Software
• Failure frequency (hazard rate of all gates)
• Number of expected failures during the preset lifetime
• Unavailability (or availability) of the system or any gate (function or assembly), provided the system is assumed repairable
• Gate summary in various forms
• Confidence intervals on provided information (failure probability or unavailability)
• Sensitivity analysis – the most critical component variation in probability of occurrence
• Results from failure distributions other than exponential (constant failure rate)
• Results calculated with multiple simulations (we normally set the number of simulations to 10,000)
1-23-2002 M. Krasich 25
Inhibit Gate - Example
Gates:
• TOP1 (Q = 1.001e-6): Fails if Gate 1 OR Gate 2 fails
• GATE1 (Q = 1.000e-6): Fails only if EVENT1 happens before EVENT2
• GATE2 (Q = 7.632e-10): Fails if all of the events take place
Basic events:
• EVENT1, F1, Q = 0.002
• EVENT2, F2, Q = 0.0005
• EVENT3, F3, Q = 0.00045
• EVENT4, F4, Q = 0.00053
• EVENT5, F5, Q = 0.0032
Gate 1, conditional probability.
Gate 2, Inhibit: Outcome occurs only if all three (or any number) of events – or gates – take place.
Example: Three EMI protection capacitors in parallel. No filtering if all of the three fail open.
$$F_{Gate2} = F_3 \cdot F_4 \cdot F_5 = 7.632 \cdot 10^{-10}$$
1-23-2002 M. Krasich 24
I E
FET4 OVERHEAT Q=1.0009e-4
Overheat ofFET due toLGND <2V
Page 1
I E
Short C905 Q=1.4299e-4
Ceramiccapacitor shorts
to ground
I E
Short C906 Q=1.4299e-4
Capacitor shortsbrings Lgroundto the ground
I E
MFG_SHORT_C905
Q=7.0000e-8
Manufactirungdefects cause
a short
PRF_SHORT_C6905
Capacitor shortsdue to part
random failureI E
Q=1.4292e-4
I E
MFG_Short_C906Q=7.0000e-8
Manufactirungdefects cause
a short
I E
DANDREIC SHORT
Q=4.6875e-13
Electrolyte mixedwith debris causing
making a short
PRF_C6906
Capacitor fai lsdue to part
random failureI E
Q=1.4292e-4
SOLDER SHORT_C6906
Excessive soldercausing a short
between the pins orpads
I E
Q=5.0000e-8
DEBRIS_C6906
Debris on thePCB causing a
shortI E
Q=2.0000e-8
I E
LGROUND Q=2.8596e-4
LGND shorted toground causing
improper FET biasand overheat
FET 4 SATURATION
FET saturatesdue to LGND
<2VI E
Q=3.5000e-1
SOLDER SHORT_C6905
Excessive soldercausing a short
between the pins orpads
I E
Q=5.0000e-8
DEBRIS_C6905
Debris on thePCB causing a
shortI E
Q=2 .0000e-8
AGEING
Capacito leakingelectroly te due
to ageingI E
Q=1.0000e-6
HI-TEMP
Electrolyte Leakdue to HighTemperature
I E
Q=1.2500e-7
HI_HUMIDITY
Electrolyteleak due to
high humidityI E
Q=2.0000e-6
I E
EL. CAP LEAK Q=3.1250e-6
Short caused byleaking of the nearby
capacitor
DEBRIS
resence ofdebris on the
boardI E
Q=1.5000e-7
After Capacitor Improvement (0.033 µF replaced 0.1 µF)
1-23-2002 M. Krasich 23
Example of the Priority and AND Gate – Switching Amp Before Improvement
Gates:
• FET4 OVERHEAT (Q = 2.0969e-3): Overheat of FET due to LGND < 2V (Page 1)
• LGROUND (Q = 5.9911e-3): LGND shorted to ground causing improper FET bias and overheat
• Short C905 (Q = 3.0001e-3): Ceramic capacitor shorts to ground
• Short C906 (Q = 3.0001e-3): Capacitor short brings Lground to the ground
• MFG_SHORT_C905 (Q = 7.0000e-8): Manufacturing defects cause a short
• MFG_Short_C906 (Q = 7.0000e-8): Manufacturing defects cause a short
• DANDREIC SHORT (Q = 4.6875e-13): Electrolyte mixed with debris making a short
• EL. CAP LEAK (Q = 3.1250e-6): Short caused by leaking of the nearby capacitor
Basic events:
• FET 4 SATURATION (Q = 3.5000e-1): FET saturates due to LGND < 2V
• PRF_SHORT_C6905 (Q = 3.0000e-3): Capacitor shorts due to part random failure
• PRF_C6906 (Q = 3.0000e-3): Capacitor fails due to part random failure
• SOLDER SHORT_C6905 (Q = 5.0000e-8): Excessive solder causing a short between the pins or pads
• SOLDER SHORT_C6906 (Q = 5.0000e-8): Excessive solder causing a short between the pins or pads
• DEBRIS_C6905 (Q = 2.0000e-8): Debris on the PCB causing a short
• DEBRIS_C6906 (Q = 2.0000e-8): Debris on the PCB causing a short
• AGEING (Q = 1.0000e-6): Capacitor leaking electrolyte due to ageing
• HI-TEMP (Q = 1.2500e-7): Electrolyte leak due to high temperature
• HI_HUMIDITY (Q = 2.0000e-6): Electrolyte leak due to high humidity
• DEBRIS (Q = 1.5000e-7): Presence of debris on the board
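The effect of the capacitor change on the FET overheat gate can be re-derived from the printed values. The tree structure used in this sketch is inferred from the numbers and is an assumption: each capacitor short is taken as an OR of the part random failure and the manufacturing defect, LGROUND as an OR of the two capacitor shorts, and the overheat as LGROUND combined with the FET saturation probability. The very small electrolyte and debris branch (Q = 4.6875e-13) is neglected.

```python
def or_gate(*q):
    """Output occurs if any independent input occurs."""
    survive = 1.0
    for x in q:
        survive *= 1.0 - x
    return 1.0 - survive


MFG_SHORT = 7.0e-8        # manufacturing defect probability per capacitor (printed)
FET_SATURATION = 3.5e-1   # conditioning event probability (printed)


def fet_overheat(prf_short: float) -> float:
    """Assumed structure: (Short C905 OR Short C906) AND FET saturation."""
    short_c905 = or_gate(prf_short, MFG_SHORT)
    short_c906 = or_gate(prf_short, MFG_SHORT)
    lground = or_gate(short_c905, short_c906)
    return lground * FET_SATURATION


q_before = fet_overheat(3.0e-3)     # part random failure before the change
q_after = fet_overheat(1.4292e-4)   # part random failure after the change
print(f"before: {q_before:.4e}, after: {q_after:.4e}, "
      f"ratio: {q_before / q_after:.1f}")
```

Under these assumptions the sketch reproduces the printed top values of 2.0969e-3 and 1.0009e-4, a reduction of roughly a factor of twenty.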
1-23-2002 M. Krasich 32
FTA Representation of CODEC Analysis, cont.
Gates:
• A to D 1 and 2 (Q = 1.7487e-2): Failure of A to D conversion for channel 1 and 2 (Page 5)
• 5 V ANA (Q = 1.0147e-3): 5V analog not delivered or noisy
• Input 1 into A to D (Q = 3.1481e-3): Input 1 to CODEC A to D not available or too noisy (Page 68)
• Input 2 into A to D (Q = 4.2788e-3): Input 2 to CODEC A to D not available (Page 125)
• No 5V Analog (Q = 7.7798e-4): 5V analog not available (Page 44)
• Noise on 5V ANA (Q = 2.3692e-4): High or low frequency noise introduced to the signal (Page 201)
• A_IN_1_+ (Q = 7.2087e-3): Analog input 1 to CODEC not available (Page 71)
• A_IN_2_+ (Q = 8.1431e-3): Analog input 2 to CODEC not available (Page 69)
• Fail_U20 (Q = 1.4721e-3): U20 failure (Page 200)
• Analog Inputs 1 & 2 (Q = 5.9081e-3): Analog inputs 1 and/or 2 not available
Causes of failure:
• One of the plus inputs (1 or 2) not provided to the converter
• No 5V analog supply voltage provided
• IC U20 not operational
Page 30
1-23-2002 M. Krasich 31
FTA Representation of CODEC Analysis
Failure: No analog output from CODEC; one of the reasons: no analog inputs into it – 1 or 2. Go to page 30 for the analog inputs.
Gates and events:
• Analog Outputs 1 and 2 (Q = 3.3535e-2): No analog output from CODEC available (Page 1)
• A to D 1 and 2 (Q = 1.7487e-2): Failure of A to D conversion for channel 1 and 2 (Page 30)
• Digital from U20 (Q = 1.4648e-2): One or more digital outputs from U20 not available (Page 29)
• D to A for A_OUT_1&2+ (Q = 1.5972e-2): D to A conversion for analog outputs 1 and 2
• A_OUT_1 and 2 (Q = 5.5314e-4): Analog outputs not available
• D input to U21 (Q = 1.4648e-2): No digital input provided for the U21 (Page 27)
• DAC_1_DAT (Q = 0.0000): No data available from CAD_1
• Fail_U21 (Q = 3.2979e-4): U21 failure (Page 198)
• 5 V ANA to U21 (Q = 1.0147e-3): 5V analog not delivered or noisy (Page 43)
• A_OUT_1 (Q = 2.7661e-4): Analog output 1 not available (Page 66)
• A_OUT_2 (Q = 2.7661e-4): Analog output 2 not available (Page 65)
Page 5
1-23-2002 M. Krasich 30
Rationale for Analysis of A to D Conversion Input Circuit
The entire circuit will not work if:
• No voltage is supplied to the analog input (pin 8): open R206 or R208 (if open – slight non-audible distortion), or short C174 or C176 (if any of the caps open, no failure)
• No 5V analog is supplied to pin 7: C181 or C183 fail short
• U20 fails in whichever mode (low, high, or no output)
There will be no output to the D to A conversion and the rest of the amp if failed open: R214, R215, R218, and R219 (if shorted – not too much harm)
Not all failure modes need to be considered if not important to the failure definition – realistic prediction
1-23-2002 M. Krasich 29
Rationale for Analysis of A to D Input Circuit
• For the amplifier to be operational, all signals have to be processed by CODEC – coded and decoded
• In CODEC, the analog signal is converted to digital, and then again into analog for the analog output
• The input signal 1+ into the left channel of IC U20 is interrupted if:
  • Components fail open: R200, R209, C171
  • C179 shorts to ground (shorting the signal)
• The input signal 2+ into the right channel of IC U20 is interrupted if:
  • Components fail open: R201, R205, C172
  • C177 shorts to ground (shorting the signal)
• Opening of C117 might cause some noise, which will be filtered later in the circuit
1-23-2002 M. Krasich 28
Example – Input to CODEC of an Amplifier
1-23-2002 M. Krasich 37
Contribution of Manufacturing Defects
Contribution to components failing open:
• Cold or insufficient solder: Connection opens over time due to solder fatigue or vibration
• Missing components: A surprisingly large number of components are not inserted during assembly – detected later when the function is exercised
• Components cracked during insertion
• Broken or bent pins or leads
Contribution to components failing short:
• Debris (un-cleaned flux) left on the board that, with dendritic growth, causes a short
• Excessive solder
• Bent pins (mostly ICs and connectors) shorting with another pin
1-23-2002 M. Krasich 36
High or Low Frequency Noise into the CODEC
Gates:
• Noise on 5V ANA (Q = 2.3692e-4): High or low frequency noise introduced to the signal (Page 30)
• Open_El_C183 (Q = 6.1851e-5): Open capacitor causes low frequency noise
• Open_C181 (Q = 1.7508e-4): Open capacitor causes high frequency noise
• MFG_Open_El_C183 (Q = 1.3000e-8): Capacitor connections open due to the manufacturing defect
• MFG_Open_C181 (Q = 1.3000e-8): Capacitor connections open due to the manufacturing defect
Basic events:
• PRF_Open_El_C183 (Q = 6.18377e-005): Capacitor fails open due to the part random failure
• Cold solder_El_C183 (Q = 1.2e-008): Connection opens due to insufficient or improper soldering
• Missing_El_C183 (Q = 1e-009): Part not inserted during assembly
• PRF_Open_C181 (Q = 0.000175069): Capacitor fails open due to the part random failure
• Cold solder_C181 (Q = 1.2e-008): Connection opens due to insufficient or improper soldering
• Missing_C181 (Q = 1e-009): Part not inserted during assembly
1-23-2002 M. Krasich 35
Failure Due to No Analog Voltage Supply
Gates:
• No 5V Analog (Q = 7.7798e-4): 5V analog not available (Page 30)
• Short_C181 (Q = 2.7416e-4): Capacitor fails short, shorting +5V analog to the ground
• Short_EL_C183 (Q = 1.1136e-4): The 5V analog shorts to ground, no voltage for pin 7 of U20
• +5V_ANA (Q = 3.9264e-4): Voltage not available (Page 67)
• MFG_Short_C181 (Q = 7.0000e-8): Connection short due to the manufacturing defect
• MFG_Short_El_C183 (Q = 7.0000e-8): Connection short due to the manufacturing defect
Basic events:
• PRF_Short_C181 (Q = 0.000274094): Capacitor fails short due to the part random failure
• Debris_C181 (Q = 2e-008): Debris on the PCB causing dendritic growth and a short
• Solder_short_C181 (Q = 5e-008): Excessive solder causing a short between the pins or pads
• PRF_Short_El_C183 (Q = 9.36354e-005): Capacitor fails short due to the part random failure
• PRF_Leak_El_C183 (Q = 1.76564e-005): Electrolyte leak due to the capacitor random failure
• Debris_El_C183 (Q = 2e-008): Debris on the PCB causing dendritic growth and a short
• Solder_short_El_C183 (Q = 5e-008): Excessive solder causing a short between the pins or pads
1-23-2002 M. Krasich 34
Signal Noisy or Interrupted Due to Open Components
Top gate:
• Open Comp (Q = 4.1504e-4): Open components interrupting the signal or causing noise (Page 68)
Component gates, each with its part random failure (PRF) event and its manufacturing defect gate (MFG, with cold solder and missing part events):
• Open_El_C171 (Q = 1.2778e-4): Open capacitor interrupts the signal. PRF_Open_El_C171 Q = 0.000127767; MFG_Open_El_C171 Q = 1.3000e-8 (Cold solder_El_C171 Q = 1.2e-008, Missing_El_C171 Q = 1e-009)
• Open_R200 (Q = 5.1649e-5): Resistor fails open, signal interrupted. PRF_Open_R200 Q = 5.16358e-005; MFG_Open_R200 Q = 1.3000e-8 (Cold solder_R200 Q = 1.2e-008, Missing_R200 Q = 1e-009)
• Open_R209 (Q = 5.1649e-5): Resistor fails open, signal interrupted. PRF_Open_R209 Q = 5.16358e-005; MFG_Open_R209 Q = 1.3000e-8 (Cold solder_R209 Q = 1.2e-008, Missing_R209 Q = 1e-009)
• Open_R206 (Q = 5.1649e-5): Resistor fails open, +2.3 V not available for the analog input. PRF_Open_R206 Q = 5.16358e-005; MFG_Open_R206 Q = 1.3000e-8 (Cold solder_R206 Q = 1.2e-008, Missing_R206 Q = 1e-009)
• Open_C179 (Q = 1.3238e-4): Open capacitor causes high frequency noise on the input. PRF_Open_C179 Q = 0.000132368; MFG_Open_C179 Q = 1.3000e-8 (Cold solder_C179 Q = 1.2e-008, Missing_C179 Q = 1e-009)
Event descriptions:
• PRF_Open: Component fails open due to the part random failure
• MFG_Open: Component connections open due to the manufacturing defect
• Cold solder: Connection opens due to insufficient or improper soldering
• Missing: Part not inserted during assembly
Page 140
1-23-2002 M. Krasich 33
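The probabilities in this tree can be cross-checked numerically. The sketch below assumes that every gate in the tree is an OR gate over statistically independent events, so that Q = 1 - (1 - q1)(1 - q2)...; the gate symbols are not readable in the extracted text, and the helper name `or_gate` is illustrative only. Under that assumption the basic-event values reproduce the printed intermediate values and the printed Open Comp value to the displayed precision.

```python
from math import prod


def or_gate(probabilities):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - q for q in probabilities)


# Basic-event probabilities copied from the tree:
# (part random failure, cold solder, missing part)
basic_events = {
    "C179": (0.000132368, 1.2e-8, 1e-9),
    "R206": (5.16358e-5, 1.2e-8, 1e-9),
    "R209": (5.16358e-5, 1.2e-8, 1e-9),
    "C171": (0.000127767, 1.2e-8, 1e-9),
    "R200": (5.16358e-5, 1.2e-8, 1e-9),
}

open_by_part = {}
for ref, (prf, cold_solder, missing) in basic_events.items():
    mfg_open = or_gate([cold_solder, missing])    # manufacturing defect
    open_by_part[ref] = or_gate([prf, mfg_open])  # part fails open

# Top event of this tree: any one of the five components is open.
q_open_comp = or_gate(open_by_part.values())

for ref, q in open_by_part.items():
    print(f"Open_{ref}: Q={q:.4e}")
print(f"Open Comp: Q={q_open_comp:.4e}")
```

Because all the probabilities are small, the OR-gate result is close to the plain sum of the inputs; the exact product form matters only in the last printed digits.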
Input 1 into A to D Q=3.1481e-3
Input 1 to CODEC A to D not available or too noisy
Page 30
Short_C179 Q=2.0730e-4
Capacitor fails short, shorting signal 1 to the ground
2.3V Q=2.6595e-3
2.3 V supply
Page 126
Open Comp Q=4.1504e-4
Open components interrupting the signal or causing noise
Page 140
MFG_Short_C179 Q=7.0000e-8
Connection short due to the manufacturing defect
PRF_Short_C179 Q=0.000207226
Capacitor fails short due to the part random failure
Debris_C179 Q=2e-008
Debris on the PCB causing dendritic growth and a short
Solder_short_C179 Q=5e-008
Excessive solder causing a short between the pins or pads
Input 1 Not Available
• PRF: "random" failure of the part
• Failure probabilities are assigned to the manufacturing process as a quality requirement
Page 68
1-23-2002 M. Krasich 42
FTA Top Level – Audio/Video Console example
• Start from the system top level
• Include only those failure modes that affect the system performance
• Represent system architecture: functional, hardware, or mix
• When work is completed, look for the highest contributor to unreliability
Console
Postman System Q=7.365e-2
System failure or improper operation
ANALOG SIGNAL Q=1.162e-2
Analog signal not available
Page 2
Power Supply Q=3.280e-3
No or improper power delivered to the system
Page 12
Video Q=3.221e-3
No video
Page 8
SPDIF Q=4.946e-2
No SPDIF both zones
Page 13
Functions Q=3.464e-2
Failure of these functions causes noticeable difference
Page 11
Tuner Q=4.423e-3
Tuner failure
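Finding the highest contributor at this level amounts to sorting the branch probabilities. The sketch below uses the Q values printed in the top-level tree; the variable names are illustrative. It only orders the branches and does not try to recombine them into the top-event value, since the branches may share events.

```python
# Top-level branch probabilities copied from the tree above.
branches = {
    "ANALOG SIGNAL": 1.162e-2,
    "Power Supply": 3.280e-3,
    "Video": 3.221e-3,
    "SPDIF": 4.946e-2,
    "Functions": 3.464e-2,
    "Tuner": 4.423e-3,
}

# Sort from the highest to the lowest contributor to unreliability.
ranked = sorted(branches.items(), key=lambda item: item[1], reverse=True)

for name, q in ranked:
    print(f"{name:15s} Q={q:.3e}")

highest_contributor = ranked[0][0]
print("Follow down first:", highest_contributor)
```

With these values SPDIF ranks first, which is the branch (Page 13) that the deck follows down when it looks at the highest contributor.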
1-23-2002 M. Krasich 41
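Probability of the Seal Wear
• The wear or spiral fracture of the Parker Fluorocarbon seals is noticed when the squeeze was 0.017 per side; this is the failure definition for a 0.210” cross section
• Abrasion resistance of Fluorocarbon is determined (Parker Handbook) to be good with the properly determined seal compression (squeeze)
• Radius of the above seal is found from: [equation not recoverable from the extracted text; surviving values: 2, 0.21, 0.2585]
• Ratios of the one-sided compression and the respective radii are $r_1$ and $r_2$, each a compression value divided by a radius $\rho$ [surviving compression values: 0.017 and 0.004]
• The probability of the actual seal failure in ten years of life is $F(10\ \text{years}) = 1.464 \cdot 10^{-6}$, obtained from the normal distribution function $\Phi$ of an expression in $r_1$ and $r_2$ with coefficients 0.3 and 0.1 [full expression not recoverable]
1-23-2002 M. Krasich 40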
1-23-2002 M. Krasich 40
Example of Failure Probability Calculations
Automotive amplifier
• Life expectancy: 15 years
• Average active time (ON) daily: 2.7 hours
Assumptions:
• Car stereo ON when driving: automotive or Ground Mobile (GM) environment
• Car stereo OFF while car parked: stationary, thermally uncontrolled environment (GF); dormancy applies
Component probability of failure can be calculated as:
$$F(15\ \text{years}) = 1 - \exp(-\lambda_{GM} \cdot t_{ON}) \cdot \exp(-\lambda_{GFD} \cdot t_{OFF})$$
$$t_{ON} = 365 \cdot 15 \cdot 2.7\ \text{hours}$$
$$t_{OFF} = 365 \cdot 15 \cdot (24 - 2.7)\ \text{hours}$$
$$\lambda_{GFD} = \lambda_{GF} \cdot d, \quad \text{where } d = \text{dormancy factor} \le 0.1$$
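The calculation on this slide can be sketched in a few lines. The function name and the two failure rates in the example call are hypothetical and are not taken from the tutorial; only the 15-year life, the 2.7 hours ON per day, and the dormancy factor of 0.1 come from the slide.

```python
import math


def failure_probability(lambda_gm, lambda_gf, dormancy_factor=0.1,
                        life_years=15, hours_on_per_day=2.7):
    """F(life) for a part that is ON part of each day and dormant otherwise.

    lambda_gm: failure rate per hour while operating (GM environment)
    lambda_gf: failure rate per hour in the GF environment, before dormancy
    """
    t_on = 365 * life_years * hours_on_per_day          # operating hours
    t_off = 365 * life_years * (24 - hours_on_per_day)  # dormant hours
    lambda_gfd = dormancy_factor * lambda_gf            # dormant failure rate
    return 1.0 - math.exp(-lambda_gm * t_on) * math.exp(-lambda_gfd * t_off)


# Hypothetical failure rates, for illustration only.
example_f = failure_probability(lambda_gm=2.0e-8, lambda_gf=5.0e-9)
print(f"F(15 years) = {example_f:.4e}")
```

Over the 15-year life the part is ON for about 14,800 hours and OFF for about 116,600 hours, so even a small dormant failure rate can contribute noticeably to the total.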
1-23-2002 M. Krasich 39
Part of the Failure Mode Probability Worksheet
pn | desc | ref | rem | fr | Failure mode ratio | Failure rate | Dormant FR | R(Ta) | F0 | F1
191470-332 | CAP,0603,X7R,50V,3300PF | C540 | PRF_C540 | 0.0089 | - | 8.937E-09 | 8.937E-10 | 0.999922 | 7.8285E-05 | 7.8285E-06
- | - | - | PRF_Short_C540 | - | 0.75 | 6.7028E-09 | 6.7028E-10 | 0.999941 | 5.8714E-05 | 5.8714E-06
- | - | - | PRF_ChValue_C540 | - | 0.1 | 8.937E-10 | 8.937E-11 | 0.999992 | 7.8288E-06 | 7.8288E-07
- | - | - | PRF_Open_C540 | - | 0.15 | 1.3406E-09 | 1.3406E-10 | 0.999988 | 1.1743E-05 | 1.1743E-06
191470-473 | CAP,0603,X7R,50V,.047UF | C541 | PRF_C541 | 0.0114 | - | 1.1351E-08 | 1.1351E-09 | 0.999901 | 9.943E-05 | 9.943E-06
- | - | - | PRF_Short_C541 | - | 0.75 | 8.5133E-09 | 8.5133E-10 | 0.999925 | 7.4573E-05 | 7.4573E-06
- | - | - | PRF_ChValue_C541 | - | 0.1 | 1.1351E-09 | 1.1351E-10 | 0.99999 | 9.9434E-06 | 9.9434E-07
- | - | - | PRF_Open_C541 | - | 0.15 | 1.7027E-09 | 1.7027E-10 | 0.999985 | 1.4915E-05 | 1.4915E-06
254110 | DIODE,SCHOTTKY,40V,3A,S | D803 | PRF_D803 | 0.01 | - | 9.95E-09 | 9.95E-10 | 0.999913 | 8.7158E-05 | 8.7158E-06
- | - | - | PRF_Short_D803 | - | 0.2 | 1.99E-09 | 1.99E-10 | 0.999983 | 1.7432E-05 | 1.7432E-06
- | - | - | PRF_Open_D803 | - | 0.45 | 8.955E-10 | 8.955E-11 | 0.999992 | 7.8445E-06 | 7.8445E-07
- | - | - | PRF_ParamCh_D803 | - | 0.35 | 6.965E-10 | 6.965E-11 | 0.999994 | 6.1013E-06 | 6.1013E-07
135247-5232 | DIODE,ZEN,5.6V,225MW,5% | D306 | PRF_D306 | 0.003 | - | 3E-09 | 3E-10 | 0.999974 | 2.628E-05 | 2.628E-06
- | - | - | PRF_Short_D306 | - | 0.2 | 6E-10 | 6E-11 | 0.999995 | 5.256E-06 | 5.256E-07
- | - | - | PRF_Open_D306 | - | 0.45 | 2.7E-10 | 2.7E-11 | 0.999998 | 2.3652E-06 | 2.3652E-07
- | - | - | PRF_ParamCh_D306 | - | 0.35 | 2.1E-10 | 2.1E-11 | 0.999998 | 1.8396E-06 | 1.8396E-07
147239 | DIODE,DUAL,SOT-23,BAW56 | D206 | PRF_D206 | 0.0101 | - | 1.0146E-08 | 1.0146E-09 | 0.999911 | 8.8875E-05 | 8.8875E-06
- | - | - | PRF_Short_D206 | - | 0.51 | 5.1745E-09 | 5.1745E-10 | 0.999955 | 4.5327E-05 | 4.5327E-06
- | - | - | PRF_Open_D206 | - | 0.29 | 1.5006E-09 | 1.5006E-10 | 0.999987 | 1.3145E-05 | 1.3145E-06
- | - | - | PRF_ParamCh_D206 | - | 0.2 | 1.0349E-09 | 1.0349E-10 | 0.999991 | 9.0656E-06 | 9.0656E-07
147239 | DIODE,DUAL,SOT-23,BAW56 | D707 | PRF_D707 | 0.0101 | - | 1.0146E-08 | 1.0146E-09 | 0.999911 | 8.8875E-05 | 8.8875E-06
- | - | - | PRF_Short_D707 | - | 0.51 | 5.1745E-09 | 5.1745E-10 | 0.999955 | 4.5327E-05 | 4.5327E-06
- | - | - | PRF_Open_D707 | - | 0.29 | 1.5006E-09 | 1.5006E-10 | 0.999987 | 1.3145E-05 | 1.3145E-06
- | - | - | PRF_ParamCh_D707 | - | 0.2 | 1.0349E-09 | 1.0349E-10 | 0.999991 | 9.0656E-06 | 9.0656E-07
147239 | DIODE,SWITCHING,75V,200 | D702 | PRF_D702 | 0.0172 | - | 1.72E-08 | 1.72E-09 | 0.999849 | 0.00015066 | 1.5066E-05
- | - | - | PRF_Short_D702 | - | 0.92 | 1.5824E-08 | 1.5824E-09 | 0.999861 | 0.00013861 | 1.3861E-05
- | - | - | PRF_Open_D702 | - | 0.08 | 1.2659E-09 | 1.2659E-10 | 0.999989 | 1.1089E-05 | 1.1089E-06
147239 | DIODE,SOT-23,BAV 99 | D100 | PRF_D100 | 0.0101 | - | 1.0146E-08 | 1.0146E-09 | 0.999911 | 8.8875E-05 | 8.8875E-06
- | - | - | PRF_Short_D100 | - | 0.51 | 5.1745E-09 | 5.1745E-10 | 0.999955 | 4.5327E-05 | 4.5327E-06
- | - | - | PRF_Open_D100 | - | 0.29 | 1.5006E-09 | 1.5006E-10 | 0.999987 | 1.3145E-05 | 1.3145E-06
- | - | - | PRF_ParamCh_D100 | - | 0.2 | 1.0349E-09 | 1.0349E-10 | 0.999991 | 9.0656E-06 | 9.0656E-07
147239 | DIODE,SOT-23,BAV 99 | D101 | PRF_D101 | 0.0101 | - | 1.0146E-08 | 1.0146E-09 | 0.999911 | 8.8875E-05 | 8.8875E-06
- | - | - | PRF_Short_D101 | - | 0.51 | 5.1745E-09 | 5.1745E-10 | 0.999955 | 4.5327E-05 | 4.5327E-06
- | - | - | PRF_Open_D101 | - | 0.29 | 1.5006E-09 | 1.5006E-10 | 0.999987 | 1.3145E-05 | 1.3145E-06
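The C540 rows of the worksheet can be reproduced with a short calculation. The sketch below assumes that each mode failure rate is the part failure rate multiplied by the failure mode ratio, and that F0 is the probability of failure over 8,760 hours (one year of continuous operation) under a constant failure rate. Neither the time base nor the meaning of the F0 column is stated in the extracted worksheet; both are inferred from the numbers, and the names used are illustrative.

```python
import math

HOURS = 8760  # assumed time base Ta: one year of continuous operation

part_rate = 8.937e-9  # failure rate per hour of PRF_C540, from the worksheet
mode_ratios = {"Short": 0.75, "ChValue": 0.10, "Open": 0.15}


def unreliability(rate, hours):
    """F = 1 - exp(-rate * hours), computed accurately for small values."""
    return -math.expm1(-rate * hours)


part_f0 = unreliability(part_rate, HOURS)
print(f"PRF_C540: F0={part_f0:.4e}, R={1.0 - part_f0:.6f}")

mode_rows = {}
for mode, ratio in mode_ratios.items():
    mode_rate = ratio * part_rate  # apportion the part rate to the mode
    mode_rows[mode] = (mode_rate, unreliability(mode_rate, HOURS))
    print(f"PRF_{mode}_C540: rate={mode_rate:.4e}, F0={mode_rows[mode][1]:.4e}")
```

The ratios for one part sum to 1, so the mode failure rates add back up to the part failure rate.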
1-23-2002 M. Krasich 38
Values for the Basic Events
Electrical components
• Information from manufacturers (life test data)
• Need to be adjusted for the proper environment and stresses
• Software databases
• Field use (last resort)
Mechanical components
• Determine stresses - loads (mechanical, environmental)
• Construct stress/strength equation for multiple loads if required
• Calculate design (safety) margin and reliability (probability of failure) for the required life
Manufacturing defects
• Factory data
• Field failure data
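For the mechanical components, one common form of the stress/strength calculation is the normal interference model. The sketch below assumes independent, normally distributed stress and strength, which the slide does not state; the function names and the numbers in the example call are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def stress_strength(mean_strength, sd_strength, mean_stress, sd_stress):
    """Return (safety margin, reliability) for normal stress and strength."""
    margin = (mean_strength - mean_stress) / math.hypot(sd_strength, sd_stress)
    return margin, normal_cdf(margin)


# Hypothetical values in arbitrary but consistent units.
margin, reliability = stress_strength(
    mean_strength=100.0, sd_strength=8.0, mean_stress=60.0, sd_stress=6.0
)
print(f"safety margin = {margin:.2f}")
print(f"reliability   = {reliability:.7f}")
print(f"probability of failure = {1.0 - reliability:.3e}")
```

The probability of failure obtained this way can be entered as the basic-event value for the mechanical part, in the same way as the electrical part failure probabilities.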
1-23-2002 M. Krasich 47
The Benefit of FTA for the Design Reliability Growth
[Figure: Reliability (axis from 0.8 to 1.02) versus Design Time (Days) (axis from 0 to 300), with curves for System, Subwoofer, and Console]
If 100,000 systems are produced in one year, 9,250 fewer will be returned for repair within the warranty period as a result of the reliability improvement
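The arithmetic behind this statement is a single multiplication. The sketch below works backwards from the two numbers on the slide to the implied improvement in the probability of surviving the warranty period; the function and variable names are illustrative.

```python
def fewer_returns(units_produced, reliability_improvement):
    """Expected reduction in warranty returns for a given reliability gain."""
    return units_produced * reliability_improvement


units = 100_000  # systems produced in one year (from the slide)
avoided = 9_250  # fewer warranty returns (from the slide)

# Improvement in warranty-period reliability implied by the two numbers.
implied_improvement = avoided / units
print(f"implied reliability improvement = {implied_improvement:.4f}")
print(f"returns avoided = {fewer_returns(units, implied_improvement):.0f}")
```

In other words, the quoted saving corresponds to a gain of 9.25 percentage points in system reliability over the warranty period.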
1-23-2002 M. Krasich 46
Fault Tree Analysis for Reliability Growth - Summary
• Define what constitutes a system failure
• Start with the unfavorable outcome that defines the system failure
• Construct the fault tree down, using logic to express reliability modeling techniques
• Follow the analysis: failure of what assembly, signal, or part will cause the particular failure
• Develop down to the causes of the pertinent failure modes
• Determine probabilities of occurrence of individual causes
• Identify the highest unreliability contributor or safety-related failure modes and mitigate
• Improve reliability as necessary and possible
• Update the analysis, monitor reliability until the goal is met
1-23-2002 M. Krasich 45
Audio/Video Console Reliability Growth Monitoring
[Figure: Console Reliability (axis from 0.91 to 1) versus Duration of the design period (days) (axis from 0 to 300)]
Console goal R(1 year) = 0.992
Curves: Planned Reliability Growth, Achieved Reliability Growth
Annotations on the achieved curve: Initially calculated; Y5V caps replaced by X7R (116); TSSOPs replaced by SOICs; Transistors and FETs from a more reliable vendor
1-23-2002 M. Krasich 44
Detailed Failure Modes and Causes
Cause 1: TSSOPs
Cause 2: Caps with Y5V dielectric
1-23-2002 M. Krasich 43
The Highest Contributor to Unreliability - Example
• Follow the highest hitter down to its subassemblies
• Look for the highest contributor to its unreliability
Page 13