Design Reliability 101 - Tips,
Techniques and How to
#eelive Produced by EE Times
Introduction
To ensure design quality regardless of end application, the developing
organisation must have a defined engineering governance policy.
Engineering Governance
[Figure: engineering governance, showing the Engineering Life Cycle and the Delivery Life Cycle against the project phases]
Project phases: Concept, Initiation, Planning, Execute (Doing), Closure, In-Service
• Engineering Life Cycle
– Stages: Project Definition, Implementation
– Milestones and reviews: Expression of Interest / RFQ, Kick Off, SRR, SDR, PDR, CDR / TRR / TRB
– Review outcomes shown in the figure
• Functional Baseline Established
• Requirements Baseline Established
• Demonstration of Preliminary Design against Functional and Requirement Baselines
• Demonstration of design maturity against Functional and Requirement Baselines
• Ensures test articles, facilities, equipment, personnel and test procedures are available
• Review of test and qualification results against Requirements Baseline
• Delivery Life Cycle
– Stages: Business Case Generation, Business Case Delivery, In-Service Delivery
– IMPACTS
– Gate 1 Register of Interest
– Gate 2 Bid/No Bid
– Gate 3a Bid Release
– Gate 3b Contract Review / Project Start
– Gate 4 Delivery Review
– Gate 5 Out of Service Review
– Gate 6 Disposal/Close
Introduction
• System designers and engineers use reliability / safety engineering
techniques to address a number of areas
– Regulatory compliance – UL, CE
– Product safety – 2001/95/EC General Product Safety Directive
– Demonstrate suitability for the end application
• Aerospace / Military
• Medical
• Industrial and process
• Automotive
• Telecommunications
– Ensure warranty period can be achieved
It is not all just about MTBF
• IEC 61508 – Functional Safety of Electrical / Electronic /
Programmable Electronic Safety-related Systems. Introduces the
well-known Safety Integrity Level (SIL) and has led to many variants:
– IEC 62279 – Railway Implementation
– IEC 61511 – Process Industry Implementation
– IEC 61513 – Nuclear Industry Implementation
– IEC 62061 – Machinery Implementation
– ISO 26262 – Automotive Electrical / Electronic Systems
• DO-254 – Design Assurance Guidance for Airborne Electronic
Hardware
Introduction
Many different standards
Introduction
• The needs for reliability engineering differ between applications
Depending upon the end application, your reliability engineering
approach can be tailored to best suit it.
Introduction
• Requirements capture
– Baseline of system, Operating life, Environment, Required Probability of success
or MTBF
• Design for reliability
– De-rating, Redundancy, ECC, Event Logs, Scrubbing, Command Sequences,
PBIT, CBIT and IBIT, Watchdogs, Elapsed Time Counters, Single Points of
Failure, Failure Propagation.
• Part Stress Analysis
– Determine stresses on components
• Independent Design Reviews
– Peer review to catch gotchas
What techniques can we use
Introduction
• Reliability Block Diagrams
– Graphical representation of the system from a reliability standpoint.
• Failure Modes, Effects and Criticality Analysis (FMECA)
– Addresses failure modes of components, detection mechanisms and effects on
the system.
• Accelerated testing
– Allows MTBF to be confirmed in a shorter time scale
• Burn In
– Allows screening out of infant mortality
What techniques can we use
Introduction
• Design Analysis to demonstrate failure mode effects
• Design Analysis to demonstrate stress upon components
• Reliability block diagram
• Mean Time Between Failure
• Probability of Success
The end result
Introduction
What happens when we do not consider it
The basics
If a system has an MTBF of 8766 hours (1
year), does that guarantee the system will
operate without error for a year?
What is Mean Time Between Failure and What does it mean
The basics
No. The MTBF is used as an input to calculate the probability of success
[P(s)] of a mission of a specified duration, but on its own it does not
ensure the system will operate error free for the MTBF. Assuming a constant
failure rate, the probability of success is given by
P(s) = e^(-t/MTBF)
where t is the mission time in hours. For a mission time equal to the MTBF, P(s) = e^(-1), or about 0.37.
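The relationship between mission time, MTBF and probability of success can be sketched in a few lines of Python. This is a minimal illustration that assumes a constant failure rate (the exponential model used on the slide); the function name and the sample mission times are chosen here for illustration only.

```python
import math

HOURS_PER_YEAR = 8766  # one year, as used on the slide


def probability_of_success(mission_hours: float, mtbf_hours: float) -> float:
    """Return P(s) = e^(-t/MTBF) for a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)


# Probability of completing missions of various lengths with an MTBF of 1 year.
for fraction in (0.1, 0.5, 1.0):
    t = fraction * HOURS_PER_YEAR
    p = probability_of_success(t, HOURS_PER_YEAR)
    print(f"t = {t:7.1f} h  ->  P(s) = {p:.3f}")
```

Running the sketch shows the probability of success falling well below one half by the time the mission duration reaches the MTBF.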
What is Mean Time Between Failure and What does it mean
The basics
[Figure: Probability of Success for MTBF of 1 Year. Vertical axis: Probability of Success, 0 to 1.2. Horizontal axis: Time in Hours, 0 to 16000 in steps of 500.]
What is Mean Time Between Failure and What does it mean
The Basics
Development flow – For a simple module
• Requirements
– Function, Availability, Operational Life, Environment
• System
– Redundancy, Failure modes, Single Points of Failure, Interfaces, Telemetry & Commands
• HW Architecture
– Redundancy Switching, Module IF, Failure Propagation
• HW Design
– Defensive Design, Isolation, Component Selection
• FPGA Design
– Internal Architecture, Replication (TMR), ECC, IO allocation
• Analysis
– Part Stress Analysis, Worst Case Analysis, FMECA, Reliability Block Diagrams
Requirements
• Identification of the stakeholders in the project
– Customer – Most Obvious
– Regulatory – FAA, ESA, NASA
– Internal stakeholders – design, manufacturing, test
• Requirement specification tree
– System Requirements
– Sub System Requirements
– Fully traceable back to the level above
– Eases verification and compliance
A solid base
Requirements
• Requirements shall contain
– System life time
– Operating environments
• Temperature – Maximum, Minimum, Rate of Change
• Shock and Vibration
• Radiation
• EMC / EMI / ESD
– MTBF or even better Probability of Success
– Permissible Down time
A solid base
Requirements
[Figure: requirement specification tree, with the System Requirement flowing down to Sub System A and Sub System B]
The basics
System Level
• Not only the architecture of the system but also how it is verified and
tested.
• Architecture of solution
– Redundancy
• Inter or Intra module
• Hot or Cold Redundant and system impacts
– Equipment interfaces – single connector or prime and redundant
– Dangerous Failure modes – can the fault propagate, or worse
– Telemetry and Event logs – health monitoring and recording
– Critical command sequences – Separate ARM & FIRE commands
– Acceptable Error Rate – BER, ECC on RAMs, etc.
Let the fun begin
System Level
• Introduced when a module cannot be demonstrated to meet the
probability of success required.
• Requires Independence between module A and B, especially on
Input and Output
• Significantly increases the probability of success
Redundancy
[Figure: a single Module between input (IP) and output (OP), compared with redundant Module A and Module B between IP and OP]
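The gain from redundancy can be estimated with a short calculation. The sketch below assumes two independent modules in hot redundancy where either one is sufficient for success, and it ignores the failure rate of any switching or voting logic; these are simplifying assumptions made for illustration, not statements from the slides. The function names are hypothetical.

```python
import math


def p_success(mission_hours: float, mtbf_hours: float) -> float:
    """Single module: P(s) = e^(-t/MTBF), constant failure rate assumed."""
    return math.exp(-mission_hours / mtbf_hours)


def p_success_redundant(p_a: float, p_b: float) -> float:
    """Two independent modules; the mission fails only if both fail."""
    return 1.0 - (1.0 - p_a) * (1.0 - p_b)


# One-year mission with modules that each have an MTBF of one year.
p_single = p_success(8766, 8766)
p_pair = p_success_redundant(p_single, p_single)
print(f"single module:  {p_single:.3f}")
print(f"redundant pair: {p_pair:.3f}")
```

The result only holds if module A and module B really are independent, which is why the independence of inputs and outputs matters so much.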
System Level
Connectors
[Figure: Module 1 connected to Module 2 through a Single Interface, and through a Prime Connector and a Redundant Connector, with an Error shown on a connection in each case]
System Level
Code           Error Correction   Error Detection   Comment
Parity         -                  X                 Errors can be masked
N of M         -                  X                 Not suitable for multiple-bit errors
CRC            -                  X                 Good for burst errors
BCH            X                  X                 Easy to decode
Hamming        X                  X                 One of the first error detecting and correcting codes
Reed Solomon   X                  X                 Special case of BCH
Error Correcting and Detecting Codes
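The comment that parity errors "can be masked" can be shown with a small example. This is a minimal sketch using a single even-parity bit appended to a data word; the helper names are made up for illustration.

```python
def even_parity_bit(bits):
    """Parity bit that makes the total number of ones even."""
    return sum(bits) % 2


def encode(data_bits):
    """Append an even-parity bit to the data bits."""
    return list(data_bits) + [even_parity_bit(data_bits)]


def parity_ok(codeword):
    """True when the codeword still contains an even number of ones."""
    return sum(codeword) % 2 == 0


def flip(codeword, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(codeword)
    for p in positions:
        out[p] ^= 1
    return out


word = encode([1, 0, 1, 1])
print("no error detected as good:", parity_ok(word))
print("single-bit error detected:", not parity_ok(flip(word, 2)))
print("double-bit error detected:", not parity_ok(flip(word, 0, 3)))
```

A single flipped bit is caught, but two flipped bits leave the parity unchanged, so the error passes unnoticed. Codes such as Hamming or BCH add more check bits to locate and correct errors.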
System Level
• Used to monitor the health of the system
• Also allows reconstruction of events should a failure occur if a fault
log is provided.
Telemetry and Event Logs
Parameter              Comment
Temperature            Unit and critical component
Current                Drawn from main supply
Voltages               The health of sub-regulated voltages within the system
Redundancy Switching   Reports position of any switches used to route Prime / Redundant signals
Processing Status      Results of CRC, overflow / underflow of calculations, signals out of range, etc.
System
• Most actions require a single command. However, critical
sequences can utilise an ARM / FIRE approach, in which two
commands are needed to cause an action.
Control Sequences
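[Figure: ARM / FIRE command sequence]
– ARM is sent and acknowledged (ACK)
– ARM starts a timeout; if the next valid command is not received within the time limit, an error results
– FIRE is sent and acknowledged (ACK)
– Any command other than the one expected results in an error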
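The ARM / FIRE sequence can be sketched as a small state machine. The class below is an illustrative model only: the class name, the command strings and the "ACK" / "ERROR" return values are assumptions, it handles only the critical command pair, and the timeout is evaluated when the next command arrives rather than by an independent timer as real hardware would do.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    ARMED = "armed"


class ArmFireSequencer:
    """Accepts FIRE only if it directly follows ARM within the timeout."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.state = State.IDLE
        self.armed_at = 0.0

    def handle(self, command: str, now: float) -> str:
        if self.state is State.IDLE:
            if command == "ARM":
                self.state = State.ARMED
                self.armed_at = now  # start of the timeout window
                return "ACK"
            return "ERROR"  # FIRE without a preceding ARM, or unexpected command

        # ARMED: whatever arrives next ends the sequence.
        self.state = State.IDLE
        if now - self.armed_at > self.timeout_s:
            return "ERROR"  # next command arrived too late
        if command == "FIRE":
            return "ACK"  # the critical action would be triggered here
        return "ERROR"  # any command other than the one expected


seq = ArmFireSequencer(timeout_s=2.0)
print(seq.handle("ARM", now=0.0), seq.handle("FIRE", now=1.0))
print(seq.handle("FIRE", now=1.5))
```

Dropping back to the idle state on any error means a failed sequence must be restarted from ARM, so a single spurious command cannot cause the action.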
HW
Most Common failure modes - FPGA
Power supplies failing
• If the FPGA is connected directly to an external module we need to
prevent failure propagation, therefore the FPGA power supplies need
Over-Voltage and Under-Voltage protection
• We also need to consider the thermal aspects of a failure, even if the
FPGA does not have direct external connections
HW
• Protection prevents a supply failure from propagating, provided the
response time is fast enough.
• At the very least, over and under voltage protection should be
included on voltage rails which power external interfaces, unless the
interface is isolated in another manner, e.g. transformer or AC
coupled.
• Under voltage protection is used to detect an event which is
collapsing the voltage rail, and it disables the power supply system
Most Common failure modes - FPGA
FPGA
Six main areas of consideration – besides the device FIT rate
• Clock & Reset
• IO
• Power
• Parametric drift
• Performance
• Single Event Effects
[Figure groups these under IO Effects, Device Level Effects and Device Selection]
FPGA
Like all components, the FPGA has a FIT rate (failures per 10^9 device
hours), which contributes towards the overall failure rate of the system.
Source Xilinx UG116
The basic component
Device Geometry   FIT
28 nm             11
40 nm             7
90 nm             8
130 nm            5
180 nm            14
FPGA
• High reliability systems generally have a longer operating life due to
the increased cost of development.
• Assuming a 20 year operating life with a probability of success of
0.99, the MTBF required to achieve this, from MTBF = -t / ln(P(s)), is
17444193 hours.
The basic component
Device Geometry   FIT   MTBF (hours)   MTBF / required MTBF
28 nm             11    90909091       5
40 nm             7     142857143      8
90 nm             8     125000000      7
130 nm            5     200000000      11
180 nm            14    71428571       4
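The figures on this slide can be reproduced with a short calculation. The sketch below assumes the exponential model P(s) = e^(-t/MTBF), a year of 8766 hours and the usual definition of FIT as failures per 10^9 device hours; the function names are chosen here for illustration. It considers the FPGA alone, not the rest of the system.

```python
import math

HOURS_PER_YEAR = 8766


def required_mtbf(mission_hours: float, p_success: float) -> float:
    """Rearranged P(s) = e^(-t/MTBF): MTBF = -t / ln(P(s))."""
    return -mission_hours / math.log(p_success)


def fit_to_mtbf(fit: float) -> float:
    """FIT is failures per 1e9 device hours, so MTBF = 1e9 / FIT."""
    return 1e9 / fit


# FIT values from the table on the slide.
fit_by_geometry = {"28 nm": 11, "40 nm": 7, "90 nm": 8, "130 nm": 5, "180 nm": 14}

required = required_mtbf(20 * HOURS_PER_YEAR, 0.99)
print(f"required MTBF: {required:.0f} hours")

margins = {}
for geometry, fit in fit_by_geometry.items():
    mtbf = fit_to_mtbf(fit)
    margins[geometry] = int(mtbf // required)  # whole-number margin
    print(f"{geometry:>6}: MTBF = {mtbf:.0f} h, margin = {margins[geometry]}")
```

The device alone comfortably exceeds the required MTBF, which is why the remaining slides concentrate on the failure mechanisms that the FIT rate does not capture.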
FPGA
Looking in more detail
• There are some aspects we cannot mitigate within the FPGA, the most
obvious being power failure (which we can mitigate at the system level)
• Many engineers think introducing triple modular redundancy (TMR) will
address most issues...
– What kind of TMR
– What aspects will it address
• Also other aspects to consider
– Some within the FPGA
– Some require system level addressing
FPGA
Technique               IO    Clock & Reset   Performance   Power   SET   SEU   Parametric Drift   Configuration Corruption
Local TMR               No    No              No            No      No    Yes   No                 Yes
Global TMR              Yes   Yes             No            No      Yes   Yes   No                 Yes
Large Grain TMR         Yes   Yes             No            No      Yes   Yes   No                 Yes
Checking Input Values   Yes   Yes             Yes           No      No    No    No                 No
Checking Results        No    Yes             Yes           No      Yes   Yes   No                 No
Memory EDAC             No    No              No            No      No    Yes   No                 No
Looking in more detail
FPGA
• Triplicates Flip Flops and votes on the output
• Best used in slow designs to stop SET being clocked in
• Offers an area advantage as combinatorial logic is not replicated
• Mitigates upsets in both user and configuration memory
Local TMR
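The voting step at the heart of TMR can be modelled in software. The sketch below is a behavioural illustration of a bitwise two-out-of-three majority voter, not an FPGA implementation; the register values are arbitrary examples.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


good = 0b10110010
upset = good ^ 0b00001000  # one copy with a single flipped bit

print(f"one upset copy:  {vote(good, good, upset):08b}")   # upset is outvoted
print(f"two upset copies: {vote(good, upset, upset):08b}")  # upset wins the vote
```

A single upset copy is masked, but if two copies are upset in the same bit the voter outputs the wrong value, which is why upsets must be corrected before they accumulate.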
FPGA
• Triplicates all resources in the design including IOB and Clocks,
reset trees
• Protects entire design from SEU and SET
• Protects from errors in configuration memory – but it does not
correct them
• More complicated to implement
– Area penalty reduces the size of design that can be implemented
– Validation effort is increased, as we need to ensure the final bit file is
implemented as desired
– Will have an impact upon the power of the design
– Need to manage clock skew carefully
• TMR needs to resynchronise
Global TMR
FPGA
Global TMR
Need to ensure spatial separation in
the implemented device
FPGA
• Triplicates all resources in the design including IOB and Clocks,
reset trees, HOWEVER Flip Flops are not voted upon
• Uses one voter prior to the output of the three modules.
• Unlike Global TMR the FF are not resynchronised
– Can be used with partial reconfiguration to reconfigure an incorrect chain
if required.
• It has minimal domain crossing points, unlike global TMR.
• Mitigates both configuration and user logic errors
• Like global TMR it has area and power penalties
Large Grain TMR
FPGA
Large Grain TMR
FPGA
Synchronisation of Flip Flops
FPGA
Synthesis tools are very sophisticated and are mostly designed to produce
highly optimised logic. This may change the behaviour of the logic from
what you described in your VHDL.
Be careful with replication and fan out: you will often need synthesis
attributes in the VHDL to prevent logic optimisations, for example the
merging of deliberately replicated logic.
Be careful of Synthesis!!!!!
FPGA
• Mitigation techniques such as TMR mask configuration errors but
they do not correct them.
• If the fault is caused by a configuration upset, the configuration
memory must be corrected to clear the issue
• Not all configuration bits are used within a design, therefore some
flips can be tolerated.
SRAM Configuration Memory
FPGA
Configuration Memory
[Figure: logic cell settings held in configuration memory]
– Reset – Async or Sync
– Op selection & carry chain logic
– LUT configuration
– Clock polarity
– Flip flop configuration
FPGA
Configuration Memory
[Figure: two input AND gate implemented in a LUT]
– SEU causes corruption, but it is masked because the affected bit is not used in the design
– SEU causes corruption that impacts design performance
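The difference between a masked and an unmasked configuration upset can be modelled with a look-up table. The sketch below assumes a hypothetical 4-input LUT (16 configuration bits) implementing a two input AND gate, with the two unused inputs tied low so that only four of the sixteen entries are ever addressed. Real devices and tools may map the function differently.

```python
def make_and2_lut():
    """16-entry LUT for out = a AND b, on the two low address bits."""
    return [1 if (addr & 0b11) == 0b11 else 0 for addr in range(16)]


def lut_output(lut, a, b, unused_hi=0):
    """Read the LUT; unused_hi models the two unused inputs (tied low)."""
    addr = (unused_hi << 2) | (b << 1) | a
    return lut[addr]


def flip_bit(lut, addr):
    """Model an SEU by inverting one configuration bit."""
    upset = list(lut)
    upset[addr] ^= 1
    return upset


def behaves_as_and(lut):
    """Check all reachable input combinations against a AND b."""
    return all(lut_output(lut, a, b) == (a & b) for a in (0, 1) for b in (0, 1))


lut = make_and2_lut()
print("upset in unused entry 12 masked:", behaves_as_and(flip_bit(lut, 12)))
print("upset in used entry 3 masked:   ", behaves_as_and(flip_bit(lut, 3)))
```

The first upset never reaches the output because its entry is never addressed; the second changes the gate's behaviour and remains until the configuration bit is rewritten.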
FPGA
Large FPGAs are typically high pin count devices in a BGA or
LGA package.
Careful selection of the IO pins used can therefore increase reliability.
IO Placement
FPGA
• If necessary, there are techniques that can be built into the FPGA
to monitor for pin failures, such that you can
monitor the health of the interconnect.
• For instance, Solder Joint BIST
IO Placement
FPGA
• Solderless interconnect – uses a land grid array (LGA) FPGA in place of a BGA
IO placement
Virtex 5QV Space Grade FPGA.
1000 Cycles -55 to 100 C
Vibration to ECSS-Q-ST-70-08C
FPGA
• If you have a redundant system, control lines will need to be controlled by
both sides i.e. prime and redundant.
• In this case you need to ensure that a failure of one side cannot prevent the
control signal being activated by the correctly functioning side.
• The most popular method of combining digital signals is to use
Schottky diodes with a low forward voltage drop.
Prime and Redundant IO combination
[Figure: circuit simulation in which the Prime Signal (V1) and Redundant Signal (V2) sources, each PULSE(0 5 0 500n 500n 50u), drive the Reliable signal node through Schottky diodes D1 and D2]
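The behaviour of the diode combination can be approximated with a very simple model. The sketch below treats the diodes as ideal apart from a fixed forward drop and assumes a pull-down on the combined node; the 0.3 V drop and the 2.0 V logic-high threshold are illustrative assumptions, not values from the slides.

```python
V_FORWARD = 0.3  # assumed Schottky forward voltage drop, volts
V_IH = 2.0       # assumed logic-high input threshold of the receiver, volts


def diode_or(prime_v: float, redundant_v: float, v_forward: float = V_FORWARD) -> float:
    """Combined node follows the higher input minus one diode drop."""
    return max(max(prime_v, redundant_v) - v_forward, 0.0)


def is_high(voltage: float, threshold: float = V_IH) -> bool:
    return voltage >= threshold


# The redundant side has failed low; the prime side can still assert the signal.
print("prime only:    ", is_high(diode_or(5.0, 0.0)))
print("redundant only:", is_high(diode_or(0.0, 5.0)))
print("neither:       ", is_high(diode_or(0.0, 0.0)))
```

In this model a side that fails low cannot block the other side. A side that fails high would still assert the output, which this sketch does not address.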
FPGA
• As devices age and are subjected to temperature and other environmental
effects, e.g. radiation, the parameters of the device may alter. It is
therefore essential to de-rate the design so that timing is still
achieved at end of life.
• Extra margin on the timing analysis is therefore required: the design
should achieve timing at a clock frequency 10% above the operating
frequency.
Parametric Drift
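The 10% margin translates directly into a tighter clock constraint. The sketch below computes the constraint for a hypothetical 100 MHz design; the function name and the example frequency are illustrative only.

```python
def derated_constraint(operating_mhz: float, margin: float = 0.10):
    """Return (target frequency in MHz, clock period in ns) with margin applied."""
    target_mhz = operating_mhz * (1.0 + margin)
    period_ns = 1000.0 / target_mhz
    return target_mhz, period_ns


target, period = derated_constraint(100.0)
print(f"constrain to {target:.1f} MHz (period {period:.3f} ns)")
```

Timing analysis is then run against the tighter period, leaving the difference as margin for drift over the life of the device.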
Conclusion
• A very brief introduction to reliability and the points of the life cycle at
which to address it.
• One final thing to mention
– Consider failure modes in both directions
• Forwards and backwards
• What appears to be OK when considering failures forwards may actually
introduce issues when considered backward
• Thank you for your attention
QUESTIONS ?