
Design Reliability 101 - Tips, Techniques and How to

#eelive Produced by EE Times

Introduction

To ensure design quality, regardless of the end application, the developing organisation must have a defined engineering governance policy.

Engineering Governance

Engineering Life Cycle

• Phases: Concept, Initiation, Planning, Execute (Doing)
• Stages: Expression of Interest / RFQ, Project Definition, Implementation
• Reviews: Kick Off, SRR, SDR, PDR, CDR / TRR / TRB
• Review outcomes
– Functional Baseline Established
– Requirements Baseline Established
– Demonstration of Preliminary Design against Functional and Requirement Baselines
– Demonstration of design maturity against Functional and Requirement Baselines
– Ensures test articles, facilities, equipment, personnel and test procedures are available
– Review of Test and qualification results against Requirements baseline

Delivery Life Cycle

• Phases: Concept, Initiation, Planning, Execute (Doing), Closure, In-Service
• Stages: Business Case Generation, Business Case Delivery, In-Service Delivery
• Gates (IMPACTS)
– Gate 1 Register of Interest
– Gate 2 Bid/No Bid
– Gate 3a Bid Release
– Gate 3b Contract Review / Project Start
– Gate 4 Delivery Review
– Gate 5 Out of Service Review
– Gate 6 Disposal/Close

Introduction

• System designers and engineers use reliability / safety engineering techniques to address a number of areas
– Regulatory compliance – UL, CE
– Product safety – 2001/95/EC General Product Safety Directive
– Demonstrate suitability for the end application
    • Aerospace / Military
    • Medical
    • Industrial and process
    • Automotive
    • Telecommunications
– Ensure warranty period can be achieved

It is not all just about MTBF

• IEC 61508 – Functional Safety of Electrical / Electronic / Programmable Electronic Safety-related Systems. Introduces the well-known Safety Integrity Level (SIL) and has led to many variants:
– IEC 62279 – Railway Implementation
– IEC 61511 – Process Industry Implementation
– IEC 61513 – Nuclear Industry Implementation
– IEC 62061 – Machinery Implementation
– ISO 26262 – Automotive Electrical / Electronic Systems
• DO-254 – Design Assurance Guidance for Airborne Electronic Hardware

Introduction

Many different standards

Introduction

• The need for reliability engineering differs between applications

The reliability engineering approach can, and should, be tailored to best suit the end application.

Introduction

• Requirements capture
– Baseline of system, Operating life, Environment, Required Probability of success or MTBF
• Design for reliability
– De-rating, Redundancy, ECC, Event Logs, Scrubbing, Command Sequences, PBIT, CBIT and IBIT, Watchdogs, Elapsed Time Counters, Single Points of Failure, Failure Propagation
• Part Stress Analysis
– Determine stresses on components
• Independent Design Reviews
– Peer review to catch gotchas

What techniques can we use

Introduction

• Reliability Block Diagrams

– Graphical representation of the system from a reliability standpoint.

• Failure Mode, Effects and Criticality Analysis (FMECA)
– Addresses failure modes of components, detection mechanisms and effects on the system.
• Accelerated testing
– Allows MTBF to be confirmed in a shorter time scale

• Burn In

– Allows screening out of infant mortality

What techniques can we use

Introduction

• Design Analysis to demonstrate failure mode effects

• Design Analysis to demonstrate stress upon components

• Reliability block diagram

• Mean Time Between Failure

• Probability of Success

The end result

Introduction

What happens when we do not consider it

The basics

If a system has an MTBF of 8766 hours (1 year), does that guarantee the system will operate without error for a year?

What is Mean Time Between Failure and What does it mean

The basics

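No. The MTBF is used as an input to calculate the probability of success [P(s)] of a mission for a specified duration, but on its own it does not ensure the system will operate error free for the MTBF. Assuming a constant failure rate, the probability of success is given by

P(s) = e^(-t/MTBF)

where t is the mission time in hours. For t equal to the MTBF this gives e^(-1), so the probability of surviving one MTBF without failure is only about 0.37.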
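As a minimal sketch of the calculation above (the function name and the sample mission times are illustrative assumptions, not taken from the slides), the probability of success can be evaluated for the 1 year MTBF case:

```python
import math

HOURS_PER_YEAR = 8766  # the slide's figure for one year


def probability_of_success(mission_hours: float, mtbf_hours: float) -> float:
    """P(s) = e^(-t/MTBF), valid only for a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = HOURS_PER_YEAR
    # Mission times of 10%, 50% and 100% of the MTBF (illustrative values).
    for t in (876.6, 4383.0, 8766.0):
        p = probability_of_success(t, mtbf)
        print(f"t = {t:7.1f} h  P(s) = {p:.3f}")
```

At t equal to the MTBF the result is roughly 0.37, which is why an MTBF of one year does not imply a year of failure free operation.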

What is Mean Time Between Failure and What does it mean

The basics

[Figure: Probability of Success for MTBF of 1 Year. X axis: Time (Hours), 0 to 16000. Y axis: Probability of Success, 0 to 1.2.]

What is Mean Time Between Failure and What does it mean

The Basics

Development flow – For a simple module

• Requirements – Function, Availability, Operational Life, Environment
• System – Redundancy, Failure modes, Single Points of Failure, Interfaces, Telemetry & Commands
• HW Architecture – Redundancy Switching, Module IF, Failure Propagation
• HW Design – Defensive Design, Isolation, Component Selection
• FPGA Design – Internal Architecture, Replication (TMR), ECC, IO allocation
• Analysis – Part Stress Analysis, Worst Case Analysis, FMECA, Reliability Block Diagrams

Requirements

• Identification of the stakeholders in the project

– Customer – Most Obvious

– Regulatory – FAA, ESA, NASA

– Internal stakeholders – design, manufacturing, test

• Requirement specification tree

– System Requirements

– Sub System Requirements

– Fully traceable back to the level above

– Eases verification and compliance

A solid base

Requirements

• Requirements shall contain

– System life time

– Operating environments

• Temperature – Maximum, Minimum, Rate of Change

• Shock and Vibration

• Radiation

• EMC / EMI / ESD

– MTBF or even better Probability of Success

– Permissible Down time

A solid base

Requirements

[Figure: requirement specification tree. System Requirement at the top, with Sub System A and Sub System B below it.]

The basics

System Level

• Not just the architecture of the system but also how it is verified and tested.
• Architecture of solution
– Redundancy
    • Inter or Intra module
    • Hot or Cold Redundant and system impacts
– Equipment interfaces – single connector or prime and redundant
– Dangerous Failure modes – can the fault propagate, or worse
– Telemetry and Event logs – health monitoring and recording
– Critical command sequences – Separate ARM & FIRE commands
– Acceptable Error Rate – BER, ECC on RAMs etc.

Let the fun begin

System Level

• Introduced when a module cannot be demonstrated to meet the required probability of success.
• Requires independence between module A and module B, especially on input and output
• Significantly increases the probability of success

Redundancy

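[Figure: a single Module between input (IP) and output (OP), compared with redundant Module A and Module B in parallel between IP and OP.]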
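To illustrate the gain from redundancy, here is a hedged sketch using the standard parallel reliability block relation. It assumes the two modules fail independently, that either one alone can complete the mission, and that the switching is perfect. The 1,000,000 hour module MTBF is a made-up figure for illustration only.

```python
import math


def probability_of_success(mission_hours: float, mtbf_hours: float) -> float:
    """P(s) = e^(-t/MTBF) for a single module with constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)


def redundant_pair(p_a: float, p_b: float) -> float:
    """Success if at least one of two independent modules survives."""
    return 1.0 - (1.0 - p_a) * (1.0 - p_b)


if __name__ == "__main__":
    mission_hours = 20 * 8766        # 20 year operating life
    module_mtbf = 1_000_000          # hypothetical module MTBF in hours
    single = probability_of_success(mission_hours, module_mtbf)
    pair = redundant_pair(single, single)
    print(f"single module : {single:.3f}")
    print(f"A and B pair  : {pair:.3f}")
```

With these assumed numbers a single module gives roughly 0.84 and the redundant pair roughly 0.97. Any shared input, output or power path breaks the independence assumption and reduces the benefit.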

System Level

Connectors

[Figure: Module 1 connected to Module 2 through a Single Interface, and through a Prime Connector plus a Redundant Connector; an Error is marked on the connection in each case.]

System Level

Code           Error Correction   Error Detection   Comment
Parity                            X                 Errors can be masked
N of M                            X                 Not suitable for multiple bit errors
CRC                               X                 Good for burst errors
BCH            X                  X                 Easy to decode
Hamming        X                  X                 One of the first EDC codes
Reed Solomon   X                  X                 Special case of BCH

Error Correcting and Detecting Codes
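The comment that parity errors can be masked is easy to demonstrate. The sketch below assumes a single even parity bit over an 8 bit word; the data value and the flipped bit positions are arbitrary choices for illustration.

```python
def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total number of ones even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """True when the word still matches the parity bit stored with it."""
    return even_parity_bit(word) == stored_parity


data = 0b1011_0010
stored_parity = even_parity_bit(data)

single_flip = data ^ 0b0000_0100   # one bit upset
double_flip = data ^ 0b0001_0100   # two bits upset

print("single bit error detected:", not parity_ok(single_flip, stored_parity))
print("double bit error detected:", not parity_ok(double_flip, stored_parity))
```

The single bit error is flagged, while the double bit error leaves the parity unchanged and passes unnoticed.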

System Level

• Used to monitor the health of the system

• Also allows reconstruction of events should a failure occur, if a fault log is provided.

Telemetry and Event Logs

Parameter              Comment
Temperature            Unit and critical component
Current                Drawn from main supply
Voltages               The health of sub-regulated voltages within the system
Redundancy Switching   Reports position of any switches used to route Prime / Redundant signals
Processing Status      Results of CRC, overflow / underflow of calculations, signals out of range, etc.

System

• Most command sequences require a single command. However, critical sequences can utilise an ARM / FIRE approach, in which two commands are needed to cause an action.

Control Sequences

• ARM received, answered with ACK, timeout started
• If the next valid command is not received within the time limit then error
• FIRE received, answered with ACK
• Any command other than the one expected results in an error
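The ARM / FIRE sequence can be sketched as a small state machine. The class name, the command strings and the one second timeout are assumptions for illustration. As a simplification, the timeout is only checked when the next command arrives, whereas a real implementation would normally also raise the error when the timer expires.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    ARMED = "armed"


class ArmFireSequencer:
    """Two-command critical sequence: ARM then FIRE within a time limit."""

    def __init__(self, timeout_s: float = 1.0):
        self.timeout_s = timeout_s
        self.state = State.IDLE
        self.armed_at = 0.0

    def handle(self, command: str, now_s: float) -> str:
        if self.state is State.IDLE:
            if command == "ARM":
                self.state = State.ARMED
                self.armed_at = now_s      # start the timeout
                return "ACK"
            return "ERROR"                 # e.g. FIRE without a prior ARM

        # ARMED: whatever happens next, the sequence is disarmed again.
        self.state = State.IDLE
        if now_s - self.armed_at > self.timeout_s:
            return "ERROR"                 # FIRE arrived too late
        if command == "FIRE":
            return "ACK"                   # the critical action may proceed
        return "ERROR"                     # unexpected command while armed


seq = ArmFireSequencer(timeout_s=1.0)
print(seq.handle("ARM", 0.0), seq.handle("FIRE", 0.5))   # valid sequence
print(seq.handle("ARM", 2.0), seq.handle("FIRE", 3.5))   # FIRE after timeout
```

Disarming on every outcome means a failed sequence always has to restart from ARM.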

HW

Most Common failure modes - FPGA

Power supplies failing

• If the FPGA is connected directly to an external module we need to prevent failure propagation, therefore the FPGA power supplies need over-voltage and under-voltage protection
• We also need to consider the thermal aspects of a failure, even if the FPGA does not have direct external connections

HW

• Protection prevents this from happening, provided the response time is fast enough.
• At the very least, over and under voltage protection should be included on voltage rails which power external interfaces, unless the interface is isolated in another manner, e.g. transformer or AC coupled.
• Under voltage protection is used to detect an event which is collapsing the voltage rail and to disable the power supply system

Most Common failure modes - FPGA

FPGA

Six main areas of consideration – besides the device FIT rate

• Clock & Reset
• IO
• Power
• Parametric drift
• Performance
• Single Event Effects

Groupings shown on the slide: IO Effects, Device Level Effects, Device Selection

FPGA

Like all components, the FPGA has a FIT rate (failures per 10^9 device hours), which contributes towards the overall failure rate of the system.

Source Xilinx UG116

The basic component

Device Geometry   FIT
28 nm             11
40 nm             7
90 nm             8
130 nm            5
180 nm            14

FPGA

• High reliability systems generally have a longer operating life due to

the increased cost of development.

• Assuming a 20 year operating life with a probability of success of

0.99 the MTBF required to achieve this is 17444193 Hours.

The basic component

Device Geometry   FIT   MTBF (Hours)   Ratio to required MTBF
28 nm             11    90909091       5
40 nm             7     142857143      8
90 nm             8     125000000      7
130 nm            5     200000000      11
180 nm            14    71428571       4
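The figures in this table can be reproduced with a short sketch. It assumes the constant failure rate model P(s) = e^(-t/MTBF), a year of 8766 hours, and the usual definition of FIT as failures per 10^9 device hours. The trailing number in each row appears to be the whole-number ratio of the device MTBF to the required MTBF; that reading is an inference from the arithmetic, not something the slide states. Function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8766


def required_mtbf(mission_hours: float, p_success: float) -> float:
    """Rearranged P(s) = e^(-t/MTBF): MTBF = -t / ln(P(s))."""
    return -mission_hours / math.log(p_success)


def fit_to_mtbf(fit: float) -> float:
    """FIT is failures per 1e9 device hours, so MTBF = 1e9 / FIT."""
    return 1e9 / fit


required = required_mtbf(20 * HOURS_PER_YEAR, 0.99)
print(f"required MTBF: {required:.0f} hours")

for geometry, fit in (("28 nm", 11), ("40 nm", 7), ("90 nm", 8),
                      ("130 nm", 5), ("180 nm", 14)):
    mtbf = fit_to_mtbf(fit)
    print(f"{geometry:>6}  FIT {fit:>2}  MTBF {mtbf:.0f}  ratio {int(mtbf // required)}")
```

On these numbers the device alone meets the requirement with margin, but the margin is consumed once the failure rates of all other components in the system are added.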

FPGA

Looking in more detail

• There are some aspects we cannot mitigate for in the FPGA, most

obvious being power failure (we can mitigate at the system level)

• Many engineers think introducing triple modular redundancy will

address most issues….

– What kind of TMR

– What aspects will it address

• Also other aspects to consider

– Some within the FPGA

– Some require system level addressing

FPGA

Technique               IO    Clock & Reset   Performance   Power   SET   SEU   Parametric Drift   Configuration Corruption
Local TMR               No    No              No            No      No    Yes   No                 Yes
Global TMR              Yes   Yes             No            No      Yes   Yes   No                 Yes
Large Grain TMR         Yes   Yes             No            No      Yes   Yes   No                 Yes
Checking Input Values   Yes   Yes             Yes           No      No    No    No                 No
Checking Results        No    Yes             Yes           No      Yes   Yes   No                 No
Memory EDAC             No    No              No            No      No    Yes   No                 No

Looking in more detail

FPGA

• Triplicates Flip Flops and votes on the output
• Best used in slow designs to stop SETs being clocked in
• Offers an area advantage as combinatorial logic is not replicated
• Mitigates upsets in both user and configuration memory

Local TMR
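The voting step at the heart of TMR is a bitwise two-out-of-three majority. The sketch below models it in software for an 8 bit register; the register value and the upset bit are arbitrary examples. In the FPGA the same function is built from logic on each triplicated flip flop output.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three majority of three register copies."""
    return (a & b) | (a & c) | (b & c)


good = 0b1010_1100
upset = good ^ 0b0000_1000          # single event upset in one copy

# One corrupted copy is out-voted by the other two.
print(bin(majority_vote(good, upset, good)))

# The same bit upset in two copies defeats the voter.
print(bin(majority_vote(upset, upset, good)))
```

The second case shows why upsets should be corrected rather than only masked: errors that accumulate in a second copy can no longer be out-voted.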

FPGA

• Triplicates all resources in the design, including IOBs, clocks and reset trees
• Protects the entire design from SEU and SET
• Protects from errors in configuration memory – but it does not correct them
• More complicated to implement
– Area penalty reduces the size of design that can be implemented
– Validation effort is increased; need to ensure the final bit file is implemented as desired
– Will have an impact upon the power of the design
– Need to manage clock skew carefully
• TMR needs to resynchronise

Global TMR

FPGA

Global TMR

Need to ensure spatial separation in the implemented device

FPGA

• Triplicates all resources in the design, including IOBs, clocks and reset trees; HOWEVER, Flip Flops are not voted upon
• Uses one voter prior to the output of the three modules.
• Unlike Global TMR, the Flip Flops are not resynchronised
– Can be used with partial reconfiguration to reconfigure an incorrect chain if required.
• It has minimal domain crossing points, unlike Global TMR.
• Mitigates both configuration and user logic errors
• Like Global TMR, it has area and power penalties

Large Grain TMR

FPGA

Large Grain TMR

FPGA

Synchronisation of Flip Flops

FPGA

Synthesis tools are very sophisticated and mostly designed to produce highly optimised logic. This may change the implementation of the logic from what you described in your VHDL, for example by optimising away replicated logic.

Be careful with replication and fan out; you may often need to use attributes (VHDL) to prevent logic optimisations.

Be careful of Synthesis!!!!!

FPGA

• Mitigation techniques such as TMR mask configuration errors but

they do not correct them.

• If the fault occurs because it is a configuration upset then it will need

correcting to clear the issue

• Not all configuration bits are used within a design, therefore some

flips can be tolerated.

SRAM Configuration Memory

FPGA

Configuration Memory

• Reset – Async or Sync
• Op Selection & carry chain logic
• LUT Configuration
• Clock Polarity
• Flip Flop Configuration

FPGA

Configuration Memory

Two Input AND Gate

• SEU causes corruption but it is masked, as the bit is not used in the design
• SEU causes corruption which impacts design performance
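The two cases can be modelled with a toy look-up table. This sketch assumes a 4 input LUT (16 configuration bits) implementing a two input AND, with the two unused inputs tied low so that only four of the sixteen entries are ever addressed. That tie-off, and all names used, are illustrative assumptions rather than a description of any particular device.

```python
def lut_read(config_bits, a, b, c=0, d=0):
    """Look up the output bit addressed by the four LUT inputs."""
    address = a | (b << 1) | (c << 2) | (d << 3)
    return config_bits[address]


def and_gate_config():
    """16 configuration bits for a 2 input AND on inputs a and b."""
    bits = [0] * 16
    bits[0b0011] = 1                  # output is 1 only when a = b = 1
    return bits


def flip(config_bits, index):
    """Return a copy of the configuration with one bit upset."""
    upset = list(config_bits)
    upset[index] ^= 1
    return upset


def behaves_as_and(config_bits):
    """Check the LUT over the input combinations the design can reach."""
    return all(lut_read(config_bits, a, b) == (a & b)
               for a in (0, 1) for b in (0, 1))


config = and_gate_config()
print("upset in unused entry 12 masked:", behaves_as_and(flip(config, 12)))
print("upset in used entry 3 masked:   ", behaves_as_and(flip(config, 3)))
```

The upset in entry 12 is never addressed and so has no effect, while the upset in entry 3 changes the gate's output for a = b = 1.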

FPGA

Large FPGAs are typically high pin count devices in a BGA or LGA package.

Careful selection of the IO pins used can therefore increase reliability.

IO Placement

FPGA

• If necessary, there are techniques that can be built into the FPGA to monitor for pin failures, such that you can monitor the health of the interconnect.
• For instance, Solder Joint BIST

IO Placement

FPGA

• Solder-less interconnect - uses land grid FPGA in place of BGA

IO placement

Virtex 5QV Space Grade FPGA.

1000 Cycles -55 to 100 C

Vibration to ECSS-Q-ST-70-08C

FPGA

• If you have a redundant system, control lines will need to be controlled by both sides, i.e. prime and redundant.
• In this case you need to ensure that a failure of one side cannot prevent the control signal being activated by the correctly functioning side.
• The most popular method of combining digital signals is to use low forward voltage drop Schottky diodes.

Prime and Redundant IO combination

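[Figure: diode-OR circuit. Two pulse sources, V1 and V2 (each PULSE(0 5 0 500n 500n 50u)), provide the Prime Signal and the Redundant Signal. Each drives through a Schottky diode (D1, D2) and the diode outputs are joined to form the Reliable signal.]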
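A rough behavioural sketch of the diode combination follows. It assumes ideal diodes with a fixed 0.3 V forward drop and a pull-down on the combined line, both of which are illustrative values rather than figures from the slide.

```python
SCHOTTKY_DROP_V = 0.3   # assumed forward voltage drop


def diode_or(prime_v: float, redundant_v: float,
             drop_v: float = SCHOTTKY_DROP_V) -> float:
    """Combined line follows the higher input minus one diode drop."""
    return max(0.0, max(prime_v, redundant_v) - drop_v)


# Both sides healthy and driving 5 V.
print(diode_or(5.0, 5.0))
# Prime side failed low: the redundant side can still assert the line.
print(diode_or(0.0, 5.0))
# Neither side driving: the pull-down holds the line low.
print(diode_or(0.0, 0.0))
```

This covers the case described above, where one side fails low or unpowered. A side that fails stuck high would still assert the combined line, so that failure mode needs to be addressed separately.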

FPGA

• As devices age and are subjected to temperature and other environmental aspects, e.g. radiation, the parameters of the device may alter. It is therefore essential to ensure the design is de-rated such that timing is still achieved at the end of life.
• Extra margin on the timing analysis is therefore required: the timing analysis should be run with the clock frequency set 10% above the operating frequency

Parametric Drift
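As a worked example of the 10% timing margin, the sketch below converts an operating frequency into the tighter clock period to use as the timing constraint. The 100 MHz operating frequency and the function name are illustrative assumptions.

```python
def constrained_period_ns(operating_mhz: float, margin: float = 0.10) -> float:
    """Clock period in ns when constraining at (1 + margin) x the operating frequency."""
    return 1000.0 / (operating_mhz * (1.0 + margin))


operating_mhz = 100.0                              # hypothetical design clock
nominal_ns = 1000.0 / operating_mhz                # period at the operating frequency
derated_ns = constrained_period_ns(operating_mhz)  # period at 110 MHz

print(f"nominal period    : {nominal_ns:.2f} ns")
print(f"constraint period : {derated_ns:.2f} ns")
print(f"end of life margin: {nominal_ns - derated_ns:.2f} ns")
```

For a 100 MHz design this constrains the tools to about 9.09 ns, leaving roughly 0.91 ns per cycle to absorb drift over the operating life.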

Conclusion

• A very brief introduction to reliability and the aspects of the life cycle at which to address it.

• One final thing to mention

– Consider failure modes in both directions

• Forwards and backwards

• What appears to be OK when considering failures forwards may actually

introduce issues when considered backward

• Thank you for your attention

QUESTIONS ?
