Design Reliability 101 - Tips, Techniques and How to #eelive Produced by EE Times


Introduction

To ensure design quality regardless of end application, the developing organisation must have a defined engineering governance policy.

Engineering Governance

[Figure: the Engineering Life Cycle shown against the Delivery Life Cycle. Recoverable labels are listed below.]

• Engineering Life Cycle
– Phases: Concept, Initiation, Planning, Execute (Doing)
– Stages: Expression of Interest / RFQ, Project Definition, Implementation
– Reviews: Kick Off, SRR, SDR, PDR, CDR / TRR / TRB
– Review annotations:
• Functional Baseline Established
• Requirements Baseline Established
• Demonstration of Preliminary Design against Functional and Requirement Baselines
• Demonstration of design maturity against Functional and Requirement Baselines
• Ensures test articles, facilities, equipment, personnel and test procedures are available
• Review of Test and qualification results against Requirements baseline
• Delivery Life Cycle
– Phases: Concept, Initiation, Planning, Execute (Doing), Closure, In-Service
– Stages: Business Case Generation, Business Case Delivery, In-Service Delivery
– Gates: Gate 1 Register of Interest; Gate 2 Bid / No Bid; Gate 3a Bid Release; Gate 3b Contract Review / Project Start; Gate 4 Delivery Review; Gate 5 Out of Service Review; Gate 6 Disposal / Close
– Other label: IMPACTS

Introduction

• System designers and engineers use reliability / safety engineering techniques to address a number of areas
– Regulatory compliance – UL, CE
– Product safety – 2001/95/EC General Product Safety Directive
– Demonstrate suitability for the end application
• Aerospace / Military
• Medical
• Industrial and process
• Automotive
• Telecommunications
– Ensure warranty period can be achieved

It is not all just about MTBF

Introduction

• IEC 61508 – Functional Safety of Electrical / Electronic / Programmable Electronic Safety-related Systems. Introduces the well-known Safety Integrity Level (SIL) and has led to many sector-specific variants:
– IEC 62279 – Railway Implementation
– IEC 61511 – Process Industry Implementation
– IEC 61513 – Nuclear Industry Implementation
– IEC 62061 – Machinery Implementation
– ISO 26262 – Automotive Electrical / Electronic Systems
• DO-254 – Design Assurance Guidance for Airborne Electronic Hardware

Many different standards

Introduction

• The need for reliability engineering differs from one application to another

Depending upon the end application, your reliability engineering approach can be tailored to suit.

Introduction

• Requirements capture
– Baseline of system, Operating life, Environment, Required Probability of Success or MTBF
• Design for reliability
– De-rating, Redundancy, ECC, Event Logs, Scrubbing, Command Sequences, PBIT, CBIT and IBIT, Watchdogs, Elapsed Time Counters, Single Points of Failure, Failure Propagation
• Part Stress Analysis
– Determine stresses on components
• Independent Design Reviews
– Peer review to catch gotchas

What techniques can we use

Introduction

• Reliability Block Diagrams
– Graphical representation of the system from a reliability standpoint
• Failure Modes, Effects and Criticality Analysis (FMECA)
– Addresses the failure modes of components, their detection mechanisms and their effects on the system
• Accelerated testing
– Allows MTBF to be confirmed on a shorter timescale
• Burn In
– Allows screening out of infant mortality

What techniques can we use


Introduction

• Design Analysis to demonstrate failure mode effects

• Design Analysis to demonstrate stress upon components

• Reliability block diagram

• Mean Time Between Failure

• Probability of Success

The end result


Introduction

What happens when we do not consider it

The basics

If a system has an MTBF of 8766 hours (1 year), does that guarantee the system will operate without error for a year?

What is Mean Time Between Failure and What does it mean

The basics

No. The MTBF is used as an input to calculate the probability of success [P(s)] of a mission for a specified duration, but on its own it does not ensure the system will operate error free for the MTBF. The probability of success is given by

P(s) = e^(-t / MTBF)

where t is the mission time in hours. For a mission time equal to the MTBF, P(s) = e^(-1), or approximately 0.37.

What is Mean Time Between Failure and What does it mean
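
The relationship above can be sketched in a few lines of Python. This is a minimal illustration, assuming a constant failure rate (the assumption behind the exponential formula); the function name `probability_of_success` and the sample mission times are chosen here for illustration only.

```python
import math


def probability_of_success(mission_hours: float, mtbf_hours: float) -> float:
    """P(s) = e^(-t / MTBF), assuming a constant failure rate."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return math.exp(-mission_hours / mtbf_hours)


MTBF_ONE_YEAR = 8766.0  # hours, the figure used on the slide

# Evaluate a tenth of a year, half a year and a full year of operation.
for t in (876.6, 4383.0, 8766.0):
    p = probability_of_success(t, MTBF_ONE_YEAR)
    print(f"t = {t:7.1f} h   P(s) = {p:.3f}")
```

At t equal to the MTBF the result is e^(-1), so a system with a one year MTBF has only about a 37% chance of completing a one year mission without failure.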

The basics

[Plot: Probability of Success for MTBF of 1 Year. x-axis: Time (Hours), 0 to 16000. y-axis: Probability of Success, 0 to 1.2.]

What is Mean Time Between Failure and What does it mean

The Basics

Development flow – For a simple module

• Requirements
– Function
– Availability
– Operational Life
– Environment
• System
– Redundancy
– Failure modes
– Single Points of Failure
– Interfaces
– Telemetry & Commands
• HW Architecture
– Redundancy Switching
– Module IF
– Failure Propagation
• HW Design
– Defensive Design
– Isolation
– Component Selection
• FPGA Design
– Internal Architecture
– Replication (TMR)
– ECC
– IO allocation
• Analysis
– Part Stress Analysis
– Worst Case Analysis
– FMECA
– Reliability Block Diagrams

Requirements

• Identification of the stakeholders in the project
– Customer – most obvious
– Regulatory – FAA, ESA, NASA
– Internal stakeholders – design, manufacturing, test
• Requirement specification tree
– System Requirements
– Sub System Requirements
– Fully traceable back to the level above
– Eases verification and compliance

A solid base

Requirements

• Requirements shall contain
– System lifetime
– Operating environments
• Temperature – Maximum, Minimum, Rate of Change
• Shock and Vibration
• Radiation
• EMC / EMI / ESD
– MTBF or, even better, Probability of Success
– Permissible downtime

A solid base

Requirements

[Diagram: System Requirement flowing down to Sub System A and Sub System B]

The basics

System Level

• Not only the architecture of the system but also how it is verified and tested.
• Architecture of solution
– Redundancy
• Inter or Intra module
• Hot or Cold Redundant and system impacts
– Equipment interfaces – single connector or prime and redundant
– Dangerous failure modes – can the fault propagate, or worse
– Telemetry and Event logs – health monitoring and recording
– Critical command sequences – separate ARM & FIRE commands
– Acceptable Error Rate – BER, ECC on RAMs etc.

Let the fun begin

System Level

• Introduced when a module cannot be demonstrated to meet the required probability of success.
• Requires independence between Module A and Module B, especially on inputs and outputs.
• Significantly increases the probability of success.

Redundancy

[Diagram: a single Module between input (IP) and output (OP), compared with Module A and Module B in parallel between IP and OP]
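
The gain from redundancy can be estimated with the standard reliability block diagram formulas. The sketch below assumes fully independent modules, both powered (hot redundant), and ideal switching between them; real systems will fall short of this because of common-mode failures and the reliability of the switching itself. The function names are illustrative.

```python
import math


def series(*probabilities: float) -> float:
    """All blocks must work: multiply the individual probabilities of success."""
    return math.prod(probabilities)


def parallel(*probabilities: float) -> float:
    """At least one block must work: 1 minus the chance that every block fails."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


single_module = 0.90                      # hypothetical P(s) for one module
redundant_pair = parallel(single_module, single_module)

print(f"single module   P(s) = {single_module:.4f}")
print(f"redundant pair  P(s) = {redundant_pair:.4f}")
```

With a hypothetical module probability of success of 0.90, the redundant pair reaches roughly 0.99 under these assumptions, which is why independence on the inputs and outputs matters: any shared element sits in series with the pair and caps the achievable figure.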

System Level

Connectors

[Diagram: Module 1 connected to Module 2 through a Single Interface, and through a Prime Connector plus a Redundant Connector; an Error is marked on a link in each case]

System Level

Code          Error Correction  Error Detection  Comment
Parity                          X                Errors can be masked
N of M                          X                Not suitable for multiple-bit errors
CRC                             X                Good for burst errors
BCH           X                 X                Easy to decode
Hamming       X                 X                One of the first EDC codes
Reed Solomon  X                 X                Special case of BCH

Error Correcting and Detecting Codes
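
As an illustration of a code that both detects and corrects, the sketch below implements the classic Hamming(7,4) code in Python: four data bits are protected by three parity bits and any single-bit error in the seven-bit word is corrected. It is a behavioural model only, not a description of any particular FPGA memory or EDAC core, and the function names are illustrative.

```python
def hamming74_encode(data):
    """Encode 4 data bits as [p1, p2, d1, p3, d2, d3, d4] (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Return (data bits, syndrome). A non-zero syndrome is the corrected position."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    if syndrome:
        c[syndrome - 1] ^= 1   # flip the single bit the syndrome points at
    return [c[2], c[4], c[5], c[6]], syndrome


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                                # inject a single-bit upset at position 5
data, syndrome = hamming74_decode(word)
print(data, syndrome)                       # [1, 0, 1, 1] 5
```

Two simultaneous bit errors defeat this code (it would mis-correct), which is why practical memories usually add an overall parity bit to obtain single error correct, double error detect behaviour.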

System Level

• Used to monitor the health of the system
• Also allows reconstruction of events should a failure occur, if a fault log is provided.

Telemetry and Event Logs

Parameter             Comment
Temperature           Unit and critical component
Current               Drawn from main supply
Voltages              The health of sub-regulated voltages within the system
Redundancy Switching  Reports position of any switches used to route Prime / Redundant signals
Processing Status     Results of CRC, overflow / underflow of calculations, signals out of range, etc.

System

• Most command sequences require a single command. However, critical sequences can utilise an ARM / FIRE approach, in which two commands are needed to cause an action.

Control Sequences

[Sequence diagram]
• ARM is sent and acknowledged (ACK); a timeout is started
• If the next valid command is not received within the time limit, the result is an error
• FIRE is sent and acknowledged (ACK)
• Any command other than the one expected results in an error
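
The ARM / FIRE sequence can be modelled as a small state machine. The sketch below is a behavioural illustration in Python rather than an implementation from the presentation: the class name, the string responses and the one second default timeout are assumptions, and the timeout is checked when the next command arrives rather than by a running timer.

```python
class ArmFireSequencer:
    """Two-step critical command handler: ARM must be followed by FIRE in time."""

    def __init__(self, timeout_s: float = 1.0):
        self.timeout_s = timeout_s
        self.armed_at = None          # time ARM was accepted, None when idle
        self.fired = False            # set once the critical action is triggered

    def handle(self, command: str, now: float) -> str:
        if self.armed_at is None:
            if command == "ARM":
                self.armed_at = now   # start the timeout window
                return "ACK"
            return "ERROR"            # FIRE (or anything else) without a prior ARM

        # Armed: only FIRE is acceptable, and only inside the timeout window.
        in_time = (now - self.armed_at) <= self.timeout_s
        self.armed_at = None          # every outcome disarms the sequence
        if command == "FIRE" and in_time:
            self.fired = True
            return "ACK"
        return "ERROR"


sequencer = ArmFireSequencer(timeout_s=1.0)
print(sequencer.handle("ARM", now=0.0))    # ACK
print(sequencer.handle("FIRE", now=0.4))   # ACK
```

Passing the time in as an argument keeps the model deterministic and easy to test; in hardware the same behaviour would typically be a counter started by the ARM command.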

HW

Most Common failure modes - FPGA

Power supplies failing

• If the FPGA is connected directly to an external module we need to prevent propagation, therefore the FPGA power supplies need to include Over-Voltage and Under-Voltage protection
• We also need to consider the thermal aspects of a failure, even if the FPGA does not have direct external connections

HW

• Protection prevents this from happening, provided the response time is fast enough.
• At the very least, over and under voltage protection should be included on voltage rails which power external interfaces, unless the interface is isolated in another manner, e.g. transformer or AC coupled.
• Under voltage protection is used to detect an event which is collapsing the voltage rail, and disables the power supply system.

Most Common failure modes - FPGA

FPGA

Six main areas of consideration, besides the device FIT rate

[Diagram labels: Clock & Reset, IO Effects, IO, Power, Device Level Effects, Parametric drift, Performance, Device Selection, Single Event Effects]

FPGA

Like all components, the FPGA has a FIT rate (failures per 10^9 device hours), which contributes towards the overall failure rate of the system.

Source: Xilinx UG116

The basic component

Device Geometry  FIT
28 nm            11
40 nm            7
90 nm            8
130 nm           5
180 nm           14

FPGA

• High reliability systems generally have a longer operating life due to the increased cost of development.
• Assuming a 20 year operating life with a probability of success of 0.99, the MTBF required to achieve this is 17444193 hours.

The basic component

Device Geometry  FIT  MTBF (hours)  MTBF / required MTBF
28 nm            11   90909091      5
40 nm            7    142857143     8
90 nm            8    125000000     7
130 nm           5    200000000     11
180 nm           14   71428571      4
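
The figures above can be reproduced by rearranging P(s) = e^(-t / MTBF) for the MTBF and converting each FIT rate, taking one FIT as one failure per 10^9 device hours and a year as 8766 hours. The sketch below is illustrative; the function names are not from the presentation.

```python
import math

HOURS_PER_YEAR = 8766.0


def required_mtbf(mission_hours: float, p_success: float) -> float:
    """Rearranged P(s) = e^(-t / MTBF): MTBF = -t / ln(P(s))."""
    if not 0.0 < p_success < 1.0:
        raise ValueError("probability of success must be between 0 and 1")
    return -mission_hours / math.log(p_success)


def fit_to_mtbf(fit: float) -> float:
    """One FIT is one failure per 1e9 device hours."""
    return 1e9 / fit


target = required_mtbf(20 * HOURS_PER_YEAR, 0.99)
print(f"required MTBF: {target:,.0f} hours")

# FIT rates per device geometry, as quoted on the slide (source: Xilinx UG116).
for geometry, fit in (("28 nm", 11), ("40 nm", 7), ("90 nm", 8),
                      ("130 nm", 5), ("180 nm", 14)):
    mtbf = fit_to_mtbf(fit)
    print(f"{geometry:>6}  FIT {fit:2d}  MTBF {mtbf:12,.0f} h  ratio {mtbf / target:4.1f}")
```

The required MTBF comes out at roughly 17.4 million hours, and each device MTBF is several times larger, which is the margin shown in the last column of the table. That margin is consumed quickly once the rest of the system's components are added in series.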

FPGA

Looking in more detail

• There are some aspects we cannot mitigate within the FPGA, the most obvious being power failure (we can mitigate this at the system level)
• Many engineers think introducing triple modular redundancy will address most issues...
– What kind of TMR
– What aspects will it address
• There are also other aspects to consider
– Some within the FPGA
– Some require system level addressing

FPGA

Technique              IO   Clock & Reset  Performance  Power  SET  SEU  Parametric Drift  Configuration Corruption
Local TMR              No   No             No           No     No   Yes  No                Yes
Global TMR             Yes  Yes            No           No     Yes  Yes  No                Yes
Large Grain TMR        Yes  Yes            No           No     Yes  Yes  No                Yes
Checking Input Values  Yes  Yes            Yes          No     No   No   No                No
Checking Results       No   Yes            Yes          No     Yes  Yes  No                No
Memory EDAC            No   No             No           No     No   Yes  No                No

Looking in more detail

FPGA

• Triplicates Flip Flops and votes on the output
• Best used in slow designs to stop SET being clocked in
• Offers area advantage as combinatorial logic is not replicated
• Mitigates both user and configuration memory

Local TMR
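
The voting step at the heart of TMR is a bitwise two-out-of-three majority function. The sketch below is a behavioural Python model of a triplicated register and its voter, intended only to show the masking effect; a real design would express this in HDL and protect it from synthesis optimisation. The register width and values are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 vote: each output bit follows at least two of the inputs."""
    return (a & b) | (a & c) | (b & c)


stored = 0b1010                       # hypothetical 4-bit register value
copies = [stored, stored, stored]     # the three replicated flip flop banks

copies[2] ^= 0b1000                   # an SEU flips one bit in one copy
print(f"{majority_vote(*copies):04b}")    # 1010, the upset is masked

copies[1] ^= 0b1000                   # the same bit is upset in a second copy
print(f"{majority_vote(*copies):04b}")    # 0010, the voter is now out-voted
```

The second case shows why masked errors still need to be cleared: the voter hides a single upset, but an uncorrected upset leaves the design one event away from a wrong output.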

FPGA

• Triplicates all resources in the design including IOB, clocks and reset trees
• Protects the entire design from SEU and SET
• Protects from errors in configuration memory, but it does not correct them
• More complicated to implement
– Area penalty reduces the size of design that can be implemented
– Validation effort increases, as we need to ensure the final bit file is implemented as desired
– Will have an impact upon the power of the design
– Need to manage clock skew carefully
• TMR needs to resynchronise

Global TMR

FPGA

Global TMR

Need to ensure spatial separation in the implemented device

FPGA

• Triplicates all resources in the design including IOB, clocks and reset trees; however, Flip Flops are not voted upon
• Uses one voter prior to the output of the three modules.
• Unlike Global TMR the Flip Flops are not resynchronised
– Can be used with partial reconfiguration to reconfigure an incorrect chain if required.
• It has minimal domain crossing points, unlike Global TMR.
• Mitigates both configuration and user logic errors
• Like Global TMR it has area and power penalties

Large Grain TMR


FPGA

Large Grain TMR


FPGA

Synchronisation of Flip Flops

FPGA

Synthesis tools are very sophisticated and mostly designed to produce highly optimised logic. This may change the behaviour of the logic from what you designed in your VHDL.

Be careful with replication and fan out: the tool may remove logic it sees as redundant, so you will often need to consider attributes (VHDL) to prevent logic optimisations.

Be careful of Synthesis!!!!!

FPGA

• Mitigation techniques such as TMR mask configuration errors but they do not correct them.
• If the fault is caused by a configuration upset then the configuration will need correcting to clear the issue.
• Not all configuration bits are used within a design, therefore some bit flips can be tolerated.

SRAM Configuration Memory

FPGA

Configuration Memory

[Diagram: logic cell elements controlled by configuration memory]
• Reset – Async or Sync
• Op Selection & carry chain logic
• LUT Configuration
• Clock Polarity
• Flip Flop Configuration

FPGA

Configuration Memory

Two Input AND Gate

• SEU causes corruption, but it is masked as the affected bit is not used in the design
• SEU causes corruption that impacts design performance
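
The two cases can be demonstrated with a small behavioural model of a look-up table. The sketch below assumes a hypothetical 4-input LUT whose 16 configuration bits implement a two-input AND, with the two unused inputs tied low so that only entries 0 to 3 are ever addressed. This arrangement is an assumption for illustration and is not taken from any specific device.

```python
def lut_read(init_bits: int, inputs: tuple) -> int:
    """Return the LUT output: the configuration bit addressed by the inputs."""
    index = sum(bit << position for position, bit in enumerate(inputs))
    return (init_bits >> index) & 1


def and_gate_truth_table(init_bits: int) -> list:
    """Outputs for (a, b) = 00, 10, 01, 11 with the two unused inputs tied low."""
    return [lut_read(init_bits, (a, b, 0, 0)) for b in (0, 1) for a in (0, 1)]


GOLDEN_INIT = 0b0000_0000_0000_1000   # only entry 3 (a = 1, b = 1) returns 1

print(and_gate_truth_table(GOLDEN_INIT))              # [0, 0, 0, 1]
print(and_gate_truth_table(GOLDEN_INIT ^ (1 << 9)))   # [0, 0, 0, 1] upset in an unused entry
print(and_gate_truth_table(GOLDEN_INIT ^ (1 << 3)))   # [0, 0, 0, 0] upset in a used entry
```

An upset in entry 9 is never addressed, so the function is unchanged, while an upset in entry 3 turns the AND gate into a constant zero until the configuration is rewritten.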

FPGA

Large FPGAs are typically high pin count devices in a BGA or LGA package.

Careful selection of the IO pins used can therefore increase reliability.

IO Placement

FPGA

• If necessary, there are techniques that can be built into the FPGA to monitor for pin failures, so that you can monitor the health of the interconnect.
• For instance, Solder Joint BIST

IO Placement


FPGA

• Solder-less interconnect - uses land grid FPGA in place of BGA

IO placement

Virtex 5QV Space Grade FPGA.

1000 Cycles -55 to 100 C

Vibration to ECSS-Q-ST-70-08C

FPGA

• If you have a redundant system, control lines will need to be controlled by both sides, i.e. prime and redundant.
• In this case you need to ensure that a failure of one side cannot prevent the control signal being activated by the correctly functioning side.
• The most popular method of combining digital signals is to use Schottky diodes with a low forward voltage drop.

Prime and Redundant IO combination

[Circuit: the Prime Signal and Redundant Signal, driven by pulse sources V1 and V2, each PULSE(0 5 0 500n 500n 50u), are combined through Schottky diodes D1 and D2 to form the Reliable signal]

FPGA

• As devices age and are subjected to temperature and other environmental effects (e.g. radiation), the parameters of the device may alter. It is therefore essential to ensure the design is de-rated such that timing is still achieved at the end of life.
• Extra margin on the timing analysis is therefore required: the design should be constrained to a clock frequency 10% above the operating frequency.

Parametric Drift
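
The effect of the 10% margin on the timing constraint is simple arithmetic, sketched below in Python. The 100 MHz operating frequency is a hypothetical example and the function names are illustrative.

```python
def constraint_frequency_mhz(operating_mhz: float, margin: float = 0.10) -> float:
    """Frequency to use in the timing constraint: operating frequency plus margin."""
    return operating_mhz * (1.0 + margin)


def period_ns(frequency_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / frequency_mhz


operating = 100.0                                   # hypothetical operating clock, MHz
constrained = constraint_frequency_mhz(operating)

print(f"operating:   {operating:.1f} MHz, period {period_ns(operating):.3f} ns")
print(f"constrained: {constrained:.1f} MHz, period {period_ns(constrained):.3f} ns")
```

For a 100 MHz clock the constrained period is about 9.09 ns rather than 10 ns, leaving roughly 0.9 ns per cycle to absorb end of life drift.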

Conclusion

• A very brief introduction to reliability and the aspects of the life cycle at which to address it.
• One final thing to mention
– Consider failure modes in both directions
• Forwards and backwards
• What appears to be OK when considering failures forwards may actually introduce issues when considered backwards
• Thank you for your attention

QUESTIONS?