
Mirek Generowicz, I&E Systems Pty Ltd - Australia

Preventing preventable failures

The need for credible failure rates

IEC 61511 Edition 2 (2016) reduced the requirements for hardware fault tolerance. Instead, the new edition emphasises that the 'reliability data used in quantifying the effect of random failures should be credible, traceable, documented and justified, based on field feedback from similar devices used in a similar operating environment' (IEC 61511-1 §11.9.3).

In practice, how do we find 'credible reliability data'?

Answering this question revealed that most failure probability calculations are based on a misunderstanding of probability theory.

'Wrong but useful'

Failure probability calculations depend on estimates of equipment failure rates. The basic assumption has been that during mid-life the failure rate is constant and fixed.

This assumption is WRONG:
• The rate is not fixed
• …but the calculations are still useful

We need a different emphasis

We cannot predict the probability of failure with precision.

• Failure rates are not fixed and constant
• We can estimate PFD within an order of magnitude only
• We can set feasible targets for failure rates
• We must design the system so that the rates can be achieved

In operation we need to monitor the actual PFD achieved:
• We must measure actual failure rates and analyse all failures
• Manage performance to meet target failure rates

…by preventing preventable failures.

Hardware fault tolerance

Calculations of failure probability are not enough, because they depend on very coarse assumptions.

So IEC 61511 also specifies minimum requirements for redundancy, i.e. 'hardware fault tolerance':

'to alleviate potential shortcomings in SIF design that may result due to the number of assumptions made in the design of the SIF, along with uncertainty in the failure rate of components or subsystems used in various process applications'

HFT in IEC 61511 Edition 2

New, simpler requirement for HFT:

SIL 3 is allowed with only two block valves (HFT = 1)
• Previously SIL 3 with only two valves needed SFF > 60%, or 'dominant failure to the safe state' (difficult to achieve)

SIL 2 low demand needs only one block valve (HFT = 0)
• Previously two valves were required for SIL 2

Based on the IEC 61508 'Route 2H' method:
• Introduced in 2010
• Requires a confidence level of ≥ 90% in target failure measures (only 70% needed for Route 1H), e.g. λAVG = number of failures recorded / total time in service

IEC 61511 Ed 2 only requires a 70% confidence level instead of 90%, but similar to Route 2H it requires credible, traceable failure rate data based on field feedback from similar devices in a similar operating environment.

HFT in IEC 61511 Edition 2

Why not 90%?

So what is confidence level?

The uncertainty in the rate of independent events depends on how many events are measured. Confidence level relates to the width of the uncertainty band, which depends on the number of failures recorded.

We can evaluate the uncertainty with the chi-squared function χ²:

α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period

Confidence level does not depend directly on population size.

λα = χ²(α, ν) / (2T)

λ90 compared with λ70:

λ90 = χ²(0.1, 2n + 2) / (2T)

Population                  100         100         100         100
Time in service (hours)     50,000      50,000      50,000      50,000
Device-hours                5,000,000   5,000,000   5,000,000   5,000,000
Number of failures          -           1           10          100
MTBF (hours)                -           5,000,000   500,000     50,000
λAVG per hour               -           2.0E-07     2.0E-06     2.0E-05
λ50 per hour                1.4E-07     3.4E-07     2.1E-06     2.0E-05
λ70 per hour                2.4E-07     4.9E-07     2.5E-06     2.1E-05
λ90 per hour                4.6E-07     7.8E-07     3.1E-06     2.3E-05
λ90 / λ50                   3.3         2.3         1.4         1.1
λ90 / λ70                   1.9         1.6         1.2         1.1
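As a rough illustrative sketch (not part of the original slides; the function name and the use of scipy are assumptions), the table values above can be reproduced from the χ² quantile function:

```python
# Sketch: reproduce the confidence-level table using lambda_a = chi2(a, 2(n+1)) / (2T).
# scipy's chi2.ppf(q, df) returns the q-quantile, so the confidence level is passed directly.
from scipy.stats import chi2

def rate_at_confidence(n_failures: int, device_hours: float, confidence: float) -> float:
    """Failure rate per hour estimated at the given one-sided confidence level."""
    dof = 2 * (n_failures + 1)              # degrees of freedom = 2(n + 1)
    return chi2.ppf(confidence, dof) / (2 * device_hours)

T = 100 * 50_000                            # 100 devices x 50,000 h = 5,000,000 device-hours
for n in (0, 1, 10, 100):
    rates = [rate_at_confidence(n, T, c) for c in (0.50, 0.70, 0.90)]
    print(f"n={n:3d}  λ50={rates[0]:.1e}  λ70={rates[1]:.1e}  λ90={rates[2]:.1e}")
```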


Can we be confident?

Users have collected so much data, over many years, in a variety of environments.

With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.

Surely by now we must have recorded enough failures for most commonly applied devices:

λ90 ≈ λ70 ≈ λAVG

Where can we find credible data?

An enormous effort has been made to determine λ. Some widely used sources:

• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data

Combining data from multiple sources

OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ

Why is the variation so wide if λ is constant?

OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate

λ is NOT constant, so χ² confidence levels do not apply. OREDA provides lower and upper deciles to show the uncertainty.

A long-standing debate

Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.

ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards

Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
  ‒ Apply diagnostics for fault detection and regular periodic testing
  ‒ Apply redundant equipment for fault tolerance

What is random?

Which of these are random?

• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ 0.47

If a football team wins 47 matches out of 100, what is the probability they will win their next match?

What is random?

Dictionary:
'Made, done, or happening without method or conscious decision; haphazard'

Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different.

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Guidance from ISO/TR 12489

Random:
Hardware – electronic components: constant λ
Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
Human – operating under stress, non-routine: variable λ

Systematic – cannot be quantified by a fixed rate:
Hardware – specification, design, installation, operation
Software – specification, coding, testing, modification
Human – depending on training, understanding, attitude

Purely random failure

Only 'catalectic' failures have constant failure rates.

ISO/TR 12489 §3.2.9, catalectic failure: sudden and complete failure.

Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508

IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.

Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.

Random or systematic failures?

Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)

Most failures are partially systematic and partially random.

Quasi-random hardware failures

Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate

If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

e.g. 10% chance that a device picked at random has failed:

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t) ≈ λDU·t (for small λDU·t)

The basic assumption in failure probability

If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply PFDavg ≈ λDU·T / 2
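A minimal numerical sketch of these relationships (illustrative only; the example values are the typical valve figures quoted later in the slides, not prescribed here):

```python
# Sketch: exact vs. linearised PFD for a single device with constant λDU.
import math

lam_du = 0.02   # undetected dangerous failures per year (typical actuated valve assembly)
T = 1.0         # proof-test interval, years

pfd_at_T = 1.0 - math.exp(-lam_du * T)                     # PFD(t) = 1 - exp(-λDU·t) at t = T
pfd_avg_exact = 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)
pfd_avg_linear = lam_du * T / 2                            # PFDavg ≈ λDU·T/2

print(pfd_at_T, pfd_avg_exact, pfd_avg_linear)             # ≈ 0.0198, 0.00993, 0.01
```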

PFD for 1oo2 final elements

In the process sector, SIF PFD is dominated by the PFD of final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.

PFD for 1oo2 final elements

In the process sector, SIF PFD is dominated by the PFD of final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause. Common cause failure always strongly dominates PFD. The PFD depends equally on λDU, β and T.
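As an illustration only (example values assumed from the typical figures given in these slides), the two terms of the 1oo2 approximation can be compared directly:

```python
# Sketch: 1oo2 final-element PFDavg ≈ (1 - β)(λDU·T)²/3 + β·λDU·T/2
lam_du = 0.02   # undetected dangerous failure rate, per year
T = 1.0         # proof-test interval, years
beta = 0.10     # common-cause fraction

independent_term = (1 - beta) * (lam_du * T) ** 2 / 3      # ≈ 1.2e-4
common_cause_term = beta * lam_du * T / 2                  # ≈ 1.0e-3
print(independent_term, common_cause_term, independent_term + common_cause_term)
# The common-cause term is roughly 8x larger, so it dominates the PFD.
```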

Dependence on failure rate

Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures

OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear. Reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU

Sensors typically have MTBF ≈ 300 years, λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years, λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years, λDU ≈ 0.005 per annum.

These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).

Typical feasible values for β

IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% - 15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application

Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed?

For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y (see the numeric sketch below):

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
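A minimal sketch of that check, assuming the typical values quoted above (β = 10%, λDU = 0.02 pa, T = 1 y); not from the original slides:

```python
# Sketch: rough RRF achievable with 1oo1 and 1oo2 final elements at typical values.
lam_du, T, beta = 0.02, 1.0, 0.10

pfd_1oo1 = lam_du * T / 2                                                # ≈ 0.01   → RRF ≈ 100
pfd_1oo2 = (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2    # ≈ 0.0011 → RRF ≈ 900
print(round(1 / pfd_1oo1), round(1 / pfd_1oo2))
# 1oo1 sits at the SIL 2 boundary; 1oo2 falls short of SIL 3 (RRF ≥ 1000)
# unless λDU, β and/or T are reduced.
```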

Design assumptions set performance benchmarks

The probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential

Operators must monitor the failure rates and modes:
• Calculate the actual λDU, compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
  ‒ preventing systematic failures
• Volume of operating experience provides evidence that
  ‒ systematic capability is adequate
  ‒ inspection and maintenance is effective
  ‒ failures are monitored, analysed and understood
  ‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.

Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than


purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
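A brief sketch (illustrative assumption, not part of the paper) showing how the λ90/λ70 ratio depends only on the number of recorded failures:

```python
# Sketch: λ90/λ70 as a function of the number of recorded failures n.
from scipy.stats import chi2

for n in (1, 3, 10, 30):
    dof = 2 * (n + 1)
    ratio = chi2.ppf(0.90, dof) / chi2.ppf(0.70, dof)
    print(f"{n:2d} failures: λ90/λ70 ≈ {ratio:.2f}")
# The ratio falls from about 1.6 at n = 1 to about 1.2 at n = 10,
# consistent with the ~20 % figure quoted above.
```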

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices

• No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
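A quick numeric check of this condition, using the example values above (illustrative sketch only, not from the paper):

```python
# Sketch: compare the common-cause term β·λDU·T/2 with the independent term (λDU·T)²/3.
lam_du, T = 0.02, 1.0
independent_term = (lam_du * T) ** 2 / 3          # ≈ 1.3e-4 (the (1 - β) factor is neglected)
for beta in (0.01, 0.02, 0.05, 0.10):
    common_cause_term = beta * lam_du * T / 2
    print(beta, common_cause_term, common_cause_term > independent_term)
# The common-cause term overtakes the independent term for β above roughly 1.3 %.
```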

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E. P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval


between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that


appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990

[Embedded spreadsheet data omitted: simulated failure records, histogram bins and χ² confidence-level calculations (MTBF, λ50, λ70, λ90 for various example populations) that underlie the charts and tables in the slides.]


The failure rate estimated at a given confidence level is obtained from the χ² (chi-squared) function:

λα = χ²(α, ν) / 2T

where
α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period

Confidence level does not depend directly on population size.

λ90 compared with λ70:

λ90 = χ²(0.1, 2n + 2) / 2T

Population                 100        100        100        100
Time in service (hours)    50 000     50 000     50 000     50 000
Device-hours               5 000 000  5 000 000  5 000 000  5 000 000
Number of failures         0          1          10         100
MTBF (hours)               –          5 000 000  500 000    50 000

λAVG (per hour)            –          2.0E-07    2.0E-06    2.0E-05

λ50 (per hour)             1.4E-07    3.4E-07    2.1E-06    2.0E-05

λ70 (per hour)             2.4E-07    4.9E-07    2.5E-06    2.1E-05

λ90 (per hour)             4.6E-07    7.8E-07    3.1E-06    2.3E-05

λ90 / λ50 =                3.3        2.3        1.4        1.1

λ90 / λ70 =                1.9        1.6        1.2        1.1
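
The table above can be reproduced directly from the χ² formula. A minimal sketch follows, assuming Python with SciPy is available; the function and variable names are illustrative, not part of any cited standard or tool.

    # Failure rate estimated at a given confidence level: lambda_a = chi2(alpha; 2(n+1)) / (2T),
    # where alpha = 1 - confidence level, n = recorded failures, T = device-hours.
    from scipy.stats import chi2

    def failure_rate(n_failures, device_hours, confidence):
        """Failure rate (per hour) not exceeded with the given confidence."""
        dof = 2 * (n_failures + 1)                # degrees of freedom, nu = 2(n + 1)
        return chi2.ppf(confidence, dof) / (2.0 * device_hours)

    T = 100 * 50_000                              # 100 devices for 50 000 h = 5.0E+06 device-hours
    for conf in (0.5, 0.7, 0.9):
        print(f"lambda at {conf:.0%} confidence = {failure_rate(1, T, conf):.1e} per hour")
    # With 1 recorded failure this prints roughly 3.4e-07, 4.9e-07 and 7.8e-07 per hour,
    # matching the second column of the table.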


Can we be confident?
Users have collected so much data over many years in a variety of environments.

With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.

Surely by now we must have recorded enough failures for most commonly applied devices?

λ90 ≈ λ70 ≈ λAVG

Where can we find credible data?

An enormous effort has been made to determine λ. Some widely used sources:

• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software, and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data

Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ
Why is the variation so wide if λ is constant?

OREDA includes all mid-life failures:
– only some are random
– some are systematic (depending on the application)
– site-specific factors influence failure rate

λ is NOT constant, so the χ² confidence levels do not apply. OREDA provides lower and upper deciles to show the uncertainty.

A long-standing debate
Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failure.

ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards

Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
– Apply diagnostics for fault detection and regular periodic testing
– Apply redundant equipment for fault tolerance

What is random? Which of these are random?

• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ 47%.

If a football team wins 47 matches out of 100, what is the probability they will win their next match?

What is random?
Dictionary:

'Made, done, or happening without method or conscious decision; haphazard'

Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different.

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Guidance from ISO/TR 12489
Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
• Human – operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude

Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures?
Pressure transmitters:

• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.

Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:

• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but with a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially,

e.g. a 10% chance that a device picked at random has failed:

PFD(t) = ∫0..t λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t) ≈ λDU·t

The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD(t) ≈ λDU·t

With a proof test interval T the average is simply PFDavg ≈ λDU·T / 2
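
As a quick check of the linear approximation (a sketch only, assuming Python; the numbers are the typical values used later in this paper), the exact average of PFD(t) = 1 − e^(−λt) over a proof-test interval T is 1 + (e^(−λT) − 1)/(λT), which is very close to λT/2 while λT is small:

    import math

    def pfd_avg_exact(lam, T):
        """Time-average of 1 - exp(-lam*t) over one proof-test interval T."""
        x = lam * T
        return 1.0 + (math.exp(-x) - 1.0) / x

    lam_du, T = 0.02, 1.0                 # lambda_DU = 0.02 per annum, annual proof test
    print(pfd_avg_exact(lam_du, T))       # ~0.00993
    print(lam_du * T / 2)                 # 0.01, the approximation PFDavg ~ lambda_DU * T / 2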

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of the final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates the PFD.
The PFD depends equally on λDU, β and T.

Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:

• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude, so the scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years, λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years, λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years, λDU ≈ 0.005 per annum.

These are order of magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be set in orders of magnitude).

Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% – 15% in practice (Ref. SINTEF Report A26922):

• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application
Batch process: T < 0.1 years

– Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
– Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:

• either λDU, β and/or T must be minimised
• needs close attention in design
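
A short sketch (assuming Python; the figures are the typical values quoted on this slide, not measured data) comparing a single final element with a 1oo2 arrangement:

    def pfd_1oo1(lam, T):
        # Single final element: PFDavg ~ lambda_DU * T / 2
        return lam * T / 2

    def pfd_1oo2(lam, T, beta):
        # 1oo2 final elements: PFDavg ~ (1 - beta)(lambda_DU*T)^2 / 3 + beta*lambda_DU*T / 2
        return (1 - beta) * (lam * T) ** 2 / 3 + beta * lam * T / 2

    lam_du, T, beta = 0.02, 1.0, 0.10     # typical values: 0.02 pa, 1 year, 10 %
    print(f"1oo1: PFDavg = {pfd_1oo1(lam_du, T):.1e}")        # 1.0e-02, RRF = 100
    print(f"1oo2: PFDavg = {pfd_1oo2(lam_du, T, beta):.1e}")  # ~1.1e-03, RRF ~ 890
    # 1oo1 sits at the SIL 1 / SIL 2 boundary; 1oo2 reaches SIL 2 but not SIL 3,
    # consistent with the slide above.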

Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential
Operators must monitor the failure rates and modes:

• Calculate actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:

• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:

• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment

– preventing systematic failures
• Volume of operating experience provides evidence that

– systematic capability is adequate
– inspection and maintenance is effective
– failures are monitored, analysed and understood
– failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality:

• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design


PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure

Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
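
The narrowing of the band can be illustrated with the same χ² calculation used for the slides (a sketch only, again assuming Python with SciPy; the names are illustrative):

    from scipy.stats import chi2

    for n in (0, 1, 10, 100):                               # number of recorded failures
        dof = 2 * (n + 1)
        ratio = chi2.ppf(0.9, dof) / chi2.ppf(0.7, dof)     # lambda90 / lambda70
        print(f"{n:>3} failures: lambda90/lambda70 = {ratio:.2f}")
    # The ratio falls from about 1.9 with no recorded failures to roughly 1.2 at 10 failures
    # and about 1.1 at 100 failures.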

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation it is simply not practicable to prevent all failures Inspection and maintenance can never be perfect It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance overhaul and renewal They are characterised by a constant failure rate even though that rate is not fixed

The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10^6 hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

$$PFD_{avg} \approx \frac{(1-\beta)\,(\lambda_{DU}\,T)^2}{3} + \frac{\beta\,\lambda_{DU}\,T}{2}$$

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

$$\beta \ll \lambda_{DU}\,T$$

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
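To make the comparison concrete, the short Python sketch below evaluates the two terms of the 1oo2 approximation for the example values used here (λDU = 0.02 pa, T = 1 year) over a range of β-factors; the function name is illustrative, not from the paper.

```python
# Minimal sketch: compare the independent and common-cause terms of the 1oo2
# approximation PFDavg ~ (1-beta)*(lambda_DU*T)^2/3 + beta*lambda_DU*T/2,
# using the example values from the text (lambda_DU = 0.02 /yr, T = 1 yr).
def pfd_1oo2_terms(lambda_du, test_interval, beta):
    """Return (independent term, common cause term) of the 1oo2 PFD approximation."""
    independent = (1 - beta) * (lambda_du * test_interval) ** 2 / 3
    common_cause = beta * lambda_du * test_interval / 2
    return independent, common_cause

for beta in (0.02, 0.05, 0.10, 0.15):
    ind, ccf = pfd_1oo2_terms(0.02, 1.0, beta)
    print(f"beta = {beta:.0%}: independent = {ind:.1e}, "
          f"common cause = {ccf:.1e}, PFDavg = {ind + ccf:.1e}")
```

Even at β = 2% the common cause term is already larger than the independent term, which is the point made above.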

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.

Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
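As a minimal sketch of this kind of follow-up (not the SINTEF A8788 procedure itself), the Python example below compares an observed failure count against the target implied by the design failure rate, treating failures as a Poisson process; the function name, the example figures and the 5% alarm level are illustrative assumptions.

```python
# Minimal sketch: compare failures observed in operation with the target count
# implied by the failure rate assumed in design, treating failures as Poisson.
from scipy.stats import poisson

def check_against_benchmark(observed, population, years, lambda_target_pa,
                            alarm_level=0.05):
    expected = lambda_target_pa * population * years     # target (expected) count
    # probability of seeing at least 'observed' failures if the target rate holds
    p_at_least = poisson.sf(observed - 1, expected)
    return expected, p_at_least, p_at_least < alarm_level

expected, p, investigate = check_against_benchmark(observed=6, population=50,
                                                   years=2, lambda_target_pa=0.02)
print(f"expected = {expected:.1f}, P(>= observed) = {p:.3f}, investigate = {investigate}")
```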

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions: Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.

Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.

References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA, 'Offshore and Onshore Reliability Data Handbook', Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] 'Safety Equipment Reliability Handbook', 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990


α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period

Confidence level does not depend directly on population size.

$$\lambda_\alpha = \frac{\chi^2(\alpha,\,\nu)}{2T}$$

λ90 compared with λ70

$$\lambda_{90} = \frac{\chi^2(0.1,\,2n + 2)}{2T}$$

Population                  100        100        100        100
Time in service (hours)     50,000     50,000     50,000     50,000
Device-hours                5,000,000  5,000,000  5,000,000  5,000,000
Number of failures          0          1          10         100
MTBF (hours)                –          5,000,000  500,000    50,000
λAVG (per hour)             –          2.0E-07    2.0E-06    2.0E-05
λ50 (per hour)              1.4E-07    3.4E-07    2.1E-06    2.0E-05
λ70 (per hour)              2.4E-07    4.9E-07    2.5E-06    2.1E-05
λ90 (per hour)              4.6E-07    7.8E-07    3.1E-06    2.3E-05
λ90 / λ50                   3.3        2.3        1.4        1.1
λ90 / λ70                   1.9        1.6        1.2        1.1
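The table above can be reproduced with the chi-squared relationship; the Python sketch below does so using scipy (the helper name is illustrative). Note that χ²(α, ν) in the slide notation is the upper-tail critical value, i.e. chi2.ppf(1 − α, ν) in scipy.

```python
# Minimal sketch: lambda at a given confidence level from n failures in T device-hours,
# using lambda_alpha = chi2(alpha, 2(n+1)) / (2T) with alpha = 1 - confidence.
from scipy.stats import chi2

def failure_rate_at_confidence(n_failures, device_hours, confidence):
    """Failure rate (per hour) that is not exceeded with the given confidence."""
    dof = 2 * (n_failures + 1)                  # degrees of freedom, nu = 2(n + 1)
    return chi2.ppf(confidence, dof) / (2.0 * device_hours)

device_hours = 100 * 50_000                     # 100 devices in service for 50,000 h each
for n in (0, 1, 10, 100):
    l50, l70, l90 = (failure_rate_at_confidence(n, device_hours, c)
                     for c in (0.50, 0.70, 0.90))
    print(f"n = {n:3d}:  l50 = {l50:.1e}  l70 = {l70:.1e}  l90 = {l90:.1e}"
          f"  l90/l70 = {l90 / l70:.1f}")
```

As the table shows, the ratio λ90/λ70 falls towards 1 as the number of recorded failures grows, which is the point of the following slide.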

Can we be confident?
Users have collected so much data, over many years, in a variety of environments.

With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.

Surely by now we must have recorded enough failures for most commonly applied devices, so that:

λ90 ≈ λ70 ≈ λAVG

Where can we find credible data?

An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software, and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data

Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ
Why is the variation so wide if λ is constant?

OREDA includes all mid-life failures:
• only some are random
• some are systematic (depending on the application)
• site-specific factors influence failure rate

λ is NOT constant, so χ² confidence levels do not apply. OREDA provides lower and upper deciles to show uncertainty.

A long-standing debate
Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failure.

ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards

Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
  – Apply diagnostics for fault detection and regular periodic testing
  – Apply redundant equipment for fault tolerance

What is random?
Which of these are random?
• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ 0.47

If a football team wins 47 matches out of 100, what is the probability they will win their next match?

What is random?
Dictionary:
'Made, done, or happening without method or conscious decision; haphazard'

Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different.

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Guidance from ISO/TR 12489
Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
• Human – operating under stress, non-routine: variable λ

Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude

Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9, catalectic failure: sudden and complete failure.
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5:
random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.

Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially. PFD(t) is the probability that a device picked at random has failed (e.g. a 10% chance):

$$PFD(t) = \int_0^t \lambda_{DU}\,e^{-\lambda_{DU}\tau}\,d\tau = 1 - e^{-\lambda_{DU}\,t}$$

The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.
SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

$$PFD(t) \approx \lambda_{DU}\,t$$

With a proof test interval T, the average is simply:

$$PFD_{avg} \approx \frac{\lambda_{DU}\,T}{2}$$
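A short Python sketch, using the example rate λDU = 0.02 per annum assumed earlier, illustrates how closely the linear approximation tracks the exponential accumulation while λDU·t is small.

```python
# Minimal sketch: exponential accumulation of undetected failures versus the
# linear approximation PFD(t) ~ lambda_DU * t used when the PFD is small.
import math

lambda_du = 0.02                       # undetected dangerous failure rate, per year
for t in (0.5, 1.0, 2.0, 5.0, 10.0):   # years since the last proof test
    exact = 1.0 - math.exp(-lambda_du * t)
    linear = lambda_du * t
    print(f"t = {t:4.1f} y:  exact = {exact:.4f}  linear = {linear:.4f}")

test_interval = 1.0                    # proof test interval T, years
print(f"PFDavg ~ lambda_DU*T/2 = {lambda_du * test_interval / 2:.3f}")
```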

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:

$$PFD_{avg} \approx \frac{(1-\beta)\,(\lambda_{DU}\,T)^2}{3} + \frac{\beta\,\lambda_{DU}\,T}{2}$$

β is the fraction of failures that share common cause.

PFD for 1oo2 final elements (continued)
Common cause failure always strongly dominates PFD. The PFD depends equally on λDU, β and T.

Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failuresOREDA reports failure rates that are feasible in practiceTypically failure rates vary over at least an order of magnitudeThe scope for reduction in failure is clearReduction in rate by a factor of 10 may be feasible

Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum.
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum.
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum.
These are order of magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).

Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12% - 15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ Low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.

What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBF_DU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
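Worked check (using the typical values above): a single final element gives PFDavg ≈ 0.02 × 1/2 = 0.01 (RRF ≈ 100, only marginal for SIL 2); 1oo2 final elements with β = 10% are dominated by the common cause term, 0.1 × 0.02 × 1/2 = 0.001, so the RRF falls just short of the SIL 3 minimum of 1000 unless β, λDU or T is reduced.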

Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.

Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd
Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
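As a minimal numerical sketch (not part of the original paper; the failure rate and proof test interval below are assumed illustrative values), the exponential accumulation and its time-average over a proof test interval can be evaluated directly:

```python
import math

lam_du = 0.02   # assumed dangerous undetected failure rate, per year (illustrative)
T = 1.0         # assumed proof test interval, years (illustrative)

# Exponential accumulation of undetected failures in the population
def pfd(t):
    return 1.0 - math.exp(-lam_du * t)

# Time-average PFD over one proof test interval (closed form of the integral)
pfd_avg_exact = 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)
pfd_avg_linear = lam_du * T / 2.0   # the usual approximation, valid while lam_du * T << 0.1

print(f"PFD at end of interval : {pfd(T):.4f}")        # ~0.0198
print(f"PFDavg (exact)         : {pfd_avg_exact:.4f}") # ~0.0099
print(f"PFDavg (lam_du*T/2)    : {pfd_avg_linear:.4f}")# 0.0100
```

With these values the linear approximation and the exact average agree to well within the order-of-magnitude precision discussed later in the paper.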

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
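For example (a common approximation, not stated explicitly in this paper), where a SIF comprises sensor, logic solver and final element subsystems, the overall PFD is often taken as the sum of the subsystem PFDs: PFD(SIF) ≈ PFD(sensors) + PFD(logic) + PFD(final elements). With typical process-sector values the final element term dominates.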

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
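A minimal sketch of this estimation (illustrative only, not from the paper) is shown below. It assumes the common time-truncated convention λa = χ²(a; 2n+2) / (2T), where n is the number of recorded failures and T is the cumulative time in service in device-hours; the population and observation period are invented for illustration.

```python
from scipy.stats import chi2

def failure_rate(n_failures, device_hours, confidence):
    """Failure rate estimated at a one-sided confidence level.

    Assumes a time-truncated observation: lambda_a = chi2(a; 2n + 2) / (2 * T).
    """
    dof = 2 * n_failures + 2
    return chi2.ppf(confidence, dof) / (2.0 * device_hours)

# Illustrative population: 1000 devices observed for 5 years (assumed values)
T = 1000 * 5 * 8760.0   # device-hours

for n in (3, 10, 30):
    l70 = failure_rate(n, T, 0.70)
    l90 = failure_rate(n, T, 0.90)
    print(f"n = {n:2d} failures: l70 = {l70:.2e}/h, l90 = {l90:.2e}/h, l90/l70 = {l90 / l70:.2f}")
```

Running this gives λ90/λ70 ratios of roughly 1.4, 1.2 and 1.1 for 3, 10 and 30 recorded failures, consistent with the narrowing uncertainty band described above.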

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors. With fault tolerance in final elements the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
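As a small numerical sketch of the expression above (illustrative only; the β of 10% is an assumed round figure within the range discussed), the two terms can be evaluated to show how strongly the common cause term dominates:

```python
def pfd_avg_1oo2(lam_du, T, beta):
    """Approximate average PFD of a 1oo2 final element pair, split into its two terms."""
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent, common_cause

lam_du = 0.02   # per annum, typical actuated valve assembly (order of magnitude)
T = 1.0         # proof test interval, years
beta = 0.10     # assumed common cause fraction

ind, ccf = pfd_avg_1oo2(lam_du, T, beta)
print(f"independent term  : {ind:.1e}")   # ~1.2e-04
print(f"common cause term : {ccf:.1e}")   # ~1.0e-03, dominates
print(f"PFDavg ~ {ind + ccf:.1e}, RRF ~ {1.0 / (ind + ccf):.0f}")
```

With these values the common cause term is roughly eight times the independent term, and the overall RRF comes out a little under 1000.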


Clearly the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
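One possible way to implement such a comparison (a sketch with invented numbers, not taken from the SINTEF guideline) is to treat the benchmark failure rate as a Poisson mean and check whether the observed number of failures is still credible against it:

```python
from scipy.stats import poisson

# Assumed benchmark and observation window (illustrative values only)
lambda_target = 0.02      # benchmark dangerous undetected failure rate, per device-year
device_years = 150.0      # population size multiplied by observation period
observed_failures = 7

expected = lambda_target * device_years     # target number of failures over the window
# Probability of seeing at least this many failures if the benchmark rate is being met
p_exceed = poisson.sf(observed_failures - 1, expected)

print(f"expected failures : {expected:.1f}")
print(f"observed failures : {observed_failures}")
print(f"P(>= observed | benchmark met) = {p_exceed:.3f}")
if p_exceed < 0.05:
    print("Failure rate appears to exceed the benchmark -> analyse causes, consider compensating measures")
```

With these invented numbers the expected count is 3, so observing 7 failures is improbable under the benchmark and the warning prints.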


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
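As an illustrative figure (assumed numbers, not from the paper): if a detected failure with λDD ≈ 0.1 pa typically leaves the function bypassed for two weeks (≈ 0.04 years) while it is repaired, that alone contributes roughly 0.1 × 0.04 = 0.004 to the average PFD, a large share of a SIL 2 budget.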

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016.

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007.

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010.

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010.

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489.

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015.

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013.

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013.

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008.

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015.

[11] OREDA, Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990.

α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period

Confidence level does not depend directly on population size

λa = χ²(α, ν) / (2T)

λ90 compared with λ70

λ90 = χ²(0.1, 2n + 2) / (2T)

Population                 100        100        100        100
Time in service (hours)    50,000     50,000     50,000     50,000
Device-hours               5,000,000  5,000,000  5,000,000  5,000,000
Number of failures         0          1          10         100
MTBF (hours)               -          5,000,000  500,000    50,000
λAVG (per hour)            -          2.0E-07    2.0E-06    2.0E-05
λ50 (per hour)             1.4E-07    3.4E-07    2.1E-06    2.0E-05
λ70 (per hour)             2.4E-07    4.9E-07    2.5E-06    2.1E-05
λ90 (per hour)             4.6E-07    7.8E-07    3.1E-06    2.3E-05
λ90 / λ50 =                3.3        2.3        1.4        1.1
λ90 / λ70 =                1.9        1.6        1.2        1.1
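The table values can be reproduced with a short calculation. The sketch below is illustrative only; it assumes the SciPy library is available, and the function name failure_rate is ours, not from any of the references:

from scipy.stats import chi2

def failure_rate(n_failures, device_hours, confidence):
    """Failure rate (per hour) estimated at the given confidence level."""
    dof = 2 * (n_failures + 1)              # degrees of freedom, nu = 2(n + 1)
    return chi2.ppf(confidence, dof) / (2 * device_hours)

# Column with 1 failure in 5,000,000 device-hours:
for c in (0.5, 0.7, 0.9):
    print(f"lambda_{int(c * 100)} = {failure_rate(1, 5e6, c):.1e} per hour")
# lambda_50 ~ 3.4e-07, lambda_70 ~ 4.9e-07, lambda_90 ~ 7.8e-07, matching the table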


Can we be confident?
Users have collected so much data over many years, in a variety of environments.

With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.

Surely by now we must have recorded enough failures for most commonly applied devices:

λ90 ≈ λ70 ≈ λAVG
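As a quick check of the arithmetic behind that statement: 100 devices for 10 years is 1,000 device-years, and at a rate of 1 failure per 100 device-years the expected number of failures is (1/100) × 1,000 = 10.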

Where can we find credible data?

An enormous effort has been made to determine λ. Some widely used sources:

• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software, and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data

Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ
Why is the variation so wide if λ is constant?

OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate

λ is NOT constant. χ² function confidence levels do not apply. OREDA provides lower and upper deciles to show uncertainty.

A long-standing debate
Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failure.

ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards

Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance

What is random? Which of these are random?

• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ …

If a football team wins 47 matches out of 100, what is the probability they will win their next match?

What is random?
Dictionary: 'Made, done, or happening without method or conscious decision; haphazard'

Mathematics: A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different.

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Guidance from ISO/TR 12489
Random:
‒ Hardware – electronic components: constant λ
‒ Hardware – mechanical components: non-constant λ (age and wear related failures in mid-life period)
‒ Human – operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
‒ Hardware – specification, design, installation, operation
‒ Software – specification, coding, testing, modification
‒ Human – depending on training, understanding, attitude

Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9 catalectic failure: sudden and complete failure.
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures?
Pressure transmitters:

• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)

Most failures are partially systematic and partially random.

Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood. But many failures cannot be prevented in practice:

• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

e.g. 10% chance that a device picked at random has failed

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t) ≈ λDU·t

The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply PFDavg ≈ λDU·T / 2
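A minimal sketch of this approximation (the numbers below are illustrative assumptions, not values taken from the slides):

import math

lambda_du = 0.02   # assumed dangerous undetected failure rate, per year
T = 1.0            # assumed proof test interval, years

# Exact time-average of PFD(t) = 1 - exp(-lambda*t) over one proof test interval
pfd_avg_exact = 1 - (1 - math.exp(-lambda_du * T)) / (lambda_du * T)
# Linear approximation, valid while lambda*T is small (PFDavg << 0.1)
pfd_avg_linear = lambda_du * T / 2

print(pfd_avg_exact, pfd_avg_linear)   # ~0.0099 vs 0.0100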

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.
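The 1oo2 approximation can be written as a small function. This is a sketch under the same assumptions, with example values rather than data from the presentation:

def pfd_avg_1oo2(lambda_du, T, beta):
    """Average PFD of a 1oo2 final element pair with common-cause fraction beta."""
    independent = (1 - beta) * (lambda_du * T) ** 2 / 3   # independent dangerous failures
    common_cause = beta * lambda_du * T / 2               # common cause contribution
    return independent + common_cause

# With lambda_DU = 0.02 pa, T = 1 year and beta = 10%:
print(pfd_avg_1oo2(0.02, 1.0, 0.10))   # ~1.1e-3, dominated by the common cause term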

Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:

• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear. Reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years, λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years, λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years, λDU ≈ 0.005 per annum.

These are order of magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude).

Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% - 15% in practice (Ref: SINTEF Report A26922):

• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:

• either λDU, β and/or T must be minimised
• needs close attention in design

Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential
Operators must monitor the failure rates and modes:

• Calculate actual λDU, compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:

• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:

• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality:

• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.

Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure

Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
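For example (an illustrative sketch of the standard relationships, not text quoted from the standards): with exponential accumulation, the probability that a single component has failed undetected by time t is PFD(t) = 1 − e^(−λDU·t) ≈ λDU·t, and for a simple series combination the subsystem contributions add, so that PFD of the SIF ≈ PFD of sensors + PFD of logic solver + PFD of final elements.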

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.

The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased; λ90 will be no more than about 20% higher than λ70.
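This dependence is easy to check numerically. The sketch below assumes the SciPy library and simply evaluates the ratio for a few failure counts:

from scipy.stats import chi2

# Ratio of the 90%-confidence to the 70%-confidence failure rate estimate.
# It depends only on the number of recorded failures n, not on the time in service.
for n in (0, 1, 3, 10, 100):
    dof = 2 * (n + 1)
    print(n, chi2.ppf(0.9, dof) / chi2.ppf(0.7, dof))
# roughly 1.9, 1.6, 1.4, 1.2 and 1.1 respectively: the ratio approaches 1 as n grows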

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.

The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.

The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.

Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
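For example (an illustrative sketch of that proportionality, with assumed numbers rather than values from the paper): PFD for detected failures ≈ λDD × MTTR, so a detected failure rate of 0.1 per annum combined with an MTTR of 36 hours (about 0.004 years) contributes roughly 4 × 10⁻⁴ to the PFD.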

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
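A minimal sketch of such a comparison is shown below (SciPy assumed; the device counts, rates and threshold are illustrative assumptions, not values from the SINTEF report):

from scipy.stats import poisson

lambda_design = 0.02 / 8760        # design benchmark, converted to failures per device-hour
device_hours = 50 * 3 * 8760       # e.g. 50 final elements observed for 3 years
observed_failures = 7              # taken from maintenance records (example)

expected = lambda_design * device_hours
# Probability of seeing at least this many failures if the benchmark rate were being achieved
p_value = poisson.sf(observed_failures - 1, expected)
print(f"expected {expected:.1f}, observed {observed_failures}, P >= observed: {p_value:.2f}")
# A small probability (here ~0.03) indicates the design assumption is probably being exceeded,
# so the causes of the failures should be analysed and compensating measures considered.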

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
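Even in that situation the χ² relationship described earlier still gives an upper bound: for example (an illustrative calculation, not a figure from the paper), with zero recorded failures over T device-hours the 90%-confidence estimate is the 0.9 quantile of χ² with 2 degrees of freedom divided by 2T, about 2.3/T, which remains a weak bound until real failures or much more service time accumulate.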

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.

Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990



α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period

Confidence level does not depend directly on population size

λα = χ²(α, ν) / (2T)

λ90 compared with λ70

λ90 = χ²(0.1, 2n + 2) / (2T)

Population                  100         100         100         100
Time in service (hours)     50,000      50,000      50,000      50,000
Device-hours                5,000,000   5,000,000   5,000,000   5,000,000
Number of failures          0           1           10          100
MTBF (hours)                –           5,000,000   500,000     50,000
λAVG (per hour)             –           2.0E-07     2.0E-06     2.0E-05
λ50 (per hour)              1.4E-07     3.4E-07     2.1E-06     2.0E-05
λ70 (per hour)              2.4E-07     4.9E-07     2.5E-06     2.1E-05
λ90 (per hour)              4.6E-07     7.8E-07     3.1E-06     2.3E-05
λ90 / λ50                   3.3         2.3         1.4         1.1
λ90 / λ70                   1.9         1.6         1.2         1.1
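The values in this table can be reproduced with a short script, assuming the SciPy library is available; the population, time in service and failure counts are those shown above.

```python
# Sketch reproducing the table above: the failure rate at confidence level a is
# lambda_a = chi-squared(a, 2(n+1)) / (2T), with T in device-hours.
from scipy.stats import chi2

def lambda_conf(conf: float, n_failures: int, device_hours: float) -> float:
    nu = 2 * (n_failures + 1)                  # degrees of freedom
    return chi2.ppf(conf, nu) / (2.0 * device_hours)

T = 100 * 50_000                               # 5,000,000 device-hours
for n in (0, 1, 10, 100):
    l50, l70, l90 = (lambda_conf(c, n, T) for c in (0.5, 0.7, 0.9))
    print(f"n={n:3d}  l50={l50:.1e}  l70={l70:.1e}  l90={l90:.1e}  "
          f"l90/l70={l90 / l70:.1f}")
```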


Can we be confident?

Users have collected so much data over many years in a variety of environments.

With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.

Surely by now we must have recorded enough failures for most commonly applied devices?

λ90 ≈ λ70 ≈ λAVG

Where can we find credible data

An enormous effort has been made to determine λ. Some widely used sources:

• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook, 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software, and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data

Combining data from multiple sources

OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ

Why is the variation so wide if λ is constant?

OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate

λ is NOT constant, so χ² confidence levels do not apply. OREDA provides lower and upper deciles to show the uncertainty.

A long-standing debate

Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.

ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards

Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
  ‒ Apply diagnostics for fault detection and regular periodic testing
  ‒ Apply redundant equipment for fault tolerance

What is random?

Which of these are random?
• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ 0.47

If a football team wins 47 matches out of 100, what is the probability they will win their next match?

What is random?

Dictionary:
'Made, done, or happening without method or conscious decision; haphazard'

Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different.

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in mid-life period)
• Human – operating under stress, non-routine: variable λ

Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude

Purely random failure

Only 'catalectic' failures have constant failure rates.

ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure

Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508

IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware

Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures?

Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)

Most failures are partially systematic and partially random.

Quasi-random hardware failures

Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate

If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed.

PFD(t) = ∫₀ᵗ λDU e^(−λDU τ) dτ = 1 − e^(−λDU t)

The basic assumption in failure probability

If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU t

With a proof test interval T, the average is simply:

PFDavg ≈ λDU T / 2
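A minimal sketch (assuming a typical valve rate of λDU = 0.02 per annum and a one-year proof test interval) comparing the exponential accumulation with the linear approximation, and the exact average with λDU T / 2:

```python
# Sketch with assumed values: exponential accumulation vs linear approximation.
import math

lam_du = 0.02      # dangerous undetected failures per annum (assumed)
T = 1.0            # proof test interval, years (assumed)

for t in (0.25, 0.5, 1.0):
    exact = 1.0 - math.exp(-lam_du * t)        # PFD(t) = 1 - exp(-lambda*t)
    print(f"t = {t:4.2f} y   exact = {exact:.5f}   approx = {lam_du * t:.5f}")

# Average over the proof test interval: (1/T) * integral of (1 - exp(-lambda*t)) dt
pfd_avg_exact = 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)
print(f"PFDavg exact  = {pfd_avg_exact:.5f}")
print(f"PFDavg approx = {lam_du * T / 2:.5f}")  # lambda_DU * T / 2
```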

PFD for 1oo2 final elements

In the process sector, SIF PFD is dominated by the PFD of the final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)(λDU T)² / 3 + β λDU T / 2

where β is the fraction of failures that share a common cause.

Common cause failure always strongly dominates the PFD. The PFD depends equally on λDU, β and T.
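A short sketch with the typical values assumed elsewhere in this presentation (λDU = 0.02 pa, T = 1 y, β = 10%) shows how strongly the common cause term dominates:

```python
# Sketch with assumed typical values: the two terms of the 1oo2 approximation.
lam_du, T, beta = 0.02, 1.0, 0.10       # per annum, years, common cause fraction

independent_term = (1 - beta) * (lam_du * T) ** 2 / 3
common_cause_term = beta * lam_du * T / 2
pfd_avg = independent_term + common_cause_term

print(f"independent term   = {independent_term:.1e}")    # ~1.2e-4
print(f"common cause term  = {common_cause_term:.1e}")   # ~1.0e-3
print(f"PFDavg             = {pfd_avg:.1e}   RRF ~ {1 / pfd_avg:.0f}")
```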

Dependence on failure rate

Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.

The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures

OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: a reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU

Sensors typically have MTBF ≈ 300 years, λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years, λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years, λDU ≈ 0.005 per annum.

These are order of magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).

Typical feasible values for β

IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% to 15% in practice (Ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on the application

Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed?

For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
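The sketch below (same assumed typical values, with the standard low-demand SIL bands for PFDavg) illustrates why 1oo1 only reaches SIL 1 and 1oo2 reaches SIL 2 with these numbers:

```python
# Sketch: map PFDavg to the low-demand SIL band for 1oo1 and 1oo2 final elements.
def sil_band(pfd_avg: float) -> str:
    if pfd_avg >= 1e-1: return "below SIL 1"
    if pfd_avg >= 1e-2: return "SIL 1"
    if pfd_avg >= 1e-3: return "SIL 2"
    if pfd_avg >= 1e-4: return "SIL 3"
    return "SIL 4"

lam_du, T, beta = 0.02, 1.0, 0.10       # assumed typical values
pfd_1oo1 = lam_du * T / 2
pfd_1oo2 = (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

print(f"1oo1: PFDavg = {pfd_1oo1:.1e} -> {sil_band(pfd_1oo1)}")   # 1.0e-02 -> SIL 1
print(f"1oo2: PFDavg = {pfd_1oo2:.1e} -> {sil_band(pfd_1oo2)}")   # 1.1e-03 -> SIL 2
```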

Design assumptions set performance benchmarks

Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential

Operators must monitor the failure rates and modes:
• Calculate the actual λDU, compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
  ‒ preventing systematic failures
• Volume of operating experience provides evidence that
  ‒ systematic capability is adequate
  ‒ inspection and maintenance is effective
  ‒ failures are monitored, analysed and understood
  ‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.


PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven feet instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
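A small sketch (assuming SciPy is available) confirms how the ratio narrows as more failures are recorded:

```python
# Sketch: lambda_90 / lambda_70 depends only on the number of recorded failures.
from scipy.stats import chi2

for n in (1, 3, 10, 100):
    nu = 2 * (n + 1)
    ratio = chi2.ppf(0.9, nu) / chi2.ppf(0.7, nu)
    print(f"{n:3d} failures: lambda_90 / lambda_70 = {ratio:.2f}")
# roughly 1.6 at 1 failure, 1.4 at 3, 1.2 at 10 and 1.1 at 100
```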

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
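
To illustrate the point numerically, the short Python sketch below evaluates the two terms of the PFDavg approximation for the typical values quoted above (λDU = 0.02 pa, T = 1 year) and an assumed β of 10%; it is a rough check rather than a full SIL verification calculation.

    # Sketch: relative size of the independent and common cause terms in
    # PFDavg ~ (1 - beta)*(lambda_du*T)**2/3 + beta*lambda_du*T/2  (1oo2 final elements)
    lambda_du = 0.02   # undetected dangerous failure rate, per annum (assumed typical value)
    T = 1.0            # proof test interval, years
    beta = 0.10        # common cause factor (assumed 10%)

    independent_term = (1 - beta) * (lambda_du * T) ** 2 / 3
    common_cause_term = beta * lambda_du * T / 2
    pfd_avg = independent_term + common_cause_term

    print(f"independent term : {independent_term:.1e}")   # ~1.2e-04
    print(f"common cause term: {common_cause_term:.1e}")  # ~1.0e-03
    print(f"PFDavg           : {pfd_avg:.1e}")            # ~1.1e-03, dominated by the beta term

With these assumptions the common cause term is roughly eight times larger than the independent term, which is why β, λDU and T deserve equal attention.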

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report a calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. with 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
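
As a sketch of how such a benchmark might be derived, the following Python snippet inverts the same 1oo2 approximation to find the largest λDU that still meets an assumed target of PFDavg = 1×10⁻³ (RRF = 1000, the lower boundary of SIL 3) for a few values of β; the target, the bisection approach and the parameter values are illustrative assumptions only.

    # Sketch: what lambda_DU would be needed for a 1oo2 valve pair to reach an
    # assumed target PFDavg, for a given beta and proof test interval T.
    def pfd_avg_1oo2(lambda_du, T, beta):
        # 1oo2 approximation used in the paper: independent term + common cause term
        return (1 - beta) * (lambda_du * T) ** 2 / 3 + beta * lambda_du * T / 2

    def required_lambda_du(target_pfd, T, beta):
        # Bisection on lambda_du; PFDavg is monotonically increasing in lambda_du.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if pfd_avg_1oo2(mid, T, beta) > target_pfd:
                hi = mid
            else:
                lo = mid
        return (lo + hi) / 2

    for beta in (0.10, 0.05, 0.02):
        lam = required_lambda_du(target_pfd=1e-3, T=1.0, beta=beta)
        print(f"beta = {beta:.0%}: lambda_DU must stay below about {lam:.3f} per annum")

With β at 10% the allowable λDU is about 0.018 pa, barely below the typical 0.02 pa figure, which is consistent with the need to minimise β, λDU and/or T for SIL 3.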

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
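
A minimal sketch of such a comparison is shown below, assuming that failures can be modelled as a Poisson process with the rate assumed in the design; the population size, operating period and observed failure count are invented for the example.

    # Sketch: compare the observed number of dangerous undetected failures with the
    # number expected from the failure rate assumed in the design (Poisson model).
    from math import exp, factorial

    def poisson_cdf(k, mu):
        return sum(mu ** i * exp(-mu) / factorial(i) for i in range(k + 1))

    lambda_assumed = 0.02 / 8760       # assumed design value, per device-hour (0.02 pa)
    devices = 40                       # population of similar valves (assumed)
    hours_in_service = 3 * 8760        # 3 years of operation (assumed)
    observed_failures = 6              # from the site failure records (assumed)

    expected = lambda_assumed * devices * hours_in_service
    # Probability of seeing this many failures or more if the design assumption holds:
    p_at_least_observed = 1 - poisson_cdf(observed_failures - 1, expected)

    print(f"expected failures: {expected:.1f}")
    print(f"observed failures: {observed_failures}")
    print(f"P(>= observed | design rate) = {p_at_least_observed:.2f}")
    if p_at_least_observed < 0.05:
        print("Observed rate is significantly higher than the design benchmark -> analyse causes")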


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
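
Even with few or no recorded failures, an upper confidence bound on the failure rate can still be stated using the χ² relationship described in the presentation material. A minimal sketch, assuming scipy is available and using an invented population of 20 valves over 5 years:

    # Sketch: upper confidence bound on the failure rate when few (or zero)
    # failures have been observed, using lambda_UCL = chi2.ppf(CL, 2n + 2) / (2T).
    from scipy.stats import chi2

    def lambda_upper(n_failures, device_hours, confidence=0.90):
        return chi2.ppf(confidence, 2 * n_failures + 2) / (2 * device_hours)

    device_hours = 20 * 5 * 8760      # e.g. 20 valves for 5 years (assumed population)
    for n in (0, 1, 2):
        lam = lambda_upper(n, device_hours)
        print(f"{n} failures -> lambda_90 <= {lam:.2e} per hour "
              f"({lam * 8760:.3f} per annum)")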

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise the initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and if inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries — Reliability modelling and calculation of safety systems', ISO/TR 12489:2013

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990



α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period

Confidence level does not depend directly on population size

λα = χ²(α, ν) / (2T)

λ90 compared with λ70

λ90 = χ²(0.1, 2n + 2) / (2T)

Population                 100        100        100        100
Time in service (hours)    50,000     50,000     50,000     50,000
Device-hours               5,000,000  5,000,000  5,000,000  5,000,000
Number of failures         -          1          10         100
MTBF (hours)               -          5,000,000  500,000    50,000

λAVG (per hour)            -          2.0E-07    2.0E-06    2.0E-05
λ50 (per hour)             1.4E-07    3.4E-07    2.1E-06    2.0E-05
λ70 (per hour)             2.4E-07    4.9E-07    2.5E-06    2.1E-05
λ90 (per hour)             4.6E-07    7.8E-07    3.1E-06    2.3E-05

λ90 / λ50 =                3.3        2.3        1.4        1.1
λ90 / λ70 =                1.9        1.6        1.2        1.1
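
The λ columns in this table can be reproduced directly from the χ² formula above; a minimal sketch, assuming scipy is available:

    # Sketch: reproduce the lambda_50 / lambda_70 / lambda_90 columns with
    # lambda_CL = chi2.ppf(CL, 2n + 2) / (2T), T in device-hours.
    from scipy.stats import chi2

    T = 100 * 50_000                      # 100 devices x 50,000 h = 5,000,000 device-hours
    for n in (0, 1, 10, 100):
        row = [chi2.ppf(cl, 2 * n + 2) / (2 * T) for cl in (0.5, 0.7, 0.9)]
        print(f"n = {n:3d}: " + "  ".join(f"{lam:.1e}" for lam in row))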


Can we be confident?

Users have collected so much data over many years in a variety of environments.

With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.

Surely by now we must have recorded enough failures for most commonly applied devices?

λ90 ≈ λ70 ≈ λAVG

Where can we find credible data?

An enormous effort has been made to determine λ. Some widely used sources:

• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software, and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data

Combining data from multiple sources

OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ

Why is the variation so wide if λ is constant?

OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate

λ is NOT constant, so χ² confidence levels do not apply. OREDA provides lower and upper deciles to show the uncertainty.

A long-standing debate

Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.

ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards

Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
  ‒ Apply diagnostics for fault detection and regular periodic testing
  ‒ Apply redundant equipment for fault tolerance

What is random?

Which of these are random?
• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ 0.47

If a football team wins 47 matches out of 100, what is the probability that they will win their next match?

What is random?

Dictionary:
Made, done, or happening without method or conscious decision; haphazard.

Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different.

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Random:
Hardware – electronic components: constant λ
Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
Human – operating under stress, non-routine: variable λ

Systematic – cannot be quantified by a fixed rate:
Hardware – specification, design, installation, operation
Software – specification, coding, testing, modification
Human – depending on training, understanding, attitude

Purely random failure

Only 'catalectic' failures have constant failure rates.

ISO/TR 12489 §3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508

IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures?

Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)

Most failures are partially systematic and partially random.

Quasi-random hardware failures

Most hardware failures are not purely random. The failure causes are well known and understood. But many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate

If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

e.g. 10% chance that a device picked at random has failed

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)

The basic assumption in failure probability

If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply PFDavg ≈ λDU·T/2
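
A quick numerical check of this linearisation, using an assumed λDU of 0.02 per annum and a 1-year proof test interval:

    # Sketch: exact vs linearised probability of failure accumulating over a proof
    # test interval, and the resulting time-average PFD.
    from math import exp

    lambda_du = 0.02   # per annum (assumed typical value for an actuated valve)
    T = 1.0            # proof test interval in years
    steps = 1000

    times = [T * i / steps for i in range(steps + 1)]
    exact = [1 - exp(-lambda_du * t) for t in times]
    linear = [lambda_du * t for t in times]

    pfd_avg_exact = sum(exact) / len(exact)
    print(f"max difference exact vs linear: {max(l - e for l, e in zip(linear, exact)):.1e}")
    print(f"PFDavg (numerical average)    : {pfd_avg_exact:.4f}")
    print(f"PFDavg (lambda*T/2)           : {lambda_du * T / 2:.4f}")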

PFD for 1oo2 final elements

In the process sector, SIF PFD is dominated by the PFD of the final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.

Common cause failure always strongly dominates the PFD. The PFD depends equally on λDU, β and T.

Dependence on failure rate

Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.

The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures

OREDA reports failure rates that are feasible in practice. Typically, failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear: reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU

Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum

Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum

Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum

These are order of magnitude estimates – but they are sufficient to show the feasibility of PFD and RRF targets
(risk targets can only be set in orders of magnitude)

Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%

Typically β ≈ 12% - 15% in practice (Ref SINTEF Report A26922)
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T

What architecture is typically needed
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT

SIL 2 needs 1oo2 final elements unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y)

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa
• either λDU, β and/or T must be minimised
• needs close attention in design
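As a rough check of these conclusions, the sketch below (not from the presentation) evaluates the 1oo1 and 1oo2 approximations with the typical values quoted above and maps the results onto the IEC 61511 low-demand SIL bands; the band boundaries used are the usual PFDavg decades.

# Illustrative sketch: which low-demand SIL band do the typical values reach?
def sil_band(pfd_avg):
    """Return the low-demand SIL band implied by PFDavg (0 if outside SIL 1-3)."""
    if 1e-4 <= pfd_avg < 1e-3:
        return 3
    if 1e-3 <= pfd_avg < 1e-2:
        return 2
    if 1e-2 <= pfd_avg < 1e-1:
        return 1
    return 0

beta, lam_du, T = 0.10, 0.02, 1.0   # typical values assumed on this slide
pfd_1oo1 = lam_du * T / 2
pfd_1oo2 = (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2
print(f"1oo1: PFDavg = {pfd_1oo1:.1e} -> SIL {sil_band(pfd_1oo1)}")
print(f"1oo2: PFDavg = {pfd_1oo2:.1e} -> SIL {sil_band(pfd_1oo2)}")

With these inputs the 1oo1 result sits right on the SIL 1/SIL 2 boundary and the 1oo2 result sits within the SIL 2 band, which is why SIL 2 and SIL 3 call for reducing λDU, β and/or T.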

Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

Monitoring performance is essential
Operators must monitor the failure rates and modes
• Calculate actual λDU, compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design

Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd
Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than


purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
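As a simple illustration of that combination (not from the paper), the average PFD of a low-demand SIF is commonly approximated by summing the small PFDavg values of its series subsystems; the subsystem figures below are hypothetical.

# Illustrative sketch: series combination of subsystem PFDavg values.
def sif_pfd(subsystem_pfds):
    """Approximate SIF PFDavg as the sum of small subsystem PFDavg values."""
    return sum(subsystem_pfds)

pfd = sif_pfd([1.5e-3,    # sensor subsystem (hypothetical value)
               1.0e-5,    # logic solver (hypothetical value)
               1.0e-2])   # final element subsystem (hypothetical; usually dominates)
print(f"SIF PFDavg ~ {pfd:.1e}, RRF ~ {1.0 / pfd:.0f}")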

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
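A minimal sketch of this estimate, assuming SciPy is available; the failure count and device-hours are placeholder values, and the relation used is the χ² quantile with 2(n+1) degrees of freedom divided by 2T, consistent with the definition used in the presentation material.

from scipy.stats import chi2

def failure_rate_bound(n_failures, device_hours, confidence):
    """Failure rate (per hour) at the given confidence level: the chi-squared
    quantile with 2*(n+1) degrees of freedom, divided by 2*T."""
    dof = 2 * (n_failures + 1)
    return chi2.ppf(confidence, dof) / (2.0 * device_hours)

# Placeholder example: 10 failures recorded over 5e6 device-hours
lam70 = failure_rate_bound(10, 5.0e6, 0.70)
lam90 = failure_rate_bound(10, 5.0e6, 0.90)
print(f"lambda70 = {lam70:.2e}/h, lambda90 = {lam90:.2e}/h, ratio = {lam90 / lam70:.2f}")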

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed

• The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices

• No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors. With fault tolerance in final elements the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)(λDU T)² / 3 + β λDU T / 2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
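The dominance is easy to see numerically; the short sketch below (illustrative only) evaluates the two terms with the paper's example values λDU = 0.02 pa and T = 1 year over a range of β values.

# Illustrative comparison of the two terms of the 1oo2 PFD approximation.
lam_du, T = 0.02, 1.0   # example values from the paper

for beta in (0.02, 0.05, 0.10, 0.15):
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    print(f"beta = {beta:.0%}: independent term = {independent:.1e}, "
          f"common cause term = {common_cause:.1e}, PFDavg = {independent + common_cause:.1e}")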


Clearly the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
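As a first-order approximation (implied rather than stated in the paper), this contribution can be written as PFD ≈ λDD × MTTR, provided λDD × MTTR ≪ 1.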

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval


between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
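A minimal sketch of such a follow-up check, assuming a simple Poisson model for failure counts and SciPy for the quantile; the alert threshold and the example numbers are illustrative choices, not values from the SINTEF guideline.

from scipy.stats import poisson

def failure_count_check(observed, lam_design_per_year, n_devices, years, quantile=0.90):
    """Compare an observed failure count with the count expected from the design
    failure rate; flag it if it exceeds an upper Poisson quantile."""
    expected = lam_design_per_year * n_devices * years
    threshold = poisson.ppf(quantile, expected)
    return expected, threshold, observed > threshold

expected, threshold, investigate = failure_count_check(
    observed=9, lam_design_per_year=0.02, n_devices=50, years=4)
print(f"expected ~ {expected:.1f}, alert threshold = {threshold:.0f}, investigate = {investigate}")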


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that


appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990

[Supporting spreadsheet omitted: worked examples estimating failure rates at 50%, 70% and 90% confidence (λ50, λ70, λ90) from recorded failure counts and device-hours using the χ² distribution. For a population of 100 devices over 5,000,000 device-hours, with 0, 1, 10 and 100 recorded failures, the calculated ratio λ90/λ70 is approximately 1.9, 1.6, 1.2 and 1.1 respectively.]

Based on the IEC 61508 lsquoRoute 2Hrsquo method bull Introduced in 2010bull Requires a confidence level of ge 90 in target failure measures

(only 70 needed for Route 1H)eg λAVG = No of failures recorded total time in service

IEC 61511 Ed 2 only requires 70 confidence level instead of 90 but similar to Route 2H it requires credible traceable failure rate data based on field feedback from similar devices in a similar operating environment

HFT in IEC 61511 Edition 2

Why not 90

So what is confidence levelThe uncertainty in the rate of independent eventsdepends on how many events are measuredConfidence level relates to the width of the uncertainty bandwhich depends on the number of failures recorded

We can evaluate the uncertainty with the chi-squared function χ2

α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period

Confidence level does not depend directly on population size

λα = χ²(α, ν) / (2T)

λ90 compared with λ70

λ90 = χ²(0.1, 2n + 2) / (2T)

Population                100         100         100         100
Time in service (hours)   50 000      50 000      50 000      50 000
Device-hours              5 000 000   5 000 000   5 000 000   5 000 000
Number of failures        0           1           10          100
MTBF (hours)              –           5 000 000   500 000     50 000
λAVG (per hour)           –           2.0E-07     2.0E-06     2.0E-05
λ50 (per hour)            1.4E-07     3.4E-07     2.1E-06     2.0E-05
λ70 (per hour)            2.4E-07     4.9E-07     2.5E-06     2.1E-05
λ90 (per hour)            4.6E-07     7.8E-07     3.1E-06     2.3E-05
λ90 / λ50                 3.3         2.3         1.4         1.1
λ90 / λ70                 1.9         1.6         1.2         1.1
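As a check on the table above, here is a minimal sketch (assuming Python with scipy is available; the function name rate_at_confidence is illustrative) that evaluates λa = χ²(α, ν)/(2T) with ν = 2(n + 1), taking the χ² quantile at the confidence level:

```python
# Sketch only: estimate the failure rate bound at a given confidence level
# from n recorded failures in T device-hours, using lambda_a = chi2(alpha, nu)/(2T)
# with nu = 2(n + 1).  Assumes scipy is available; chi2.ppf(confidence, dof)
# returns the chi-squared value whose upper-tail probability is alpha = 1 - confidence.
from scipy.stats import chi2

def rate_at_confidence(n_failures, device_hours, confidence):
    """Failure rate (per hour) not exceeded with the given confidence."""
    dof = 2 * (n_failures + 1)                     # nu = 2(n + 1)
    return chi2.ppf(confidence, dof) / (2.0 * device_hours)

T = 100 * 50_000                                   # 100 devices for 50 000 h = 5e6 device-hours
for n in (0, 1, 10, 100):
    l50, l70, l90 = (rate_at_confidence(n, T, c) for c in (0.50, 0.70, 0.90))
    print(f"n={n:3d}  λ50={l50:.1e}  λ70={l70:.1e}  λ90={l90:.1e}  λ90/λ70={l90 / l70:.2f}")
```

The printed values reproduce the columns of the table, showing the λ90/λ70 ratio narrowing as the number of recorded failures grows.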


Can we be confident? Users have collected so much data over many years in a variety of environments.

With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.

Surely by now we must have recorded enough failures for most commonly applied devices?

λ90 ≈ λ70 ≈ λAVG

Where can we find credible data

An enormous effort has been made to determine λ. Some widely used sources:

• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software, and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data

Combining data from multiple sources: OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ: Why is the variation so wide if λ is constant?

OREDA includes all mid-life failures:
– only some are random
– some are systematic (depending on the application)
– site-specific factors influence failure rate

λ is NOT constant, so χ² confidence levels do not apply. OREDA provides lower and upper deciles to show the uncertainty.

A long-standing debate: Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.

ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards

Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
– Apply diagnostics for fault detection and regular periodic testing
– Apply redundant equipment for fault tolerance

What is random? Which of these are random?
• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses P(head) asymp

If a football team wins 47 matches out of 100 what is the probability they will win their next match

What is random? Dictionary:

'Made, done, or happening without method or conscious decision; haphazard'

Mathematics: A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Guidance from ISO/TR 12489:
Random –
• Hardware, electronic components: constant λ
• Hardware, mechanical components: non-constant λ (age and wear related failures in the mid-life period)
• Human, operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
• Hardware: specification, design, installation, operation
• Software: specification, coding, testing, modification
• Human: depending on training, understanding, attitude

Purely random failure: Only 'catalectic' failures have constant failure rates.

ISO/TR 12489 §3.2.9, catalectic failure: sudden and complete failure.
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508 – IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5, random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures? Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.

Quasi-random hardware failures: Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate: If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

e.g. 10% chance that a device picked at random has failed

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t) ≈ λDU·t

The basic assumption in failure probability: If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply PFDavg ≈ λDU·T / 2
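As a quick numerical check (a sketch only; λDU = 0.02 per annum and T = 1 year are the typical values assumed later in this material), the linear approximations can be compared against the exact exponential form:

```python
# Quick numeric check of the low-demand approximations PFD(t) ≈ λDU·t and
# PFDavg ≈ λDU·T/2 against the exact exponential form 1 − e^(−λDU·t).
# Assumed values: λDU = 0.02 per annum, proof-test interval T = 1 year.
import math

lam_du = 0.02   # dangerous undetected failure rate, per year
T = 1.0         # proof-test interval, years

pfd_end_exact = 1.0 - math.exp(-lam_du * T)                         # PFD just before the proof test
pfd_avg_exact = 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)  # average of 1 − e^(−λt) over [0, T]
pfd_avg_linear = lam_du * T / 2.0

print(f"PFD(T):  exact {pfd_end_exact:.5f}, linear {lam_du * T:.5f}")
print(f"PFDavg:  exact {pfd_avg_exact:.5f}, linear {pfd_avg_linear:.5f}")
```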

PFD for 1oo2 final elements: In the process sector, SIF PFD is dominated by the PFD of the final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.

PFD for 1oo2 final elements: In the process sector, SIF PFD is dominated by the PFD of the final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause. Common cause failure always strongly dominates the PFD. The PFD depends equally on λDU, β and T.
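A minimal sketch of this formula evaluated with typical values used later in this material (assumed: β = 0.10, λDU = 0.02 per annum, T = 1 year) shows how strongly the common cause term dominates:

```python
# Evaluating PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2 for a 1oo2 pair of
# final elements.  Assumed typical values: β = 0.10, λDU = 0.02 per annum,
# proof-test interval T = 1 year.
beta, lam_du, T = 0.10, 0.02, 1.0

independent_term = (1 - beta) * (lam_du * T) ** 2 / 3   # ≈ 1.2e-4
common_cause_term = beta * lam_du * T / 2               # ≈ 1.0e-3, dominates

pfd_avg = independent_term + common_cause_term
print(f"independent term  {independent_term:.2e}")
print(f"common cause term {common_cause_term:.2e}")
print(f"PFDavg ≈ {pfd_avg:.1e} (RRF ≈ {1 / pfd_avg:.0f})")
```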

Dependence on failure rate: Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures: OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear; a reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU:

Sensors typically have MTBF ≈ 300 years, λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years, λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years, λDU ≈ 0.005 per annum.

These are order-of-magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude)

Typical feasible values for β: IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% – 15% in practice (Ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates the PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on the application:

Batch process: T < 0.1 years
– results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
– low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed? For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
A rough check of these figures is sketched below.
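The sketch below is a rough feasibility check under the same assumed values (β = 0.10, λDU = 0.02 pa, T = 1 y); the SIL thresholds used are the standard low-demand RRF bands:

```python
# Rough feasibility check for the typical values above (assumed: β = 0.10,
# λDU = 0.02 per annum, T = 1 year): RRF = 1/PFDavg for 1oo1 versus 1oo2
# final elements, placed against the low-demand SIL bands.
beta, lam_du, T = 0.10, 0.02, 1.0

pfd_1oo1 = lam_du * T / 2
pfd_1oo2 = (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

for name, pfd in (("1oo1", pfd_1oo1), ("1oo2", pfd_1oo2)):
    rrf = 1 / pfd
    sil = "SIL 3" if rrf >= 1000 else "SIL 2" if rrf >= 100 else "SIL 1"
    print(f"{name}: PFDavg ≈ {pfd:.1e}, RRF ≈ {rrf:.0f} → {sil} range at best")
```

With these assumptions 1oo1 is marginal for SIL 2 and 1oo2 falls just short of SIL 3, consistent with the statements above.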

Design assumptions set performance benchmarks: The probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential: Operators must monitor the failure rates and modes.
• Calculate the actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance: Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components: Certification and prior use are also important.
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
– preventing systematic failures
• Volume of operating experience provides evidence that:
– systematic capability is adequate
– inspection and maintenance is effective
– failures are monitored, analysed and understood
– failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation: Prevent systematic failures through quality.
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design

Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure

Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.

The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices

• No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
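A minimal sketch of such a follow-up check is shown below, assuming a Poisson model for the failure count and scipy for the tail probability; the function name and the example numbers are illustrative only and are not taken from the SINTEF report:

```python
# Illustrative sketch of a follow-up check in the spirit of the SINTEF guidance:
# compare the number of failures observed in operation against the number expected
# from the failure rate assumed in the design.  A Poisson model for the failure
# count is assumed; the function name and example numbers are hypothetical.
from scipy.stats import poisson

def failures_exceed_target(observed, lam_design, device_years, alpha=0.05):
    """True if the observed count is unlikely (at level alpha) under the design rate."""
    expected = lam_design * device_years
    p_value = poisson.sf(observed - 1, expected)   # P(X >= observed) if the design rate holds
    return p_value < alpha

# Example: sensors with design λDU = 0.003 per annum, 30 devices followed for
# 5 years (150 device-years), 4 dangerous undetected failures recorded.
print(failures_exceed_target(observed=4, lam_design=0.003, device_years=150))
```

A True result would trigger root cause analysis and compensating measures rather than simply an increase in test frequency.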


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing lsquorandomrsquo failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990


So what is confidence level?

The uncertainty in the rate of independent events depends on how many events are measured. Confidence level relates to the width of the uncertainty band, which depends on the number of failures recorded.

We can evaluate the uncertainty with the chi-squared function χ²:

λα = χ²(α, ν) / 2T

α = 1 − confidence level
ν = degrees of freedom, in this case = 2(n + 1)
n = the number of failures in the given time period
T = the number of device-years or device-hours, i.e. the number of devices × the given time period

Confidence level does not depend directly on population size.

λ90 compared with λ70:

λ90 = χ²(0.1, 2n + 2) / 2T

Population                  100         100         100         100
Time in service (hours)     50,000      50,000      50,000      50,000
Device-hours                5,000,000   5,000,000   5,000,000   5,000,000
Number of failures          0           1           10          100
MTBF (hours)                –           5,000,000   500,000     50,000

λAVG (per hour)             –           2.0E-07     2.0E-06     2.0E-05

λ50 (per hour)              1.4E-07     3.4E-07     2.1E-06     2.0E-05

λ70 (per hour)              2.4E-07     4.9E-07     2.5E-06     2.1E-05

λ90 (per hour)              4.6E-07     7.8E-07     3.1E-06     2.3E-05

λ90 / λ50 =                 3.3         2.3         1.4         1.1

λ90 / λ70 =                 1.9         1.6         1.2         1.1
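
As an illustration (not part of the original slides), the λ50 / λ70 / λ90 values in the table above can be reproduced with any statistics library that provides the chi-squared quantile function. A minimal sketch, assuming Python with SciPy is available:

```python
# Minimal sketch: reproduces lambda_a = chi-squared(a, 2n+2) / 2T from the table above.
# Note: chi2.ppf(confidence, dof) is the upper-tail chi-squared value for alpha = 1 - confidence.
from scipy.stats import chi2

def lambda_conf(confidence, failures, device_hours):
    """Failure rate estimated at the given confidence level."""
    dof = 2 * (failures + 1)               # degrees of freedom = 2(n + 1)
    return chi2.ppf(confidence, dof) / (2.0 * device_hours)

T = 100 * 50_000                           # 100 devices x 50,000 h = 5,000,000 device-hours
for n in (0, 1, 10, 100):
    l50 = lambda_conf(0.50, n, T)
    l70 = lambda_conf(0.70, n, T)
    l90 = lambda_conf(0.90, n, T)
    print(f"n={n:3d}  l50={l50:.1e}  l70={l70:.1e}  l90={l90:.1e}  l90/l70={l90 / l70:.1f}")
```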


Can we be confident?

Users have collected so much data over many years, in a variety of environments. With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.

Surely by now we must have recorded enough failures for most commonly applied devices?

λ90 ≈ λ70 ≈ λAVG

Where can we find credible data?

An enormous effort has been made to determine λ. Some widely used sources:

• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software, and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data

Combining data from multiple sources

OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ

Why is the variation so wide, if λ is constant?

OREDA includes all mid-life failures:
– only some are random
– some are systematic (depending on the application)
– site-specific factors influence failure rate

λ is NOT constant, so χ² function confidence levels do not apply. OREDA provides lower and upper deciles to show uncertainty.

A long-standing debate

Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failure.

ISA TR84.00.02, exida, SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards

Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
– Apply diagnostics for fault detection and regular periodic testing
– Apply redundant equipment for fault tolerance

What is random?

Which of these are random?
• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ 0.47.

If a football team wins 47 matches out of 100, what is the probability they will win their next match?

What is random?

Dictionary: 'Made, done, or happening without method or conscious decision; haphazard.'

Mathematics: A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different.

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in mid-life period)
• Human – operating under stress, non-routine: variable λ

Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude

Purely random failure

Only 'catalectic' failures have constant failure rates.

ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure.

Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).
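
To make the 'constant failure rate (exponential law)' point concrete, here is a small simulation sketch of my own (not from the slides, with an assumed failure rate and assumed Weibull shape): exponentially distributed lifetimes give an essentially flat hazard rate across age bands, whereas wear-out lifetimes give a hazard that climbs with age.

```python
# Minimal sketch, assuming Python with NumPy; illustrative only, all parameter values assumed.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
lam = 2.0e-6                                    # assumed failure rate, per hour

exp_life = rng.exponential(1.0 / lam, n)        # constant-rate ("catalectic") failures
wear_life = (1.0 / lam) * rng.weibull(3.0, n)   # wear-out failures, Weibull shape = 3

def hazard(lifetimes, edges):
    """Empirical hazard: failures in each age band / survivors at band start / band width."""
    rates = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        at_risk = np.sum(lifetimes >= lo)
        failed = np.sum((lifetimes >= lo) & (lifetimes < hi))
        rates.append(failed / at_risk / (hi - lo) if at_risk else float("nan"))
    return rates

edges = np.linspace(0.0, 1.0e6, 6)              # five age bands up to 1,000,000 h
print("exponential:", [f"{r:.1e}" for r in hazard(exp_life, edges)])
print("wear-out:   ", [f"{r:.1e}" for r in hazard(wear_life, edges)])
```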

Definition in IEC 61511 and IEC 61508

IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.

Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures?

Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)

Most failures are partially systematic and partially random.

Quasi-random hardware failures

Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate

If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed.

PFD(t) = ∫₀ᵗ λDU e^(−λDU·τ) dτ = 1 − e^(−λDU·t)

The basic assumption in failure probability

If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply:

PFDavg ≈ λDU·T / 2
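
A quick numeric check of the linear approximation (my own sketch, with an assumed λDU and test interval, not from the slides):

```python
# Minimal sketch, assuming Python with NumPy; lambda_DU and T values are assumptions.
import numpy as np

lam_du = 0.02 / 8760.0                   # 0.02 per annum expressed per hour
T = 8760.0                               # proof test interval: 1 year in hours

t = np.linspace(0.0, T, 1000)
pfd_exact = 1.0 - np.exp(-lam_du * t)    # exponential accumulation
pfd_linear = lam_du * t                  # linear approximation, valid while PFD << 1

print(f"max difference over one test interval: {np.max(pfd_linear - pfd_exact):.1e}")
print(f"PFDavg (exact):      {np.mean(pfd_exact):.2e}")
print(f"PFDavg (lam*T/2):    {lam_du * T / 2:.2e}")
```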

PFD for 1oo2 final elements

In the process sector, SIF PFD is dominated by the PFD of final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share common cause.

PFD for 1oo2 final elements

In the process sector, SIF PFD is dominated by the PFD of final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share common cause. Common cause failure always strongly dominates PFD. The PFD depends equally on λDU, β and T.
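
A short sketch of how the 1oo2 approximation behaves, using the typical values quoted elsewhere in this presentation (the parameter values are assumptions for illustration, not recommendations):

```python
# Minimal sketch of the 1oo2 PFD approximation; parameter values are assumptions.
def pfd_1oo2(lam_du, T, beta):
    """PFDavg ~ (1 - beta) * (lam_du * T)**2 / 3 + beta * lam_du * T / 2"""
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent + common_cause

lam_du = 0.02      # dangerous undetected failures per annum (typical valve assembly)
T = 1.0            # proof test interval, years
for beta in (0.02, 0.05, 0.10):
    pfd = pfd_1oo2(lam_du, T, beta)
    print(f"beta={beta:.0%}  PFDavg={pfd:.1e}  RRF={1.0 / pfd:,.0f}")
```

Even with β reduced to 5%, the common cause term still exceeds the independent term, which is the point made on this slide.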

Dependence on failure rate

Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.

The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures

OREDA reports failure rates that are feasible in practice. Typically, failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear: reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU

Sensors typically have MTBF ≈ 300 years, λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years, λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years, λDU ≈ 0.005 per annum.

These are order of magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).

Typical feasible values for β

IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% – 15% in practice (Ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application

Batch process: T < 0.1 years
– Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
– Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed?

For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBF_DU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design

Design assumptions set performance benchmarks

Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential

Operators must monitor the failure rates and modes:
• Calculate actual λDU, compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
– preventing systematic failures
• Volume of operating experience provides evidence that
– systematic capability is adequate
– inspection and maintenance is effective
– failures are monitored, analysed and understood
– failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.

Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1 Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices

2 Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage

3 Using risk-based inspection and condition-based maintenance techniques to

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4 Detailed analysis of all failures to

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
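
This claim can be checked directly with a chi-squared quantile function; a brief sketch assuming Python with SciPy (the device-hours term cancels out of the ratio, so only the number of failures matters):

```python
# Minimal sketch, assuming Python with SciPy; shows that lambda_90 / lambda_70
# depends only on the number of recorded failures (T cancels out of the ratio).
from scipy.stats import chi2

for n in (1, 3, 10, 100):
    dof = 2 * (n + 1)
    ratio = chi2.ppf(0.90, dof) / chi2.ppf(0.70, dof)
    print(f"{n:3d} failures: lambda_90 / lambda_70 = {ratio:.2f}")
```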

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal, steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed

• The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices

• No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors. With fault tolerance in final elements the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
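
As a quick numeric check of that statement (my own arithmetic, using the example values above):

```python
# Minimal sketch checking when the common cause term dominates, using the example values above.
lam_du = 0.02          # dangerous undetected failures, per annum
T = 1.0                # proof test interval, years

independent_term = (lam_du * T) ** 2 / 3.0        # roughly 1.3e-4, ignoring the (1 - beta) factor
for beta in (0.01, 0.02, 0.05, 0.10):
    common_cause_term = beta * lam_du * T / 2.0
    print(f"beta={beta:.0%}: common cause term = {common_cause_term:.1e}  "
          f"({common_cause_term / independent_term:.1f}x the independent term)")
```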

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes a pertinent quote from George E. P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.
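The significance of these rates for the achievable SIL band can be sanity-checked with the simple 1oo1 approximation PFDavg ≈ λDU·T/2. The sketch below uses illustrative values of our choosing and the usual IEC 61511 low demand PFD bands; it is not a substitute for a full SIL verification:

```python
# Sketch: 1oo1 average PFD (PFDavg ~ lambda_DU * T / 2) and the corresponding
# low demand SIL band. Values are illustrative, not prescriptive.

def sil_band(pfd_avg):
    """Map an average PFD to the IEC 61511 low demand SIL band."""
    if pfd_avg < 1e-4:
        return "SIL 4 range"
    if pfd_avg < 1e-3:
        return "SIL 3 range"
    if pfd_avg < 1e-2:
        return "SIL 2 range"
    if pfd_avg < 1e-1:
        return "SIL 1 range"
    return "below SIL 1"

T = 1.0  # proof test interval, years
for name, lambda_du in (("actuated valve", 0.02), ("contactor / circuit breaker", 0.005)):
    pfd = lambda_du * T / 2
    print(f"{name}: PFDavg ~ {pfd:.4f} -> {sil_band(pfd)}")
```

With a 1 year test interval the valve sits right at the SIL 1 / SIL 2 boundary, while the contactor sits comfortably in the SIL 2 range, which is consistent with the discussion that follows.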

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval


between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
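A minimal sketch of this kind of follow-up is shown below. The population, period and counts are made-up example values, not figures from the SINTEF report; the point is simply that the expected number of failures is the assumed rate multiplied by the accumulated device-time:

```python
# Sketch: compare failures observed in operation against the benchmark assumed in design.
# All numbers below are illustrative assumptions.

design_lambda_du = 0.02    # failure rate assumed in the PFD calculation, per device-year
population = 40            # number of devices in service
period_years = 2.0         # observation period
observed_failures = 3      # dangerous undetected failures found by proof testing

device_years = population * period_years
expected_failures = design_lambda_du * device_years
measured_rate = observed_failures / device_years

print(f"Expected failures over the period: {expected_failures:.1f}")
print(f"Observed failures: {observed_failures}")
print(f"Measured rate: {measured_rate:.3f} per device-year (target {design_lambda_du})")

if measured_rate > design_lambda_du:
    print("Above benchmark: analyse failure causes and consider compensating measures.")
```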


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that


appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA, 'Offshore and Onshore Reliability Data Handbook', Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] 'Safety Equipment Reliability Handbook', 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990


λ90 compared with λ70

λ90 = χ²(0.1, 2n + 2) / (2T)

where n is the number of failures recorded and T is the total time in service.

Population                    100         100         100         100
Time in service (hours)       50,000      50,000      50,000      50,000
Device-hours                  5,000,000   5,000,000   5,000,000   5,000,000
Number of failures            0           1           10          100
MTBF (hours)                  –           5,000,000   500,000     50,000

λAVG (per hour)               –           2.0E-07     2.0E-06     2.0E-05
λ50 (per hour)                1.4E-07     3.4E-07     2.1E-06     2.0E-05
λ70 (per hour)                2.4E-07     4.9E-07     2.5E-06     2.1E-05
λ90 (per hour)                4.6E-07     7.8E-07     3.1E-06     2.3E-05

λ90 / λ50 =                   3.3         2.3         1.4         1.1
λ90 / λ70 =                   1.9         1.6         1.2         1.1
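The values in this table can be reproduced from the χ² relationship above. The sketch below uses scipy (our choice of tool, not one used in the paper); note that chi2.ppf takes the cumulative probability, so the 90% confidence value corresponds to the upper-tail 0.1 point written above.

```python
# Sketch: reproduce the table above using the chi-squared relationship
#   lambda_CL = (chi-squared percentile at CL, with 2n + 2 degrees of freedom) / (2T)
# where n is the number of failures and T is the accumulated device-hours.
from scipy.stats import chi2

def rate_at_confidence(failures, device_hours, confidence):
    """Failure rate (per hour) that will not be exceeded at the given confidence level."""
    return chi2.ppf(confidence, 2 * failures + 2) / (2 * device_hours)

device_hours = 100 * 50_000   # 100 devices, 50,000 hours each

for n in (0, 1, 10, 100):
    l50 = rate_at_confidence(n, device_hours, 0.50)
    l70 = rate_at_confidence(n, device_hours, 0.70)
    l90 = rate_at_confidence(n, device_hours, 0.90)
    print(f"n = {n:>3}: l50 = {l50:.1e}, l70 = {l70:.1e}, "
          f"l90 = {l90:.1e}, l90/l70 = {l90 / l70:.1f}")
```

As the number of recorded failures grows, the ratio λ90/λ70 tends towards 1, which is the point the table is making.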


Can we be confident?
Users have collected so much data, over many years, in a variety of environments.

With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.

Surely by now we must have recorded enough failures for most commonly applied devices?

λ90 ≈ λ70 ≈ λAVG

Where can we find credible data?

An enormous effort has been made to determine λ. Some widely used sources:

• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database, incorporated into exSILentia software, and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data

Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide. Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ
Why is the variation so wide if λ is constant?

OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate

λ is NOT constant, so the χ² function confidence levels do not apply. OREDA provides lower and upper deciles to show uncertainty.

A long-standing debate
Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failures.

ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards

Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance

What is random?
Which of these are random?

• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ 0.47.

If a football team wins 47 matches out of 100, what is the probability they will win their next match?

What is random?
Dictionary:
'Made, done, or happening without method or conscious decision; haphazard.'

Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Guidance from ISO/TR 12489
Random:
Hardware – electronic components: constant λ
Hardware – mechanical components: non-constant λ (age and wear related failures in mid-life period)
Human – operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
Hardware – specification, design, installation, operation
Software – specification, coding, testing, modification
Human – depending on training, understanding, attitude

Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9, catalectic failure: sudden and complete failure.
Note 1 to entry: [...] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5, random hardware failure: failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures?
Pressure transmitters:

• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)

Most failures are partially systematic and partially random.

Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood. But many failures cannot be prevented in practice:

• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

e.g. 10% chance that a device picked at random has failed

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)

The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply PFDavg ≈ λDU·T / 2
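A quick numerical check (a sketch only, with illustrative values of our choosing) shows how close the linear approximation is in the region where PFDavg < 0.1:

```python
# Sketch: exact average PFD over a proof test interval vs the linear approximation.
# For a constant undetected failure rate, PFD(t) = 1 - exp(-lambda*t), so averaging
# over [0, T] gives PFDavg = 1 - (1 - exp(-lambda*T)) / (lambda*T) ~ lambda*T/2.
import math

lambda_du = 0.02   # per year (illustrative value)
for T in (1.0, 2.0, 4.0, 10.0):
    x = lambda_du * T
    exact = 1 - (1 - math.exp(-x)) / x
    approx = x / 2
    print(f"T = {T:>4} y: exact PFDavg = {exact:.4f}, lambda_DU*T/2 = {approx:.4f}")
```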

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of the final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of the final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.

Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain.
Failure rates can be controlled and reduced.

Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear: reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years, λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years, λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years, λDU ≈ 0.005 per annum.

These are order of magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets.

(risk targets can only be in orders of magnitude)

Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% – 15% in practice (Ref SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design

Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate actual λDU, compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design


PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than


purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
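As an illustration of this model (a minimal sketch, not taken from the paper; the function names and the simple series-sum combination rule are assumptions used only for illustration), the exponential accumulation of undetected failures and the resulting average PFD over a proof-test interval can be computed as follows:

```python
import math

def pfd_instantaneous(lam_du, t):
    """Probability that a component has failed undetected by time t,
    assuming a constant dangerous undetected failure rate lam_du."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_average_1oo1(lam_du, test_interval):
    """Average PFD over a proof-test interval T for a single (1oo1) component:
    (1/T) * integral of (1 - exp(-lam*t)) dt, close to lam*T/2 when lam*T << 1."""
    x = lam_du * test_interval
    return 1.0 - (1.0 - math.exp(-x)) / x

def pfd_series(component_pfds):
    """First-order approximation: for components that must all work,
    the overall PFD is roughly the sum of the individual average PFDs."""
    return sum(component_pfds)

# Example using the typical valve figure quoted later in the paper:
lam_du, T = 0.02, 1.0                 # 0.02 failures per annum, tested yearly
print(pfd_instantaneous(lam_du, T))   # ~0.0198 just before the proof test
print(pfd_average_1oo1(lam_du, T))    # ~0.0099, close to lam_du*T/2 = 0.01
```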

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
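A minimal sketch of this estimation (assuming the common time-truncated χ² convention with 2n + 2 degrees of freedom; the exact convention depends on how the observation period ends, and the function name is illustrative only):

```python
from scipy.stats import chi2

def failure_rate_at_confidence(n_failures, time_in_service_hours, confidence):
    """Failure rate estimated at a one-sided confidence level from n recorded
    failures over a cumulative time in service, using the chi-squared relation
    lambda_a = chi2.ppf(a, 2n + 2) / (2 * time_in_service)."""
    dof = 2 * n_failures + 2
    return chi2.ppf(confidence, dof) / (2.0 * time_in_service_hours)

# The ratio lambda_90 / lambda_70 depends only on the number of failures:
for n in (1, 3, 10, 100):
    l90 = failure_rate_at_confidence(n, 1.0e6, 0.90)
    l70 = failure_rate_at_confidence(n, 1.0e6, 0.70)
    print(n, round(l90 / l70, 2))
# Approximately: 1 -> 1.6, 3 -> 1.4, 10 -> ~1.2, 100 -> ~1.1
```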

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:

'…failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors. With fault tolerance in final elements the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
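A quick numerical check of the approximation above, using the example figures (λDU = 0.02 pa, T = 1 year). This is an illustrative sketch only, not part of the referenced calculation methods:

```python
def pfd_avg_1oo2(lam_du, test_interval, beta):
    """Average PFD of a 1oo2 pair using the approximation quoted above:
    independent term + common cause term."""
    independent = (1.0 - beta) * (lam_du * test_interval) ** 2 / 3.0
    common_cause = beta * lam_du * test_interval / 2.0
    return independent + common_cause

lam_du, T = 0.02, 1.0                      # failures per annum, years
for beta in (0.02, 0.05, 0.10):
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    print(f"beta={beta:.0%}: independent={independent:.1e}, "
          f"common cause={common_cause:.1e}, "
          f"total={pfd_avg_1oo2(lam_du, T, beta):.1e}")
# Even at beta = 2% the common cause term (2.0e-04) already exceeds the
# independent term (1.3e-04); at beta = 10% it dominates completely.
```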

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E. P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval


between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
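As a rough illustration of why 1oo1 SIL 2 is described as marginal (a sketch assuming the common simplification PFDavg ≈ λDU·T/2 and the standard low-demand SIL bands; the helper names are hypothetical):

```python
def pfd_avg_1oo1(lam_du, test_interval):
    """Simplified average PFD for a single (1oo1) final element."""
    return lam_du * test_interval / 2.0

def sil_band(rrf):
    """Low-demand SIL band for a given risk reduction factor (RRF = 1/PFD)."""
    if rrf >= 10000:
        return "SIL 4 range"
    if rrf >= 1000:
        return "SIL 3 range"
    if rrf >= 100:
        return "SIL 2 range"
    if rrf >= 10:
        return "SIL 1 range"
    return "below SIL 1"

# A typical actuated valve (0.02 pa) proof-tested yearly sits right on the
# SIL 1 / SIL 2 boundary, which is why 1oo1 SIL 2 is marginal for valves.
pfd = pfd_avg_1oo1(0.02, 1.0)               # 0.01
print(pfd, 1.0 / pfd, sil_band(1.0 / pfd))  # 0.01, RRF 100, 'SIL 2 range'
```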

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
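A minimal sketch of this kind of follow-up comparison (assuming failures can be treated as Poisson events for the purpose of setting a target; the function names and the 5% significance threshold are illustrative and not taken from the SINTEF guideline):

```python
from scipy.stats import poisson

def expected_failures(lam_assumed, device_years):
    """Target number of failures implied by the failure rate assumed in design."""
    return lam_assumed * device_years

def rate_exceeds_benchmark(observed, lam_assumed, device_years, alpha=0.05):
    """Flag the observed failure count if it would be unlikely (probability
    below alpha) under the design assumption, treating failures as Poisson."""
    target = expected_failures(lam_assumed, device_years)
    # Probability of seeing at least 'observed' failures if the assumption holds
    p = poisson.sf(observed - 1, target)
    return p < alpha

# Example: 50 valves, 4 years in service, design assumption 0.02 failures pa.
print(expected_failures(0.02, 50 * 4))          # target: 4 expected failures
print(rate_exceeds_benchmark(9, 0.02, 50 * 4))  # True -> analyse the causes
```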


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that


appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries — Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990


Can we be confident
Users have collected so much data over many years in a variety of environments.

With an MTBF of 100 years, to measure 10 failures we only need 100 similar devices in service for 10 years.

Surely by now we must have recorded enough failures for most commonly applied devices.

λ90 ≈ λ70 ≈ λAVG

Where can we find credible data

An enormous effort has been made to determine λ. Some widely used sources:
• OREDA 'Offshore and Onshore Reliability Handbook'
• SINTEF PDS Data Handbook 'Reliability Data for Safety Instrumented Systems'
• exida database incorporated into exSILentia software, and the SERH 'Safety Equipment Reliability Handbook'
• FARADIP database
• Users' own failure rate data

Combining data from multiple sources
OREDA combines data from multiple sources in a probability distribution, but the spread is very wide.
Uncertainty intervals span 1 or 2 orders of magnitude.

Wide variation in reported λ
Why is the variation so wide if λ is constant?

OREDA includes all mid-life failures
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate

λ is NOT constant: χ² function confidence levels do not apply. OREDA provides lower and upper deciles to show uncertainty.

A long-standing debate
Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failure.

ISA-TR84.00.02, exida, SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards

Manage risk of systematic (i.e. preventable) failures
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce risk of random hardware failures
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
  ‒ Apply diagnostics for fault detection and regular periodic testing
  ‒ Apply redundant equipment for fault tolerance

What is random
Which of these are random?
• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ 0.47

If a football team wins 47 matches out of 100, what is the probability they will win their next match?

What is random?
Dictionary: 'Made, done, or happening without method or conscious decision; haphazard.'

Mathematics: A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different.

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Guidance from ISO/TR 12489
Random:
Hardware – electronic components: constant λ
Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
Human – operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
Hardware – specification, design, installation, operation
Software – specification, coding, testing, modification
Human – depending on training, understanding, attitude

Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure
'failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures?
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.

Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but with wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed.

PFD(t) = ∫₀ᵗ λDU e^(−λDU·τ) dτ = 1 − e^(−λDU·t) ≈ λDU·t

The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T the average is simply PFDavg ≈ λDU·T/2.
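As an illustration (the values are assumed for a typical actuated valve assembly and are not taken from a specific dataset), the exponential accumulation and the linear approximation can be compared directly:

```python
import math

lam_du = 0.02   # dangerous undetected failure rate, per year (assumed typical valve assembly)
T = 1.0         # proof test interval, years

for t in (0.25, 0.5, 1.0):
    exact = 1 - math.exp(-lam_du * t)    # PFD(t) = 1 - e^(-lambda_DU * t)
    linear = lam_du * t                  # linear approximation, valid while lambda_DU * t << 1
    print(f"t = {t:4.2f} y   exact = {exact:.5f}   linear = {linear:.5f}")

# Average over the proof test interval
print(f"PFDavg ≈ {lam_du * T / 2:.3f}")  # ≈ 0.01 for these values
```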

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause. Common cause failure always strongly dominates the PFD. The PFD depends equally on λDU, β and T.

Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.

The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically, failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: a reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum.

These are order of magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).

Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% – 15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
(A numeric check is sketched below.)
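A minimal numeric check of these statements, using the typical values quoted above (a sketch only; the expressions are the simplified approximations from the preceding slides):

```python
beta = 0.10     # common cause fraction (typical)
lam_du = 0.02   # dangerous undetected failure rate, per annum (typical valve assembly)
T = 1.0         # proof test interval, years

pfd_1oo1 = lam_du * T / 2
pfd_1oo2 = (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

print(f"1oo1: PFDavg ≈ {pfd_1oo1:.1e}, RRF ≈ {1 / pfd_1oo1:.0f}")   # ~1e-2, RRF ~100 (SIL 1, marginal SIL 2)
print(f"1oo2: PFDavg ≈ {pfd_1oo2:.1e}, RRF ≈ {1 / pfd_1oo2:.0f}")   # ~1.1e-3, RRF ~900 (SIL 2, short of SIL 3)
```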

Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate the actual λDU, compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.

Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result
• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random.


Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight), which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
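Written out explicitly (this restates the relationship used in the slides earlier in this document rather than adding anything new), the probability that a device picked at random has failed undetected by time t is:

PFD(t) = 1 − e^(−λDU·t) ≈ λDU·t (for λDU·t ≪ 1)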

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased; λ90 will be no more than about 20% higher than λ70.
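This dependence on the number of recorded failures can be illustrated with a short calculation (a sketch only, assuming scipy is available; as stated above, the ratio is independent of the population size and time in service):

```python
from scipy.stats import chi2

def rate_at_confidence(failures, device_hours, confidence):
    # chi-squared estimate with 2n+2 degrees of freedom (time-terminated data)
    return chi2.ppf(confidence, 2 * failures + 2) / (2.0 * device_hours)

for n in (1, 3, 10, 30):
    ratio = rate_at_confidence(n, 1.0, 0.90) / rate_at_confidence(n, 1.0, 0.70)
    print(f"{n:3d} failures recorded: lambda90 / lambda70 ≈ {ratio:.2f}")
# The ratio shrinks towards 1 as more failures are recorded.
```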

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
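As a rough illustration of how wide such an interval can be (an assumption-laden sketch, not OREDA's actual fitting procedure), a lognormal distribution with a mean of 4.1E-07 per hour and a standard deviation of 9.0E-07 per hour has a 90% interval spanning nearly two orders of magnitude:

```python
import math

mean, sd = 4.1e-7, 9.0e-7            # assumed mean and standard deviation of the failure rate
sigma2 = math.log(1 + (sd / mean) ** 2)
mu = math.log(mean) - sigma2 / 2     # lognormal parameters matching that mean and SD
z05, z95 = -1.645, 1.645             # 5% and 95% standard normal quantiles

lower = math.exp(mu + z05 * math.sqrt(sigma2))
upper = math.exp(mu + z95 * math.sqrt(sigma2))
print(f"5%: {lower:.1e}   95%: {upper:.1e}   ratio: {upper / lower:.0f}x")
```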

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard.'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
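The dominance of the common cause term can be checked numerically. The following sketch uses the assumed values from the example above and varies β to show where the two terms become comparable:

```python
lam_du, T = 0.02, 1.0    # per annum, years (assumed values from the example above)

for beta in (0.15, 0.10, 0.05, 0.02, 0.01):
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    print(f"beta = {beta:4.0%}: independent term {independent:.1e}, common cause term {common_cause:.1e}")
# The common cause term dominates until beta drops to roughly lambda_DU * T (about 1-2% here).
```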


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
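For reference, the categorisation by SIL can be expressed as a simple order-of-magnitude lookup. This is a sketch based on the low demand PFDavg bands in IEC 61508 / IEC 61511; the function name is illustrative only:

```python
import math

def sil_band(pfd_avg):
    """SIL band (low demand mode) for a given PFDavg, per the order-of-magnitude
    bands in IEC 61508 / IEC 61511 (SIL 1: 1E-2 to 1E-1, ..., SIL 4: 1E-5 to 1E-4)."""
    if not 1e-5 <= pfd_avg < 1e-1:
        return "outside the SIL 1-4 bands"
    return f"SIL {-math.floor(math.log10(pfd_avg)) - 1}"

print(sil_band(1.1e-3))   # SIL 2 (RRF ~900)
print(sil_band(9.0e-4))   # SIL 3 (RRF ~1100)
```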

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T.


If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
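One simple way of implementing that kind of follow-up is sketched below. It assumes a Poisson model and the availability of scipy; the numbers and names are illustrative and are not taken from the SINTEF guideline:

```python
from scipy.stats import poisson

def failures_exceed_target(observed, lam_target, devices, years, p_limit=0.95):
    """Compare the observed failure count with the count expected from the
    failure rate assumed in the design, flagging a statistically unlikely excess."""
    expected = lam_target * devices * years        # expected number of failures
    threshold = poisson.ppf(p_limit, expected)     # e.g. the 95th percentile
    return observed > threshold, expected

flag, expected = failures_exceed_target(observed=6, lam_target=0.02, devices=40, years=3)
print(f"expected ≈ {expected:.1f} failures; analyse causes and consider compensating measures: {flag}")
```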


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
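A hedged illustration of that proportionality (the bypass hours are an assumption, not a figure from this paper): the average unavailability contributed by time out of service is simply the fraction of time the function is bypassed.

```python
hours_out_of_service_per_year = 24          # assumed total bypass/repair time per year
unavailability = hours_out_of_service_per_year / 8760
print(f"PFD contribution from time out of service ≈ {unavailability:.1e}")
# ≈ 2.7e-3, which by itself already exceeds the SIL 3 band (PFDavg < 1e-3)
```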

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures.


The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990

Page 11: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

Where can we find credible data

An enormous effort has been made to determine λSome widely used sources

bull OREDA lsquoOffshore and Onshore Reliability Handbookrsquobull SINTEF PDS Data Handbook

lsquoReliability Data for Safety Instrumented Systemsrsquobull exida database incorporated into exSILentia software

and the SERH lsquoSafety Equipment Reliability Handbookrsquo bull FARADIP databasebull Usersrsquo own failure rate data

Combining data from multiple sourcesOREDAcombines data from multiple sources in a probability distribution but the spread is very wideUncertainty intervals span 1 or 2 orders of magnitude

Wide variation in reported λWhy is the variation so wide if λ is constant

OREDA includes all mid-life failures‒ only some are random‒ some are systematic (depending on the application)‒ site-specific factors influence failure rate

λ is NOT constant χ2 function confidence levels do not apply OREDA provides lower and upper deciles to show uncertainty

A long-standing debateShould the calculated probability of failure take into account systematic failures

The intent of IEC 61508 IEC 61511 and ISOTR 12489 is to calculate failure probability only for random failure

ISA TR840002 exida SINTEF PDS Method and OREDA specifically include some systematic failures

The intent of the standards

Manage risk of systematic (ie preventable) failuresbull Prevent errors and failures in design and implementationbull By applying quality management methods

Reduce risk of random hardware failuresbull For the failures that cannot be effectively preventedbull Calculate failure probability based on failure ratesbull Reduce the probability of failure to achieve the

required risk reduction target‒ Apply diagnostics for fault detection and regular periodic testing‒ Apply redundant equipment for fault tolerance

What is randomWhich of these are random

bull Tossing a coinbull Horse racebull Football match

If I record 47 heads in 100 tosses P(head) asymp

If a football team wins 47 matches out of 100 what is the probability they will win their next match

What is randomDictionary

Made done or happening without method or conscious decision haphazard

MathematicsA purely random process involves mutually independent events The probability of any one event is not dependent on other events

These definitions are significantly different

Guidance from ISOTR 12489

ISOTR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6

Guidance from ISOTR 12489RandomHardware ndash electronic components constant λHardware ndash mechanical components non-constant λ(age and wear related failures in mid-life period)Human ndash operating under stress non-routine variable λSystematic - cannot be quantified by a fixed rateHardware - specification design installation operationSoftware - specification coding testing modification Human ndash depending on training understanding attitude

Purely random failureOnly lsquocatalecticrsquo failures have constant failure ratesISOTR 12489 sect329 catalectic failuresudden and complete failureNote 1 to entry [hellip] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item It is the contrary of failures occurring progressively and incompletelyNote 2 to entry Catalectic failures characterize simple components with constant failure rates (exponential law) they remain permanently ldquoas good as newrdquo until they fail suddenly completely and without warning Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach)

Definition in IEC 61511 and IEC 61508IEC 61511 sect3259 and IEC 61508-4 sect365random hardware failurefailure occurring at a random time which results from one or more of the possible degradation mechanisms in the hardwareNote 1 to entry There are many degradation mechanisms occurring at different rates in different components and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation failures of a total equipment comprising many components occur at predictable rates but at unpredictable (ie random) times

Not limited to lsquocatalecticrsquo failureCannot be characterised by fixed constant failure rate ndashthough the rates are measurable and may be predictable

Random or systematic failuresPressure transmitters

bull Blocked tubingbull Corroded diaphragmbull Sudden electronic component failurebull Calibration drift due to vibrationbull Overheated transducerbull Tubing leakbull Isolation valve closedbull High impedance jointbull Water ingress partial short circuitbull Supply voltage outside limitsbull Age or wear related deterioration (rate not constant)Most failures are partially systematic and partially random

Quasi-random hardware failuresMost hardware failures are not purely randomThe failure causes are well known and understoodBut many failures cannot be prevented in practice

bull due to lack of maintenance resources and accessbull treated as quasi-random

Failure rates can be measuredbull may be reasonably constant for a given operatorbull but a wide variation between different operatorsbull not a fixed constant ratebull can be reduced by deliberate effort

Failure probability depends on failure rateIf undetected device failures occur at a constant average rate then failures accumulate exponentially

eg 10 chance that a device picked at random has failed

119875119875119875119875119875119875 119905119905 = 0

119905119905λDU 119890119890minusλ119863119863119863119863120591120591119889119889119889119889 = 1 minus 119890119890minusλ119863119863119863119863119905119905

λt

The basic assumption in failure probabilityIf undetected device failures occur at a constant average rate then failures accumulate exponentially

SIFs always require PFDavg lt 01 in this region the accumulation is linear

PFD asymp λDUt

With a proof test interval T the average is simplyPFDavg asymp λDUT 2

PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common cause120573120573120582120582119863119863119863119863119879119879

2

PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common causeCommon cause failure always strongly dominates PFDThe PFD depends equally on 120582120582119863119863119863119863 120573120573 and 119879119879

120573120573120582120582119863119863119863119863119879119879

2

Dependence on failure rateOur estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elementsThe failure rate

bull can be measuredbull may be predictable bull is not a fixed constant rate

Typical failure rates are easy to obtainFailure rates can be controlled and reduced

Managing and reducing failuresOREDA reports failure rates that are feasible in practiceTypically failure rates vary over at least an order of magnitudeThe scope for reduction in failure is clearReduction in rate by a factor of 10 may be feasible

Typical feasible values for 120582120582119863119863119863119863Sensors typically have MTBF asymp 300 years120582120582119863119863119863119863 asymp 0003 per annum

Actuated valve assemblies typically have MTBF asymp 50 years120582120582119863119863119863119863 asymp 002 per annum

Contactors or relays typically have MTBF asymp 200 years120582120582119863119863119863119863 asymp 0005 per annum

These are order of magnitude estimates ndashbut are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude)

Typical feasible values for 120573120573IEC 61508-6 Annex D suggests a range of 1 to 10

Typically 120573120573 asymp 12 - 15 in practice (Ref SINTEF Report A26922)

bull difficult to reduce below 5 with similar devicesbull strongly dominates PFD of voted architectures

Minimise 120573120573 through bull independence and diversity in design and maintenance bull preventing systematic failures

Typical values for 119879119879 depend on application Batch process 119879119879 lt 01 years

‒ Results in low PFD final elements might not dominate

Continuous process 119879119879 asymp 1 year

LNG production 119879119879 gt 4 years‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce 119879119879

What architecture is typically neededFor typical values 120573120573 asymp 10 120582120582119863119863119863119863 asymp 002 pa 119879119879 asymp 1 ySIL 1 is always easy to achieve without HFT

SIL 2 needs 1oo2 final elements unless overall 120582120582119863119863119863119863 can be reduced to lt 002 pa (MTBFDU gt 50 y)

SIL 3 would need 1oo2 final elements andoverall 120582120582119863119863119863119863 lt 002 pa

bull either 120582120582119863119863119863119863 120573120573 andor 119879119879 must be minimisedbull needs close attention in design

Design assumptions set performance benchmarksProbability of SIF failure on demand is proportional to

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

120573120573 120582120582119863119863119863119863 119879119879 and MTTR

Monitoring performance is essential
Operators must monitor the failure rates and modes:

• Calculate actual λDU; compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:

• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:

• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality:

• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.

Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
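
In this notation, and as summarised in the accompanying presentation, the PFD of a single component is PFD(t) = 1 − e^(−λDU·t) ≈ λDU·t when λDU·t ≪ 1, so the average over a proof test interval T is approximately λDU·T/2.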

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
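
As an illustration of how such an estimate can be made, the sketch below computes a one-sided upper confidence bound on λ from a failure count and a cumulative time in service, using the χ² relation referenced in IEC 60605-6 [2]. It assumes SciPy is available; the use of 2n + 2 degrees of freedom (the usual time-truncated form) and the failure count and service time are illustrative assumptions, not figures from this paper.

```python
# Sketch only: one-sided upper confidence bound on a constant failure rate,
# using the chi-squared relation for time-truncated data (cf. IEC 60605-6).
from scipy.stats import chi2

def lambda_upper(n_failures, time_in_service_years, confidence):
    """Upper bound on the failure rate [per year] at the given confidence level."""
    return chi2.ppf(confidence, 2 * n_failures + 2) / (2.0 * time_in_service_years)

# Hypothetical record: 3 failures over 150 accumulated device-years
l70 = lambda_upper(3, 150.0, 0.70)
l90 = lambda_upper(3, 150.0, 0.90)
print(l70, l90, l90 / l70)  # the ratio depends only on the number of failures recorded
```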

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed

• The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices

• No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors. With fault tolerance in final elements the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
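
A minimal sketch of this approximation is shown below, using the illustrative values quoted above (λDU = 0.02 pa, T = 1 year, β = 10%). It is a simple implementation of the two-term formula only, not a full IEC 61508-6 PFD model.

```python
# Sketch of the simplified beta-factor approximation for a 1oo2 final element pair.
# Illustrative only -- not a full IEC 61508-6 PFD model.

def pfd_avg_1oo2(lambda_du, beta, test_interval_years):
    """Return the independent and common cause contributions to the average PFD."""
    independent = (1.0 - beta) * (lambda_du * test_interval_years) ** 2 / 3.0
    common_cause = beta * lambda_du * test_interval_years / 2.0
    return independent, common_cause

ind, ccf = pfd_avg_1oo2(lambda_du=0.02, beta=0.10, test_interval_years=1.0)
print(ind, ccf, ind + ccf)   # ~1.2e-4, 1.0e-3, ~1.1e-3
```

With these values the independent term is about 1.2 × 10⁻⁴ while the common cause term is 1 × 10⁻³, roughly eight times larger, which is why β dominates unless the proof test interval is long or β is very small.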

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.


Clearly the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
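
For illustration only: a 1oo1 contactor with λDU = 0.01 pa and T = 1 year gives PFDavg ≈ 0.01 × 1/2 = 0.005 (RRF ≈ 200), comfortably within SIL 2, whereas a 1oo1 valve at 0.02 pa gives PFDavg ≈ 0.01 (RRF ≈ 100), right at the SIL 1/SIL 2 boundary.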

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
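
A minimal sketch of that follow-up comparison is shown below. The device count, observation period and the use of a simple Poisson exceedance check are illustrative assumptions, not part of the SINTEF guideline; SciPy is assumed to be available.

```python
# Sketch: compare observed failures against the benchmark implied by the design
# failure rate. Device count, period and the Poisson check are illustrative assumptions.
from scipy.stats import poisson

lambda_design = 0.02      # dangerous undetected failure rate assumed in design [per year]
n_devices = 40            # number of similar devices in SIS service
years_observed = 3.0
observed_failures = 5

expected_failures = lambda_design * n_devices * years_observed   # benchmark: 2.4 failures
# Probability of seeing at least this many failures if the design rate were being met
p_at_least = poisson.sf(observed_failures - 1, expected_failures)
print(expected_failures, p_at_least)
# A low probability suggests the design assumption is being exceeded and the
# causes of the failures should be analysed.
```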


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing lsquorandomrsquo failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA, 'Offshore and Onshore Reliability Data Handbook', Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] 'Safety Equipment Reliability Handbook', 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990


Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 1

PREVENT PREVENTABLE FAILURES

M Generowicz FS Senior Expert (TUumlV Rheinland 18312) Engineering Manager A Hertel Principal Engineer

IampE Systems Pty Ltd

Perth Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions The new requirements are based on the IEC 61508-2 lsquoRoute 2Hrsquo method which relies on increased confidence levels in failure rates

Instead of requiring increased confidence level IEC 61511-1 requires lsquocredible and traceable reliability datarsquo in the quantification of failure probability It is not immediately obvious what this means in practice

Failure probability calculations are based on the assumption that failure rates are fixed and constant That assumption is invalid but the calculations are still useful

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications The variation depends largely on the service operating environment and maintenance practices It is clear that

bull Failures are almost never purely random and as a result

bull Failure rates are not fixed and constant

At best the risk reduction achieved by these safety functions can be estimated only within an order of magnitude Nevertheless even with such imprecise results the calculations are still useful

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions This is important in setting operational reliability benchmarks

Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices

Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 2

purely random Therefore safety function reliability performance can be improved through four key strategies

1 Eliminating systematic and common-cause failures throughout the design development and implementation processes and throughout operation and maintenance practices

2 Designing the equipment to allow access to enable sufficiently frequent inspection testing and maintenance and to enable suitable test coverage

3 Using risk-based inspection and condition-based maintenance techniques to

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4 Detailed analysis of all failures to

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction Functional safety maintains safety integrity of assets in two ways

Systematic safety integrity deals with preventable failures These are failures resulting from errors and shortcomings in the design manufacture installation operation maintenance and modification of the safeguarding systems

Hardware safety integrity deals with controlling random hardware failures These are the failures that occur at a reasonably constant rate and are completely independent of each other They are not preventable and cannot be avoided or eliminated but the probability of these failures occurring (and the resulting risk reduction) can be calculated

Calculation methods

IEC 61508-62010 Annex B[4] provides basic guidance on evaluating probabilities of failure State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO 12489[5]

Several other useful references are available on this subject including ISA-TR840002-2015[6] and the SINTEF PDS Method Handbook[7] These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved The calculations are all based on these assumptions

bull Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

bull Failures occurring within a population of devices are assumed to be independent events

ISOTR 12489 uses the term lsquocatalectic failurersquo to clarify the types of failures that can be expected to occur at fixed rates

329 catalectic failure sudden and complete failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 3

Note 1 to entry This term has been introduced by analogy with the catalectic verses (ie a verse with seven foots instead of eight) which stop abruptly Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item It is the contrary of failures occurring progressively and incompletely

Note 2 to entry Catalectic failures characterize simple components with constant failure rates (exponential law) they remain permanently ldquoas good as newrdquo until they fail suddenly completely and without warning Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach)

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (‘Route 1H’)

• Increasing the confidence level in the measured failure rates to at least 90% (‘Route 2H’)

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of ‘a’ is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
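
A minimal sketch of this estimate, assuming the widely used time-censored form λa = χ²(a, 2n + 2) / (2t); conventions for the degrees of freedom vary between references, so treat the exact form as an assumption:

```python
# Sketch of a chi-squared estimate of failure rate at confidence level 'a',
# given n recorded failures over a cumulative time in service t
# (e.g. device-years). Assumes the time-censored form with 2n + 2 degrees
# of freedom; some references use 2n instead.
from scipy.stats import chi2

def failure_rate_estimate(n_failures, total_time, confidence):
    return chi2.ppf(confidence, 2 * n_failures + 2) / (2.0 * total_time)

n, t = 3, 1500.0                       # 3 failures over 1500 device-years
lam70 = failure_rate_estimate(n, t, 0.70)
lam90 = failure_rate_estimate(n, t, 0.90)
print(lam70, lam90, lam90 / lam70)     # the ratio depends only on n, not on t
```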

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word ‘random’:

‘Made, done, or happening without method or conscious decision; haphazard’

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

‘…failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware’

The mathematical analysis of failure probability is based on the concept of a ‘random process’ or ‘stochastic process’. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

‘failure rates arising from random hardware failures can be predicted with reasonable accuracy’

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from ‘A Dictionary of Statistical Terms’ by F.H.C. Marriott [13]:

‘The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as “white noise”, relating to the energy characteristics of certain physical phenomena’

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

‘…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot’

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)(λDU T)² / 3 + β λDU T / 2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
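
As a worked check of the approximation above, using the illustrative values already quoted in the text (λDU = 0.02 pa, T = 1 year, β = 10%), a short calculation shows how strongly the common cause term dominates:

```python
# Worked check of the 1oo2 approximation, with the illustrative values from
# the text: lambda_DU = 0.02 per annum, T = 1 year, beta = 10 %.
lam_du, T, beta = 0.02, 1.0, 0.10

independent = (1 - beta) * (lam_du * T) ** 2 / 3    # ~1.2e-4
common_cause = beta * lam_du * T / 2                # 1.0e-3
pfd_avg = independent + common_cause

print(f"PFDavg ~ {pfd_avg:.1e}, RRF ~ {1 / pfd_avg:.0f}")
# The common cause term is roughly ten times the independent term, so beta
# dominates unless it is reduced to well below lambda_DU * T.
```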

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

‘Essentially, all models are wrong, but some are useful’

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
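
A rough feasibility check along these lines, assuming the simplified 1oo1 approximation PFDavg ≈ λDU T / 2 and the illustrative rates quoted above (the helper name is hypothetical):

```python
# Rough feasibility check for a single (1oo1) final element, assuming the
# simplified approximation PFDavg ~ lambda_DU * T / 2. SIL bands by RRF:
# SIL 1: 10-100, SIL 2: 100-1,000, SIL 3: 1,000-10,000.
def rrf_1oo1(lam_du, test_interval_years):
    return 2.0 / (lam_du * test_interval_years)

print(rrf_1oo1(0.02, 1.0))   # valve, annual test: RRF = 100 - only just SIL 2
print(rrf_1oo1(0.02, 0.5))   # test twice a year: RRF = 200
print(rrf_1oo1(0.01, 1.0))   # contactor/breaker: RRF = 200
```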

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and they are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, ‘Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase’ [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
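
A minimal sketch of such a benchmark comparison, assuming a Poisson model for the number of failures; the function name, population size and significance threshold are illustrative and not taken from the SINTEF guideline:

```python
# Sketch of a benchmark check: compare the number of dangerous undetected
# failures observed across a population of devices with the number expected
# from the failure rate assumed in the design, using a Poisson model.
# Function name, population and threshold are illustrative assumptions.
from scipy.stats import poisson

def check_failure_benchmark(observed, lam_assumed, n_devices, years, alpha=0.05):
    expected = lam_assumed * n_devices * years
    # Probability of at least 'observed' failures if the assumed rate were true
    p_value = poisson.sf(observed - 1, expected)
    return expected, p_value, p_value < alpha

expected, p, investigate = check_failure_benchmark(
    observed=6, lam_assumed=0.02, n_devices=50, years=2.0)
print(expected, p, investigate)   # expected 2.0 failures; 6 observed -> investigate
```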


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
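
As a rough illustration of how out-of-service time alone feeds into the PFD (the bypass duration below is an assumed figure, not from the paper):

```python
# Contribution of bypass / out-of-service time to the average PFD, assuming
# the safety function is unavailable for 'hours_bypassed' hours per year.
hours_bypassed = 8.0            # assumed figure for illustration
hours_per_year = 8760.0

pfd_bypass = hours_bypassed / hours_per_year
print(f"{pfd_bypass:.1e}")      # ~9e-4: most of a SIL 3 budget (PFDavg < 1e-3)
```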

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing ‘random’ failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] ‘Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements’, IEC 61511-1:2016

[2] ‘Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity’, IEC 60605-6:2007

[3] ‘Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations’, IEC 61508-4:2010

[4] ‘Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3’, IEC 61508-6:2010

[5] ‘Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems’, ISO/TR 12489

[6] ‘Safety Integrity Level (SIL) Verification of Safety Instrumented Functions’, ISA-TR84.00.02-2015

[7] ‘Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook’, SINTEF, Trondheim, Norway, Report A24442, 2013

[8] ‘Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook’, SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, ‘Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase’, SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., ‘Common Cause Failures in Safety Instrumented Systems’, SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, ‘A Dictionary of Statistical Terms’, 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990


Wide variation in reported λ
Why is the variation so wide, if λ is constant?

OREDA includes all mid-life failures:
‒ only some are random
‒ some are systematic (depending on the application)
‒ site-specific factors influence failure rate

λ is NOT constant
‒ χ² function confidence levels do not apply
‒ OREDA provides lower and upper deciles to show uncertainty

A long-standing debate
Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failure.

ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards

Manage risk of systematic (i.e. preventable) failures
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce risk of random hardware failures
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
‒ Apply diagnostics for fault detection and regular periodic testing
‒ Apply redundant equipment for fault tolerance

What is random?
Which of these are random?

• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ ?

If a football team wins 47 matches out of 100, what is the probability they will win their next match?

What is random?
Dictionary:

‘Made, done, or happening without method or conscious decision; haphazard’

Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different.

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Guidance from ISO/TR 12489
Random:
Hardware – electronic components: constant λ
Hardware – mechanical components: non-constant λ (age and wear related failures in mid-life period)
Human – operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
Hardware – specification, design, installation, operation
Software – specification, coding, testing, modification
Human – depending on training, understanding, attitude

Purely random failure
Only ‘catalectic’ failures have constant failure rates.
ISO/TR 12489 §3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently “as good as new” until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure
failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to ‘catalectic’ failure.
Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures?
Pressure transmitters:

• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.

Quasi-random hardware failures
Most hardware failures are not purely random.
The failure causes are well known and understood.
But many failures cannot be prevented in practice:

• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

e.g. 10% chance that a device picked at random has failed

PFD(t) = ∫[0,t] λDU e^(−λDU τ) dτ = 1 − e^(−λDU t)

The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is linear:

PFD ≈ λDU t

With a proof test interval T, the average is simply PFDavg ≈ λDU T / 2


PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)(λDU T)² / 3 + β λDU T / 2

β is the fraction of failures that share common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.

Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:

• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain.
Failure rates can be controlled and reduced.

Managing and reducing failures
OREDA reports failure rates that are feasible in practice.
Typically failure rates vary over at least an order of magnitude.
The scope for reduction in failure is clear.
Reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum

Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum

Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum

These are order of magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude)

Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% – 15% in practice (Ref SINTEF Report A26922)

• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:

• either λDU, β and/or T must be minimised
• needs close attention in design

Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential
Operators must monitor the failure rates and modes:

• Calculate actual λDU, compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:

• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:

• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality:

• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data

Performance targets must be feasible by design

≡ feasible performance targets


Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction Functional safety maintains safety integrity of assets in two ways

Systematic safety integrity deals with preventable failures These are failures resulting from errors and shortcomings in the design manufacture installation operation maintenance and modification of the safeguarding systems

Hardware safety integrity deals with controlling random hardware failures These are the failures that occur at a reasonably constant rate and are completely independent of each other They are not preventable and cannot be avoided or eliminated but the probability of these failures occurring (and the resulting risk reduction) can be calculated

Calculation methods

IEC 61508-62010 Annex B[4] provides basic guidance on evaluating probabilities of failure State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO 12489[5]

Several other useful references are available on this subject including ISA-TR840002-2015[6] and the SINTEF PDS Method Handbook[7] These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved The calculations are all based on these assumptions

bull Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

bull Failures occurring within a population of devices are assumed to be independent events

ISOTR 12489 uses the term lsquocatalectic failurersquo to clarify the types of failures that can be expected to occur at fixed rates

329 catalectic failure sudden and complete failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 3

Note 1 to entry This term has been introduced by analogy with the catalectic verses (ie a verse with seven foots instead of eight) which stop abruptly Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item It is the contrary of failures occurring progressively and incompletely

Note 2 to entry Catalectic failures characterize simple components with constant failure rates (exponential law) they remain permanently ldquoas good as newrdquo until they fail suddenly completely and without warning Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach)

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions The required level of redundancy increases with the risk reduction required

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards

IEC 61508 provides two strategies for minimising the required hardware fault tolerance

bull Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (lsquoRoute 1Hrsquo)

bull Increasing the confidence level in the measured failure rates to at least 90 (lsquoRoute 2Hrsquo)

A confidence level of 90 effectively means that there is only a 10 chance that the true average failure rate is greater than the estimated value

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H though it requires only a confidence level of 70 However IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment

Failure rate confidence level

If all of the failures for a given type of equipment are recorded the failure rate λ can be estimated with any required level of confidence by applying a χsup2 (chi-squared) distribution The failure rate estimated with confidence level of lsquoarsquo is designated λa The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 4

The ratio of λ90 to λ70 depends only on the number of failures recorded It does not depend directly on the failure rate itself or on the population size The width of the uncertainty band becomes narrower with each recorded failure A good estimate for λ can be obtained with as few as 3 failures If 10 or more failures have been recorded the overall confidence is increased λ90 will be no more than about 20 higher than λ70

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure

The tables also show the upper and lower limits of a 90 uncertainty interval for the reported failure rates This is the band stretching from the 5 certainty level to the 95 certainty level The certainty level is not the same as the confidence level relating to a single dataset but the intent is similar

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications Some users consistently achieve failure rates at least 10 times lower than other users The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design operation and maintenance

One reason for the variability in rates is that these datasets include all failures systematic failures as well as random failures

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure The definitions for random failure vary between the different standards and references but they are generally consistent with the dictionary definitions of the word lsquorandomrsquo

lsquoMade done or happening without method or conscious decision haphazardrsquo

ISOTR 124892013[5] Annex B explains that both hardware failures and human failures can occur at random It makes it clear that not all random failures occur at a constant rate Constant failure rates are typical in electronic components before they reach the end of their useful life Random failures of mechanical components are caused by deterioration due to age and wear but the failure rates are not constant

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 5

The definitions of random failure in ISA TR840002-2015[6] IEC 61508-42010[3] and IEC 61511-12016[1] are all similar

lsquohellipfailure occurring at a random time which results from one or more of the possible degradation mechanisms in the hardwarersquo

The mathematical analysis of failure probability is based on the concept of a lsquorandom processrsquo or lsquostochastic processrsquo In this context the usage of the word random is narrower For the mathematical analysis these standards all assume that

lsquofailure rates arising from random hardware failures can be predicted with reasonable accuracyrsquo

Failures that occur at a fixed constant rate are purely random but in practice only a small proportion of random failures are purely random The following definition of a purely random process is from lsquoA Dictionary of Statistical Termsrsquo by FHC Marriott[13]

lsquoThe simplest example of a stationary process where in discrete time all the random variables z are mutually independent In continuous time the process is sometimes referred to as ldquowhite noiserdquo relating to the energy characteristics of certain physical phenomenarsquo

The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random

By contrast the standards note that systematic failures cannot be characterised by a fixed rate though to some extent the probability of systematic faults existing in a system may be estimated The standards are reasonably consistent in their definition of systematic failure

lsquohellipfailure related in a deterministic way to a certain cause or pre-existing fault Systematic failures can be eliminated after being detected while random hardware failures cannotrsquo

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random Very few failures are purely random

Failures of mechanical components may seem to be random but the causes are usually well known and understood and are partially deterministic The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed

Degradation of mechanical components is not a purely random process Degradation can be monitored

While it might be theoretically possible to prevent failure from degradation it is simply not practicable to prevent all failures Inspection and maintenance can never be perfect It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance overhaul and renewal They are characterised by a constant failure rate even though that rate is not fixed

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 6

The reasons for the wide variability in reported failure rates seem to be clear

bull Only a small proportion of failures occur at a fixed rate that cannot be changed

bull The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

bull The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

bull The rates of systematic failures vary widely from user to user depending on the effectiveness of the quality management practices

bull No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
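To illustrate how strongly the common cause term dominates, the minimal sketch below evaluates the two terms of the 1oo2 approximation above for the example values used in this paper (λDU = 0.02 pa, T = 1 year) and an assumed β of 10%. The detected-failure contribution λDD·MTTR is included only as an optional extra term; all numbers are assumptions for illustration, not recommended design values.

```python
# Sketch: contribution of each term in the 1oo2 PFD approximation.
# Assumed example values (per annum and years), not design recommendations.
lambda_du = 0.02   # undetected dangerous failure rate, per annum
T = 1.0            # proof test interval, years
beta = 0.10        # common cause factor (fraction)

independent_term = (1 - beta) * (lambda_du * T) ** 2 / 3
common_cause_term = beta * lambda_du * T / 2

pfd_avg = independent_term + common_cause_term
print(f"independent term  : {independent_term:.2e}")   # ~1.2e-04
print(f"common cause term : {common_cause_term:.2e}")  # ~1.0e-03, dominates
print(f"PFDavg            : {pfd_avg:.1e}")            # only ~1 significant figure is meaningful

# Optional: contribution of detected failures, proportional to lambda_DD * MTTR.
lambda_dd = 0.05            # detected dangerous failure rate, per annum (assumed)
mttr_years = 8.0 / 8760.0   # 8 hours expressed in years (assumed)
print(f"detected-failure term ~ {lambda_dd * mttr_years:.1e}")
```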

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E. P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
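To make the point about precision concrete, the short sketch below rounds a calculated PFD to one significant figure and places it in the familiar IEC 61511 low-demand SIL bands. The function names and the example PFD value are illustrative assumptions only.

```python
import math

def round_to_1_sig_fig(x: float) -> float:
    """Round a positive value to one significant figure."""
    exponent = math.floor(math.log10(x))
    return round(x, -exponent)

def sil_band_low_demand(pfd_avg: float) -> str:
    """Map an average PFD to the IEC 61511 low-demand SIL bands."""
    if 1e-2 <= pfd_avg < 1e-1:
        return "SIL 1"
    if 1e-3 <= pfd_avg < 1e-2:
        return "SIL 2"
    if 1e-4 <= pfd_avg < 1e-3:
        return "SIL 3"
    if 1e-5 <= pfd_avg < 1e-4:
        return "SIL 4"
    return "outside SIL 1-4 bands"

pfd = 1.132e-3                      # e.g. a value produced by a detailed model (assumed)
print(round_to_1_sig_fig(pfd))      # 0.001 - the only meaningful precision
print(sil_band_low_demand(pfd))     # SIL 2
```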

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and they are all under the control of the designers and the operations and maintenance team.
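One simple way to turn these calculations into operational targets is to invert the common cause term, which dominates the 1oo2 PFD. The sketch below back-calculates the largest λDU that still meets a SIL 3 target for assumed values of β and T; the numbers chosen are illustrative assumptions, not recommendations.

```python
# Sketch: back-calculate a target lambda_DU for a SIL 3 (RRF >= 1000) 1oo2 valve pair,
# assuming the common cause term beta * lambda_DU * T / 2 dominates the PFD.
pfd_target = 1e-3   # minimum SIL 3 risk reduction (RRF = 1000)
beta = 0.05         # assumed achievable common cause factor
T = 1.0             # proof test interval in years (assumed)

lambda_du_max = 2 * pfd_target / (beta * T)
print(f"lambda_DU must be below about {lambda_du_max:.3f} per annum "
      f"(MTBF_DU of roughly {1 / lambda_du_max:.0f} years)")
# With beta = 10% and T = 1 year the same target gives lambda_DU < 0.02 pa,
# which is why beta, T and lambda_DU all need attention for SIL 3.
```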

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
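One straightforward way to apply that guidance is to convert the design failure rate into an expected number of failures for the installed population over a review period, and then check whether the observed count is consistent with that expectation. The sketch below does this with a simple Poisson comparison; the population size, review period, rates and counts are assumed purely for illustration.

```python
# Sketch: compare observed failures against the number expected from the design
# failure rate, following the idea of setting target values for expected failures.
from math import exp, factorial

lambda_target = 0.02      # design lambda_DU, failures per device-year (assumed)
population = 40           # number of similar devices in service (assumed)
years = 3.0               # review period (assumed)
observed_failures = 5     # failures actually recorded (assumed)

expected = lambda_target * population * years   # expected count = 2.4

# Probability of seeing at least the observed count if the target rate is being met.
p_at_least = 1.0 - sum(exp(-expected) * expected**k / factorial(k)
                       for k in range(observed_failures))
print(f"expected failures : {expected:.1f}")
print(f"P(>= {observed_failures} failures | target rate) = {p_at_least:.2f}")
# A small probability here suggests the target rate is being exceeded and the
# causes of the failures need to be analysed.
```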


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016.

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007.

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010.

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010.

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489.

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015.

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013.

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013.

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008.

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015.

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990.


A long-standing debate

Should the calculated probability of failure take into account systematic failures?

The intent of IEC 61508, IEC 61511 and ISO/TR 12489 is to calculate failure probability only for random failure.

ISA TR84.00.02, exida, the SINTEF PDS Method and OREDA specifically include some systematic failures.

The intent of the standards

Manage the risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce the risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
  ‒ Apply diagnostics for fault detection and regular periodic testing
  ‒ Apply redundant equipment for fault tolerance

What is random?

Which of these are random?
• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ 0.47.

If a football team wins 47 matches out of 100, what is the probability they will win their next match?

What is random?

Dictionary:
Made, done, or happening without method or conscious decision; haphazard.

Mathematics:
A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different.

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Guidance from ISO/TR 12489

Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
• Human – operating under stress, non-routine: variable λ

Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude

Purely random failure

Only 'catalectic' failures have constant failure rates.

ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure.

Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508

IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.

Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures?

Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)

Most failures are partially systematic and partially random.

Quasi-random hardware failures

Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate

If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

e.g. 10% chance that a device picked at random has failed

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)

The basic assumption in failure probability

If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is linear:

PFD ≈ λDU·t

With a proof test interval T the average is simply:

PFDavg ≈ λDU·T/2

PFD for 1oo2 final elements

In the process sector, SIF PFD is dominated by the PFD of the final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.

PFD for 1oo2 final elements

In the process sector, SIF PFD is dominated by the PFD of the final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.

Dependence on failure rate

Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.

The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures

OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU

Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum

Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum

Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum

These are order-of-magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets
(risk targets can only be in orders of magnitude)

Typical feasible values for β

IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% – 15% in practice (Ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application

Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed?

For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements unless overall λDU can be reduced to < 0.02 pa (MTBF_DU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design

Design assumptions set performance benchmarks

Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential

Operators must monitor the failure rates and modes:
• Calculate actual λDU, compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
  ‒ preventing systematic failures
• Volume of operating experience provides evidence that
  ‒ systematic capability is adequate
  ‒ inspection and maintenance is effective
  ‒ failures are monitored, analysed and understood
  ‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.


PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
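As a numerical illustration of this assumption, the minimal sketch below shows how the exponential accumulation 1 − e^(−λDU·t) reduces to the linear approximation λDU·t when λDU·t is small, and how averaging it over a proof test interval T gives approximately λDU·T/2. The rate and interval are assumed example values only.

```python
# Sketch: exponential accumulation of undetected failures and the PFDavg ~ lambda*T/2
# approximation, using assumed example values.
import math

lambda_du = 0.02   # undetected dangerous failure rate, per annum (assumed)
T = 1.0            # proof test interval, years (assumed)

# Instantaneous PFD at the end of the interval: exact vs linear approximation.
pfd_exact_at_T = 1 - math.exp(-lambda_du * T)
pfd_linear_at_T = lambda_du * T
print(pfd_exact_at_T, pfd_linear_at_T)          # ~0.0198 vs 0.02

# Average PFD over the interval by simple numerical integration.
n = 10000
pfd_avg = sum(1 - math.exp(-lambda_du * (k + 0.5) * T / n) for k in range(n)) / n
print(pfd_avg, lambda_du * T / 2)               # ~0.0099 vs 0.01
```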

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.

The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
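For illustration, a commonly used form of this estimate for a time-terminated observation is λa = χ²(a, 2n + 2) / (2·Tservice), where n is the number of failures and Tservice is the cumulative time in service. The sketch below uses this form to show that the ratio λ90/λ70 depends only on n; it is an illustrative calculation under that assumed formula, not a method prescribed by this paper.

```python
# Sketch: upper confidence bounds on a failure rate from n observed failures
# in a cumulative time in service, using the chi-squared distribution.
from scipy.stats import chi2

def lambda_upper(confidence: float, failures: int, service_years: float) -> float:
    """Failure rate estimated at the given confidence level (time-terminated form)."""
    return chi2.ppf(confidence, 2 * failures + 2) / (2.0 * service_years)

service_years = 500.0          # assumed cumulative device-years in service
for n in (3, 10):
    l70 = lambda_upper(0.70, n, service_years)
    l90 = lambda_upper(0.90, n, service_years)
    print(f"n={n:2d}  lambda70={l70:.4f}/yr  lambda90={l90:.4f}/yr  ratio={l90 / l70:.2f}")
# The ratio falls from about 1.4 at n = 3 to roughly 1.2 once 10 or more
# failures have been recorded, independent of the population size.
```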

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'


Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures Most failures fall somewhere in the middle between the two extremes of purely random and purely deterministic Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure In practice failures are not completely preventable because access to the equipment and resources are limited Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant These assumptions enable the PFD and RRF to be estimated within an order of magnitude given reasonably effective quality control in design manufacture operation and maintenance The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 11

Strategies for improving risk reduction

1 The first priority in functional safety is to eliminate prevent avoid or control systematic failures throughout the entire system lifecycle Failures are minimised by applying conventional quality management and project management practices This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function It includes the achievement of an appropriate level of systematic capability

The level of attention to detail and the effectiveness of the processes techniques and measures must be in proportion to the target level of risk reduction SIL 3 functions need much stricter quality control than SIL 1 functions

2 The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics accessibility inspection testing maintenance and renewal

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction The requirements also depend on the cost of downtime To achieve SIL 3 safety functions will always need to be designed to enable ready access for inspection testing and maintenance

The planning for inspection and testing should be in proportion to the target level of risk reduction Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment

3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions Common cause failures dominate the PFD in all voted architectures of sensors and of final elements

4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future

If the measured failure rates are higher than the target benchmark then the reasons need to be understood and remedial action taken Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 12

References

[1] lsquoFunctional safety ndash Safety instrumented systems for the process industry sector ndash Part 1 Framework definitions system hardware and application programming requirementsrsquo IEC 61511-12016

[2] lsquoEquipment reliability testing Part 6 Tests for the validity and estimation of the constant failure rate and constant failure intensityrsquo IEC 60605-62007

[3] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems ndash Part 4 Definitions and abbreviationsrsquo IEC 61508-42010

[4] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems - Guidelines on the application of IEC 61508-2 and IEC 61508-3rsquo IEC 61508-62010

[5] lsquoPetroleum petrochemical and natural gas industries mdash Reliability modelling and calculation of safety systemsrsquo ISOTR 12489

[6] lsquoSafety Integrity Level (SIL) Verification of Safety Instrumented Functionsrsquo ISA-TR840002-2015

[7] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Method Handbookrsquo SINTEF Trondheim Norway Report A24442 2013

[8] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Data Handbookrsquo SINTEF Trondheim Norway Report A24443 2013

[9] S Hauge and M A Lundteigen lsquoGuidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phasersquo SINTEF Trondheim Norway Report A8788 2008

[10] S Hauge et al lsquoCommon Cause Failures in Safety Instrumented Systemsrsquo SINTEF Trondheim Norway Report A26922 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook Vol 1 6th ed SINTEF Technology and Society Department of Safety Research Trondheim Norway 2015

[12] Safety Equipment Reliability Handbook 4th ed exidacom LLC Sellersville PA 2015

[13] FHC Marriott lsquoA Dictionary of Statistical Termsrsquo 5th Edition International Statistical Institute Longman Scientific and Technical 1990


The intent of the standards

Manage risk of systematic (i.e. preventable) failures:
• Prevent errors and failures in design and implementation
• By applying quality management methods

Reduce risk of random hardware failures:
• For the failures that cannot be effectively prevented
• Calculate failure probability based on failure rates
• Reduce the probability of failure to achieve the required risk reduction target
  ‒ Apply diagnostics for fault detection and regular periodic testing
  ‒ Apply redundant equipment for fault tolerance

What is random?

Which of these are random?
• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ 0.47.

If a football team wins 47 matches out of 100, what is the probability they will win their next match?

What is random?

Dictionary: 'Made, done, or happening without method or conscious decision; haphazard'

Mathematics: A purely random process involves mutually independent events. The probability of any one event is not dependent on other events.

These definitions are significantly different.

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
• Human – operating under stress, non-routine: variable λ

Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude

Purely random failure

Only 'catalectic' failures have constant failure rates.

ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508

IEC 61511-1 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Random hardware failure is not limited to 'catalectic' failure. It cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures?

Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)

Most failures are partially systematic and partially random.

Quasi-random hardware failures

Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but with a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate

If undetected device failures occur at a constant average rate, then failures accumulate exponentially:

PFD(t) = ∫₀ᵗ λDU e^(−λDU·τ) dτ = 1 − e^(−λDU·t)

(plotted against λt on the slide; e.g. PFD = 0.1 means a 10% chance that a device picked at random has failed)

The basic assumption in failure probability

If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply:

PFDavg ≈ λDU·T/2
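As a rough illustration of these approximations (a minimal Python sketch added here, not part of the original slides; λDU and T are assumed example values):

```python
import math

lam_du = 0.02   # assumed example: dangerous undetected failure rate, per year
T = 1.0         # assumed example: proof test interval, years

def pfd(t, lam):
    """Probability that an undetected failure has accumulated by time t."""
    return 1.0 - math.exp(-lam * t)

# The linear approximation PFD ≈ λDU·t holds while λDU·t << 1
print(pfd(T, lam_du))   # exact:       ~0.0198
print(lam_du * T)       # approximate:  0.02

# Average PFD over one proof test interval (1oo1, low demand)
pfd_avg_exact = 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)
pfd_avg_simple = lam_du * T / 2.0
print(pfd_avg_exact, pfd_avg_simple)   # both ≈ 0.01
```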

PFD for 1oo2 final elements

In the process sector, SIF PFD is dominated by the PFD of the final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates the PFD.
The PFD depends equally on λDU, β and T.
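A minimal sketch of this approximation (Python; the numbers are assumed examples, not values from the slides) makes the relative size of the two terms visible:

```python
def pfd_avg_1oo2(lam_du, T, beta):
    """Approximate average PFD for 1oo2 final elements (low demand)."""
    independent = (1.0 - beta) * (lam_du * T) ** 2 / 3.0
    common_cause = beta * lam_du * T / 2.0
    return independent + common_cause

# Assumed example values: λDU = 0.02 pa, T = 1 year, β = 10%
print(pfd_avg_1oo2(0.02, 1.0, 0.10))
# ≈ 1.2e-04 + 1.0e-03 ≈ 1.1e-03: the common cause term dominates
```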

Dependence on failure rate

Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.

The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures

OREDA reports failure rates that are feasible in practice. Typically, failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU

Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum.

These are order-of-magnitude estimates – but they are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).

Typical feasible values for β

IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% – 15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates the PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on the application

Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed?

For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements unless the overall λDU can be reduced to < 0.02 pa (MTBF for DU failures > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
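The following sketch (Python, using the typical values quoted on this slide; the helper names are ours, not from the presentation) checks which low-demand SIL band the resulting PFDavg falls into:

```python
def pfd_avg(lam_du, T, beta=0.0, voted_1oo2=False):
    """Approximate average PFD for 1oo1 or 1oo2 final elements."""
    if voted_1oo2:
        return (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2
    return lam_du * T / 2

def sil_band(pfd):
    """Map an average PFD to the low-demand SIL bands of IEC 61511."""
    if pfd >= 0.1:
        return "below SIL 1"
    for sil, lower_limit in ((1, 0.01), (2, 0.001), (3, 0.0001)):
        if pfd >= lower_limit:
            return f"SIL {sil}"
    return "SIL 4 range"

lam_du, T, beta = 0.02, 1.0, 0.10   # typical values from the slide

print(sil_band(pfd_avg(lam_du, T)))                          # 1oo1: PFD = 0.01, right at the SIL 1/2 boundary
print(sil_band(pfd_avg(lam_du, T, beta, voted_1oo2=True)))   # 1oo2: PFD ≈ 1.1e-3, just outside the SIL 3 band
```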

Design assumptions set performance benchmarks

The probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential

Operators must monitor the failure rates and modes:
• Calculate the actual λDU and compare it with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
  ‒ preventing systematic failures
• Volume of operating experience provides evidence that
  ‒ systematic capability is adequate
  ‒ inspection and maintenance are effective
  ‒ failures are monitored, analysed and understood
  ‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.

Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than


purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
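A rough illustration of these two steps (our own Python sketch, not from the paper; the failure rates are assumed example values):

```python
import math

def component_pfd(lam_du, t):
    """PFD of one component at time t since its last proof test,
    assuming a constant dangerous undetected failure rate lam_du."""
    return 1.0 - math.exp(-lam_du * t)

def series_system_pfd(component_pfds):
    """PFD of a system that fails if any one of its components has failed
    (e.g. sensor, logic solver and final element in series),
    assuming the component failures are independent."""
    probability_all_ok = 1.0
    for p in component_pfds:
        probability_all_ok *= (1.0 - p)
    return 1.0 - probability_all_ok

# Assumed example rates (per year), one year after the last proof test
parts = [component_pfd(lam, 1.0) for lam in (0.003, 0.001, 0.02)]
print(series_system_pfd(parts))   # ≈ 0.024, dominated by the final element
```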

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
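A sketch of this estimate (Python with SciPy; the cumulative time in service and the failure counts are assumed examples, and the 2n+2 degrees of freedom is the common time-terminated convention) shows how the λ90/λ70 ratio depends only on the number of failures:

```python
from scipy.stats import chi2

def lambda_upper(n_failures, total_time_years, confidence):
    """One-sided upper bound on the failure rate, given n failures observed
    over a cumulative time in service (time-terminated test, 2n+2 degrees
    of freedom)."""
    return chi2.ppf(confidence, 2 * n_failures + 2) / (2.0 * total_time_years)

total_time = 5000.0   # assumed: cumulative device-years in service
for n in (1, 3, 10):
    l70 = lambda_upper(n, total_time, 0.70)
    l90 = lambda_upper(n, total_time, 0.90)
    print(n, round(l70, 5), round(l90, 5), round(l90 / l70, 2))
# The λ90/λ70 ratio shrinks from about 1.6 at one failure to roughly
# 1.2-1.25 once ten or more failures have been recorded.
```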

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures, λDU, in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
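A quick check of the two terms for this example (our own Python sketch; the β values are taken from the ranges quoted above) makes the dominance of the common cause term explicit:

```python
lam_du, T = 0.02, 1.0   # example values from the paragraph above

for beta in (0.02, 0.10, 0.15):
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    share = common_cause / (independent + common_cause)
    print(f"beta={beta:.0%}: independent={independent:.1e}, "
          f"common cause={common_cause:.1e} ({share:.0%} of total)")
# Even at β = 2% the common cause term is roughly 60% of the PFD;
# at 10-15% it contributes roughly 90% or more.
```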


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures, λDD, and the mean time to restoration (MTTR).
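The detected-failure contribution can be sketched the same way (Python; the rate and repair time are assumed examples, not values from the paper):

```python
lam_dd = 0.1          # assumed: detected dangerous failures per year
mttr_hours = 72.0     # assumed: mean time to restoration
hours_per_year = 8760.0

# While a detected fault is being repaired the function is unavailable,
# so its average contribution to PFD is roughly λDD × MTTR.
pfd_detected = lam_dd * (mttr_hours / hours_per_year)
print(pfd_detected)   # ≈ 8.2e-04
```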

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.

The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. a 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with a 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval


between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
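A simple monitoring sketch along these lines (Python with SciPy; the benchmark rate, population, service period and failure count are all assumed examples) compares the observed number of failures with the number expected from the design assumption:

```python
from scipy.stats import poisson

lam_benchmark = 0.02      # assumed: failures per device-year from the design calculation
n_devices = 60            # assumed: population of similar final elements
years_in_service = 3.0    # assumed: observation period
observed_failures = 9     # assumed: count taken from maintenance records

expected = lam_benchmark * n_devices * years_in_service   # 3.6 failures expected
# Probability of seeing at least this many failures if the benchmark rate held:
p_at_least_observed = poisson.sf(observed_failures - 1, expected)

print(expected, observed_failures, p_at_least_observed)
if p_at_least_observed < 0.05:
    print("Failure rate appears higher than the benchmark: analyse the causes and "
          "consider compensating measures, not just more frequent testing.")
```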


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that


appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016.

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007.

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010.

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010.

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013.

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015.

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013.

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013.

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008.

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015.

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th edition, International Statistical Institute, Longman Scientific and Technical, 1990.


What is random?

Which of these are random?

• Tossing a coin
• Horse race
• Football match

If I record 47 heads in 100 tosses, P(head) ≈ 0.47

If a football team wins 47 matches out of 100, what is the probability that they will win their next match?

What is random?

Dictionary:
'Made, done, or happening without method or conscious decision; haphazard'

Mathematics:
A purely random process involves mutually independent events. The probability of any one event does not depend on other events.

These definitions are significantly different.

Guidance from ISO/TR 12489

ISO/TR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6.

Guidance from ISO/TR 12489

Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age- and wear-related failures in the mid-life period)
• Human – operating under stress, non-routine: variable λ

Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude

Purely random failure

Only 'catalectic' failures have constant failure rates.

ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure

Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508

IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware

Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures?

Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age- or wear-related deterioration (rate not constant)

Most failures are partially systematic and partially random.

Quasi-random hardware failures

Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate

If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

e.g. a 10% chance that a device picked at random has failed

PFD(t) = ∫₀ᵗ λDU e^(−λDU τ) dτ = 1 − e^(−λDU t) ≈ λDU t (for λDU t ≪ 1)

The basic assumption in failure probability

If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU t

With a proof test interval T, the average is simply PFDavg ≈ λDU T/2
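To make the approximation concrete, here is a minimal sketch (not from the original slides; the numbers are the typical values quoted later in this deck) comparing the exact exponential expression for PFD(t) with the linear approximation and with the simple average λDU T/2 over one proof-test interval.

```python
import math

lam_du = 0.02   # dangerous undetected failure rate, per year (typical valve assembly)
T = 1.0         # proof-test interval, years

def pfd_exact(t, lam):
    """Probability that an undetected failure has accumulated by time t."""
    return 1.0 - math.exp(-lam * t)

def pfd_linear(t, lam):
    """Linear approximation, valid while lam * t << 1."""
    return lam * t

# Average PFD over one proof-test interval
pfd_avg_exact = 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)
pfd_avg_simple = lam_du * T / 2.0

print(f"PFD at end of interval: exact {pfd_exact(T, lam_du):.5f}, linear {pfd_linear(T, lam_du):.5f}")
print(f"Average PFD over interval: exact {pfd_avg_exact:.5f}, simple formula {pfd_avg_simple:.5f}")
```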

PFD for 1oo2 final elements

In the process sector, SIF PFD is dominated by the PFD of the final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.

Dependence on failure rate

Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.

The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures

OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear: reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU

Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum

Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum

Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum

These are order-of-magnitude estimates, but they are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).

Typical feasible values for β

IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% - 15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application

Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed?

For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 pa (MTBF DU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
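As an illustration of how these typical values map onto SIL feasibility, the sketch below evaluates 1oo1 and 1oo2 final-element architectures with the β, λDU and T quoted above. It is not part of the original slides; the SIL band boundaries are the standard low-demand PFDavg ranges from IEC 61508/IEC 61511.

```python
lam_du = 0.02   # per annum, typical actuated valve assembly
beta = 0.10     # common cause fraction (10%)
T = 1.0         # proof-test interval, years

pfd_1oo1 = lam_du * T / 2
pfd_1oo2 = (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

def sil_band(pfd):
    """Map an average PFD to the low-demand SIL band (IEC 61508 / IEC 61511)."""
    if pfd >= 0.1:
        return "below SIL 1"
    if pfd >= 0.01:
        return "SIL 1"
    if pfd >= 0.001:
        return "SIL 2"
    if pfd >= 0.0001:
        return "SIL 3"
    return "SIL 4"

for name, pfd in (("1oo1", pfd_1oo1), ("1oo2", pfd_1oo2)):
    print(f"{name}: PFDavg ≈ {pfd:.1e}, RRF ≈ {1 / pfd:.0f}, feasible band: {sil_band(pfd)}")
```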

Design assumptions set performance benchmarks

Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential

Operators must monitor the failure rates and modes:
• calculate actual λDU, compare with design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
‒ preventing systematic failures
• volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.

Presented at the 13th International TÜV Rheinland Symposium, 15-16 May 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd
Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- identify and then control conditions that induce early failures

- actively prevent common-cause failures.

4. Analysing all failures in detail to:

- monitor failure rates

- determine how failure recurrence may be prevented, if failure rates need to be reduced.

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
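As a simple illustration of combining component PFDs (not from the paper; the logic solver figure is an assumed placeholder), for a series arrangement in which a dangerous failure of any one subsystem defeats the whole function, small subsystem PFDs add to first order:

```python
# Illustrative series combination: a dangerous failure of any one subsystem
# defeats the whole SIF, so (for small values) the subsystem PFDs simply add.
pfd_sensor = 0.003 * 1.0 / 2     # λDU ≈ 0.003 pa, T = 1 y
pfd_logic = 1e-4                 # assumed placeholder for a certified logic solver
pfd_final = 0.02 * 1.0 / 2       # λDU ≈ 0.02 pa, T = 1 y

pfd_sif = pfd_sensor + pfd_logic + pfd_final
print(f"PFDavg of the function ≈ {pfd_sif:.2e} (dominated by the final element)")
```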

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
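A sketch of how such an estimate can be computed is shown below. It assumes the common chi-squared estimator for a time-truncated observation, λa = χ²(a; 2(n+1)) / (2T); the exact degrees-of-freedom convention used in the paper is not stated, so the ratios printed here may differ slightly from the figures quoted above.

```python
from scipy.stats import chi2

def lambda_upper(n_failures, unit_years, confidence):
    """Upper confidence bound on a constant failure rate from field data.

    Chi-squared estimator for a time-truncated observation:
    lambda_a = chi2.ppf(a, 2 * (n + 1)) / (2 * T).
    """
    return chi2.ppf(confidence, 2 * (n_failures + 1)) / (2.0 * unit_years)

# Example: 3 failures recorded over 1500 device-years in service
for a in (0.70, 0.90):
    print(f"lambda_{int(a * 100):d} = {lambda_upper(3, 1500.0, a):.5f} per year")

# The ratio of lambda_90 to lambda_70 depends only on the number of failures recorded
for n in (3, 10, 30):
    ratio = chi2.ppf(0.90, 2 * (n + 1)) / chi2.ppf(0.70, 2 * (n + 1))
    print(f"{n:2d} failures: lambda_90 / lambda_70 = {ratio:.2f}")
```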

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process: degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
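These figures can be checked numerically. The short sketch below (illustrative only, using the example values above) evaluates both terms of the 1oo2 approximation for a range of β values and shows where the common cause term takes over.

```python
lam_du = 0.02   # per annum
T = 1.0         # years

def pfd_terms(beta):
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    return independent, common_cause

for beta in (0.01, 0.02, 0.05, 0.10, 0.15):
    ind, ccf = pfd_terms(beta)
    share = ccf / (ind + ccf)
    print(f"beta = {beta:.0%}: independent term {ind:.1e}, "
          f"common cause term {ccf:.1e}, CCF share {share:.0%}")
```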


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
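As a rough numerical comparison (an illustrative sketch; the λDD and MTTR values are assumptions, not figures from the paper), the contribution of detected failures can be set against the undetected term:

```python
HOURS_PER_YEAR = 8760

lam_du = 0.02        # undetected dangerous failures, per year
lam_dd = 0.05        # detected dangerous failures, per year (assumed value)
T = 1.0              # proof-test interval, years
mttr_hours = 72.0    # mean time to restoration per detected failure (assumed value)

pfd_undetected = lam_du * T / 2
pfd_detected = lam_dd * (mttr_hours / HOURS_PER_YEAR)

print(f"undetected-failure contribution ≈ {pfd_undetected:.1e}")
print(f"detected-failure (repair time) contribution ≈ {pfd_detected:.1e}")
```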

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
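One simple way to operationalise this kind of follow-up is sketched below. The Poisson assumption and the acceptance criterion are illustrative choices, not requirements of the SINTEF guideline; the idea is only to compare the observed number of failures with the number expected from the benchmark rate.

```python
from scipy.stats import poisson

def check_against_benchmark(n_observed, n_devices, years, lam_benchmark, alpha=0.05):
    """Compare observed failures with the number expected from the benchmark rate.

    Assumes the group of devices fails approximately as a Poisson process at the
    benchmark rate; alpha is the chosen false-alarm probability for flagging.
    """
    expected = lam_benchmark * n_devices * years
    # probability of seeing at least n_observed failures if the benchmark holds
    p_value = poisson.sf(n_observed - 1, expected)
    return expected, p_value, p_value < alpha

# Example: 60 valves, benchmark λDU = 0.02 pa, 2 years of operation, 6 DU failures found
expected, p, investigate = check_against_benchmark(6, 60, 2.0, 0.02)
print(f"expected ≈ {expected:.1f}, observed 6, p ≈ {p:.3f}, "
      f"{'investigate causes' if investigate else 'within benchmark'}")
```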


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction: SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety - Safety instrumented systems for the process industry sector - Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing - Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems - Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries - Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems - PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems - PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • l90 compared with l70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported l
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for 120582 119863119880
  • Typical feasible values for 120573
  • Typical values for 119879 depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paperpdf
    • PREVENT PREVENTABLE FAILURES
Page 17: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

What is randomDictionary

Made done or happening without method or conscious decision haphazard

MathematicsA purely random process involves mutually independent events The probability of any one event is not dependent on other events

These definitions are significantly different

Guidance from ISOTR 12489

ISOTR 12489 provides detailed explanation and guidance for the failure probability calculations given in IEC 61508-6

Guidance from ISOTR 12489RandomHardware ndash electronic components constant λHardware ndash mechanical components non-constant λ(age and wear related failures in mid-life period)Human ndash operating under stress non-routine variable λSystematic - cannot be quantified by a fixed rateHardware - specification design installation operationSoftware - specification coding testing modification Human ndash depending on training understanding attitude

Purely random failureOnly lsquocatalecticrsquo failures have constant failure ratesISOTR 12489 sect329 catalectic failuresudden and complete failureNote 1 to entry [hellip] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item It is the contrary of failures occurring progressively and incompletelyNote 2 to entry Catalectic failures characterize simple components with constant failure rates (exponential law) they remain permanently ldquoas good as newrdquo until they fail suddenly completely and without warning Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach)

Definition in IEC 61511 and IEC 61508IEC 61511 sect3259 and IEC 61508-4 sect365random hardware failurefailure occurring at a random time which results from one or more of the possible degradation mechanisms in the hardwareNote 1 to entry There are many degradation mechanisms occurring at different rates in different components and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation failures of a total equipment comprising many components occur at predictable rates but at unpredictable (ie random) times

Not limited to lsquocatalecticrsquo failureCannot be characterised by fixed constant failure rate ndashthough the rates are measurable and may be predictable

Random or systematic failuresPressure transmitters

bull Blocked tubingbull Corroded diaphragmbull Sudden electronic component failurebull Calibration drift due to vibrationbull Overheated transducerbull Tubing leakbull Isolation valve closedbull High impedance jointbull Water ingress partial short circuitbull Supply voltage outside limitsbull Age or wear related deterioration (rate not constant)Most failures are partially systematic and partially random

Quasi-random hardware failuresMost hardware failures are not purely randomThe failure causes are well known and understoodBut many failures cannot be prevented in practice

bull due to lack of maintenance resources and accessbull treated as quasi-random

Failure rates can be measuredbull may be reasonably constant for a given operatorbull but a wide variation between different operatorsbull not a fixed constant ratebull can be reduced by deliberate effort

Failure probability depends on failure rateIf undetected device failures occur at a constant average rate then failures accumulate exponentially

eg 10 chance that a device picked at random has failed

119875119875119875119875119875119875 119905119905 = 0

119905119905λDU 119890119890minusλ119863119863119863119863120591120591119889119889119889119889 = 1 minus 119890119890minusλ119863119863119863119863119905119905

λt

The basic assumption in failure probabilityIf undetected device failures occur at a constant average rate then failures accumulate exponentially

SIFs always require PFDavg lt 01 in this region the accumulation is linear

PFD asymp λDUt

With a proof test interval T the average is simplyPFDavg asymp λDUT 2

PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common cause120573120573120582120582119863119863119863119863119879119879

2

PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common causeCommon cause failure always strongly dominates PFDThe PFD depends equally on 120582120582119863119863119863119863 120573120573 and 119879119879

120573120573120582120582119863119863119863119863119879119879

2

Dependence on failure rateOur estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elementsThe failure rate

bull can be measuredbull may be predictable bull is not a fixed constant rate

Typical failure rates are easy to obtainFailure rates can be controlled and reduced

Managing and reducing failuresOREDA reports failure rates that are feasible in practiceTypically failure rates vary over at least an order of magnitudeThe scope for reduction in failure is clearReduction in rate by a factor of 10 may be feasible

Typical feasible values for 120582120582119863119863119863119863Sensors typically have MTBF asymp 300 years120582120582119863119863119863119863 asymp 0003 per annum

Actuated valve assemblies typically have MTBF asymp 50 years120582120582119863119863119863119863 asymp 002 per annum

Contactors or relays typically have MTBF asymp 200 years120582120582119863119863119863119863 asymp 0005 per annum

These are order of magnitude estimates ndashbut are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude)

Typical feasible values for 120573120573IEC 61508-6 Annex D suggests a range of 1 to 10

Typically 120573120573 asymp 12 - 15 in practice (Ref SINTEF Report A26922)

bull difficult to reduce below 5 with similar devicesbull strongly dominates PFD of voted architectures

Minimise 120573120573 through bull independence and diversity in design and maintenance bull preventing systematic failures

Typical values for 119879119879 depend on application Batch process 119879119879 lt 01 years

‒ Results in low PFD final elements might not dominate

Continuous process 119879119879 asymp 1 year

LNG production 119879119879 gt 4 years‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce 119879119879

What architecture is typically neededFor typical values 120573120573 asymp 10 120582120582119863119863119863119863 asymp 002 pa 119879119879 asymp 1 ySIL 1 is always easy to achieve without HFT

SIL 2 needs 1oo2 final elements unless overall 120582120582119863119863119863119863 can be reduced to lt 002 pa (MTBFDU gt 50 y)

SIL 3 would need 1oo2 final elements andoverall 120582120582119863119863119863119863 lt 002 pa

bull either 120582120582119863119863119863119863 120573120573 andor 119879119879 must be minimisedbull needs close attention in design

Design assumptions set performance benchmarksProbability of SIF failure on demand is proportional to

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

120573120573 120582120582119863119863119863119863 119879119879 and MTTR

Monitoring performance is essential

Operators must monitor the failure rates and modes:
• calculate actual λDU and compare with the design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
‒ preventing systematic failures
• volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.

Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random.


Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
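As a minimal illustration of that combination (values are assumptions for the sketch, not data from the paper): each subsystem's average PFD over a proof-test interval T follows from its λDU, and for small PFDs the series combination of sensor, logic solver and final element is approximately the sum of the subsystem PFDs.

import math

def pfd_avg_1oo1(lambda_du, test_interval):
    # Exact average of 1 - exp(-lambda*t) over one proof-test interval;
    # for small lambda*T this reduces to the familiar lambda_DU * T / 2.
    return 1 - (1 - math.exp(-lambda_du * test_interval)) / (lambda_du * test_interval)

# Illustrative per-annum rates and a 1-year proof-test interval (assumed values)
pfd_sensor = pfd_avg_1oo1(0.003, 1.0)
pfd_logic = pfd_avg_1oo1(0.0005, 1.0)
pfd_valve = pfd_avg_1oo1(0.02, 1.0)

# For small PFDs a series (1oo1) combination is approximately the sum
pfd_sif = pfd_sensor + pfd_logic + pfd_valve
print(f"PFDavg ≈ {pfd_sif:.1e}, RRF ≈ {1 / pfd_sif:.0f}")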

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
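A minimal sketch of such an estimate (the failure record below is an assumed example, and the time-terminated chi-squared form is one commonly used convention rather than a formula taken from this paper):

from scipy.stats import chi2

def lambda_upper(n_failures, total_time_years, confidence):
    # One-sided upper-bound estimate of the failure rate (per year) from
    # n failures over a cumulative time in service (time-terminated form).
    return chi2.ppf(confidence, 2 * n_failures + 2) / (2 * total_time_years)

# Illustrative record (assumed): 10 failures over 500 device-years in service
n, t = 10, 500.0
l70 = lambda_upper(n, t, 0.70)
l90 = lambda_upper(n, t, 0.90)
print(f"λ70 ≈ {l70:.4f}/yr, λ90 ≈ {l90:.4f}/yr, ratio ≈ {l90 / l70:.2f}")
# With 10 failures λ90 is only ~20-25 % above λ70; with fewer failures the
# uncertainty band is much wider.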

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard.'

ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy.'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
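A quick numeric check of that comparison, using the example values above (illustrative sketch only):

# Compare the two terms of the 1oo2 approximation for the example values
# above (lambda_DU = 0.02 per year, T = 1 year).
lambda_du, T = 0.02, 1.0

for beta in (0.01, 0.02, 0.05, 0.10, 0.15):
    independent = (1 - beta) * (lambda_du * T) ** 2 / 3
    common_cause = beta * lambda_du * T / 2
    print(f"beta = {beta:.0%}: independent ≈ {independent:.1e}, "
          f"common cause ≈ {common_cause:.1e}")
# Already at beta = 2 % the common cause term is the larger contribution;
# at the 10-15 % values reported in practice it dominates by roughly an
# order of magnitude.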

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T.


If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
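A minimal sketch of that kind of follow-up check (the target rate, population, review period and failure count are assumed figures for illustration, not values from the report):

def expected_failures(target_rate_per_year, population, years):
    # Expected number of dangerous undetected failures if the design
    # benchmark failure rate is actually being met.
    return target_rate_per_year * population * years

# Illustrative follow-up (assumed figures): 40 valves, design benchmark
# lambda_DU = 0.02 per year, reviewed after 2 years, 3 failures recorded.
expected = expected_failures(0.02, 40, 2.0)
observed = 3
print(f"expected ≈ {expected:.1f}, observed = {observed}")
if observed > expected:
    print("Observed failures exceed the design benchmark: analyse causes "
          "and consider compensating measures.")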


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
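As a rough illustration of that proportionality (assumed bypass and repair hours, not figures from the paper):

# Unavailability contributed by time out of service (bypass or repair).
# Assumed example: 24 hours of on-line testing plus 12 hours of repair
# with the function bypassed, per year.
hours_out_of_service = 24 + 12
hours_per_year = 8760

pfd_downtime = hours_out_of_service / hours_per_year
print(f"PFD contribution from downtime ≈ {pfd_downtime:.1e}")
# ≈ 4e-03: a large share of a SIL 2 budget (PFDavg < 1e-02), which is why
# MTTR must be monitored against targets.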

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures.


The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016.

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007.

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010.

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010.

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013.

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015.

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013.

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013.

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008.

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015.

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990.



PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common cause120573120573120582120582119863119863119863119863119879119879

2

PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common causeCommon cause failure always strongly dominates PFDThe PFD depends equally on 120582120582119863119863119863119863 120573120573 and 119879119879

120573120573120582120582119863119863119863119863119879119879

2

Dependence on failure rateOur estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elementsThe failure rate

bull can be measuredbull may be predictable bull is not a fixed constant rate

Typical failure rates are easy to obtainFailure rates can be controlled and reduced

Managing and reducing failuresOREDA reports failure rates that are feasible in practiceTypically failure rates vary over at least an order of magnitudeThe scope for reduction in failure is clearReduction in rate by a factor of 10 may be feasible

Typical feasible values for 120582120582119863119863119863119863Sensors typically have MTBF asymp 300 years120582120582119863119863119863119863 asymp 0003 per annum

Actuated valve assemblies typically have MTBF asymp 50 years120582120582119863119863119863119863 asymp 002 per annum

Contactors or relays typically have MTBF asymp 200 years120582120582119863119863119863119863 asymp 0005 per annum

These are order of magnitude estimates ndashbut are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude)

Typical feasible values for 120573120573IEC 61508-6 Annex D suggests a range of 1 to 10

Typically 120573120573 asymp 12 - 15 in practice (Ref SINTEF Report A26922)

bull difficult to reduce below 5 with similar devicesbull strongly dominates PFD of voted architectures

Minimise 120573120573 through bull independence and diversity in design and maintenance bull preventing systematic failures

Typical values for 119879119879 depend on application Batch process 119879119879 lt 01 years

‒ Results in low PFD final elements might not dominate

Continuous process 119879119879 asymp 1 year

LNG production 119879119879 gt 4 years‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce 119879119879

What architecture is typically neededFor typical values 120573120573 asymp 10 120582120582119863119863119863119863 asymp 002 pa 119879119879 asymp 1 ySIL 1 is always easy to achieve without HFT

SIL 2 needs 1oo2 final elements unless overall 120582120582119863119863119863119863 can be reduced to lt 002 pa (MTBFDU gt 50 y)

SIL 3 would need 1oo2 final elements andoverall 120582120582119863119863119863119863 lt 002 pa

bull either 120582120582119863119863119863119863 120573120573 andor 119879119879 must be minimisedbull needs close attention in design

Design assumptions set performance benchmarksProbability of SIF failure on demand is proportional to

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

120573120573 120582120582119863119863119863119863 119879119879 and MTTR

Monitoring performance is essentialOperators must monitor the failure rates and modes

bull Calculate actual 120582120582119863119863119863119863 compare with design assumptionbull Analyse discrepancies between expected and actual behaviourbull Root cause analysis prevent the preventable failures

Operators must also monitor MTTR against targetsbull Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performanceFailure rates can be reduced if maintainers have ready access to the equipment for

bull inspectionbull testingbull maintenance and repair

Accessibility and testability must be considered and specified as design requirements

bull additional costbull requirement depends on target RRF PFD and 120582120582119863119863119863119863bull may only be necessary for SIL 2 and SIL 3

Suitability of componentsCertification and prior use are also importantbull Evidence of quality ie systematic capabilitybull Evidence of suitability for service and environment

‒ preventing systematic failuresbull Volume of operating experience provides evidence that

‒ systematic capability is adequate‒ inspection and maintenance is effective‒ failures are monitored analysed and understood‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 sect1153 and IEC 61508-2 sect74433)

Emphasis on prevention instead of calculationPrevent systematic failures through quality

bull systematic capability proven in usebull deliberate quality planning how much quality is enoughbull must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce 120640120640119915119915119915119915bull avoid systematic failures and common cause failures by designbull enable deterioration to be detected (diagnostics inspection test)bull enable equipment to be repaired or renewed before it failsbull ensure testability and accessibility

Monitor and control failure performance in operationbull reduce failure rates that exceed benchmarks

39

Credible reliability data

Performance targets must be feasible by design

equiv feasible performance targets

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 1

PREVENT PREVENTABLE FAILURES

M Generowicz FS Senior Expert (TUumlV Rheinland 18312) Engineering Manager A Hertel Principal Engineer

IampE Systems Pty Ltd

Perth Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions The new requirements are based on the IEC 61508-2 lsquoRoute 2Hrsquo method which relies on increased confidence levels in failure rates

Instead of requiring increased confidence level IEC 61511-1 requires lsquocredible and traceable reliability datarsquo in the quantification of failure probability It is not immediately obvious what this means in practice

Failure probability calculations are based on the assumption that failure rates are fixed and constant That assumption is invalid but the calculations are still useful

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications The variation depends largely on the service operating environment and maintenance practices It is clear that

bull Failures are almost never purely random and as a result

bull Failure rates are not fixed and constant

At best the risk reduction achieved by these safety functions can be estimated only within an order of magnitude Nevertheless even with such imprecise results the calculations are still useful

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions This is important in setting operational reliability benchmarks

Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices

Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 2

purely random Therefore safety function reliability performance can be improved through four key strategies

1 Eliminating systematic and common-cause failures throughout the design development and implementation processes and throughout operation and maintenance practices

2 Designing the equipment to allow access to enable sufficiently frequent inspection testing and maintenance and to enable suitable test coverage

3 Using risk-based inspection and condition-based maintenance techniques to

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4 Detailed analysis of all failures to

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction Functional safety maintains safety integrity of assets in two ways

Systematic safety integrity deals with preventable failures These are failures resulting from errors and shortcomings in the design manufacture installation operation maintenance and modification of the safeguarding systems

Hardware safety integrity deals with controlling random hardware failures These are the failures that occur at a reasonably constant rate and are completely independent of each other They are not preventable and cannot be avoided or eliminated but the probability of these failures occurring (and the resulting risk reduction) can be calculated

Calculation methods

IEC 61508-62010 Annex B[4] provides basic guidance on evaluating probabilities of failure State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO 12489[5]

Several other useful references are available on this subject including ISA-TR840002-2015[6] and the SINTEF PDS Method Handbook[7] These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved The calculations are all based on these assumptions

bull Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

bull Failures occurring within a population of devices are assumed to be independent events

ISOTR 12489 uses the term lsquocatalectic failurersquo to clarify the types of failures that can be expected to occur at fixed rates

329 catalectic failure sudden and complete failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 3

Note 1 to entry This term has been introduced by analogy with the catalectic verses (ie a verse with seven foots instead of eight) which stop abruptly Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item It is the contrary of failures occurring progressively and incompletely

Note 2 to entry Catalectic failures characterize simple components with constant failure rates (exponential law) they remain permanently ldquoas good as newrdquo until they fail suddenly completely and without warning Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach)

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions The required level of redundancy increases with the risk reduction required

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards

IEC 61508 provides two strategies for minimising the required hardware fault tolerance

bull Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (lsquoRoute 1Hrsquo)

bull Increasing the confidence level in the measured failure rates to at least 90 (lsquoRoute 2Hrsquo)

A confidence level of 90 effectively means that there is only a 10 chance that the true average failure rate is greater than the estimated value

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H though it requires only a confidence level of 70 However IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment

Failure rate confidence level

If all of the failures for a given type of equipment are recorded the failure rate λ can be estimated with any required level of confidence by applying a χsup2 (chi-squared) distribution The failure rate estimated with confidence level of lsquoarsquo is designated λa The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
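This behaviour can be reproduced with the commonly used χ² formula for time-truncated data, λa = χ²(a, 2n+2) / (2T), where n is the number of failures and T the cumulative time in service. The sketch below is an illustration under that assumption; the failure counts and service times are placeholders.

```python
from scipy.stats import chi2

def rate_at_confidence(n_failures: int, time_in_service: float, confidence: float) -> float:
    """Upper confidence bound on a constant failure rate, given n failures
    observed over a cumulative time in service (time-truncated data)."""
    return chi2.ppf(confidence, 2 * n_failures + 2) / (2.0 * time_in_service)

# Placeholder data: 3 failures over 150 device-years.
lam70 = rate_at_confidence(3, 150.0, 0.70)
lam90 = rate_at_confidence(3, 150.0, 0.90)
print(lam70, lam90, lam90 / lam70)  # the ratio depends only on the failure count

# With 10 failures the ratio λ90/λ70 falls to roughly 1.2, as noted above.
print(rate_at_confidence(10, 500.0, 0.90) / rate_at_confidence(10, 500.0, 0.70))
```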

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard.'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process; it can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector, the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in the final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
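The dominance of the common cause term is easy to demonstrate numerically. The sketch below simply evaluates the approximation given above over a range of β values; the failure rate and test interval are the example values used in this section.

```python
def pfd_1oo2(lambda_du: float, test_interval: float, beta: float) -> float:
    """Simplified average PFD of a 1oo2 pair of final elements:
    independent-failure term plus common cause term."""
    independent = (1.0 - beta) * (lambda_du * test_interval) ** 2 / 3.0
    common_cause = beta * lambda_du * test_interval / 2.0
    return independent + common_cause

lambda_du, T = 0.02, 1.0  # per year, years
for beta in (0.02, 0.05, 0.10, 0.15):
    print(f"beta = {beta:.2f}  PFDavg = {pfd_1oo2(lambda_du, T, beta):.5f}")
# Even at beta = 0.02 the common cause term (0.0002) already exceeds the
# independent term (about 0.00013).
```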


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with a 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T.


If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
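One way to turn the calculation into a benchmark is to work backwards from the target risk reduction to the failure rate that operations and maintenance must achieve. The sketch below does this for a single (1oo1) final element using the simple approximation PFDavg ≈ λDU·T/2; it is illustrative only, and the RRF target and test interval are assumed values rather than figures from the paper.

```python
def max_lambda_du_1oo1(target_rrf: float, test_interval_years: float) -> float:
    """Maximum dangerous undetected failure rate (per year) for a 1oo1 final
    element, from PFDavg ≈ λDU·T/2 and RRF = 1/PFDavg."""
    return 2.0 / (target_rrf * test_interval_years)

# Assumed SIL 2 target: RRF = 500 with a 1-year proof test interval.
limit = max_lambda_du_1oo1(500.0, 1.0)
print(limit)        # 0.004 per year, i.e. no more than about 1 failure in 250 years
print(1.0 / limit)  # the equivalent MTBF benchmark for dangerous undetected failures
```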

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
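A simple way to carry out this comparison is to treat the benchmark rate as a Poisson mean and check whether the observed number of failures is plausible under it. The sketch below shows one possible check; it is not a method prescribed by the standards or by the SINTEF guideline, and the device counts and rates are placeholders.

```python
from scipy.stats import poisson

def consistent_with_benchmark(observed_failures: int, benchmark_rate: float,
                              device_years: float, alpha: float = 0.05) -> bool:
    """One-sided Poisson check: is the observed failure count consistent with
    the failure rate (per device-year) assumed in the design?"""
    expected = benchmark_rate * device_years
    p_value = poisson.sf(observed_failures - 1, expected)  # P(X >= observed)
    return p_value >= alpha

# Placeholder example: 40 valves monitored for 2 years against λDU = 0.02 pa.
print(consistent_with_benchmark(4, 0.02, 40 * 2))  # 4 failures vs. 1.6 expected
```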


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).
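As a rough illustration of that proportionality (a sketch with assumed numbers, not figures from the paper), the unavailability added by bypass or repair time is simply the fraction of operating time during which the function cannot act:

```python
def pfd_out_of_service(hours_out_per_year: float) -> float:
    """Average PFD contribution from time spent bypassed or awaiting repair."""
    return hours_out_per_year / 8760.0

# Assumed example: 48 hours of bypass per year adds about 0.0055 to the average
# PFD, which by itself consumes most of a SIL 2 budget (PFDavg < 0.01).
print(pfd_out_of_service(48.0))
```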

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures.


The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries — Reliability modelling and calculation of safety systems', ISO/TR 12489:2013

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990


Guidance from ISO/TR 12489
Random:
• Hardware – electronic components: constant λ
• Hardware – mechanical components: non-constant λ (age and wear related failures in the mid-life period)
• Human – operating under stress, non-routine: variable λ
Systematic – cannot be quantified by a fixed rate:
• Hardware – specification, design, installation, operation
• Software – specification, coding, testing, modification
• Human – depending on training, understanding, attitude

Purely random failure
Only 'catalectic' failures have constant failure rates.
ISO/TR 12489 §3.2.9 catalectic failure: sudden and complete failure
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure
'failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure.
Cannot be characterised by a fixed constant failure rate, though the rates are measurable and may be predictable.

Random or systematic failures
Pressure transmitters:

• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random.

Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:

• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed.

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)

The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T the average is simply PFDavg ≈ λDU·T/2

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.
With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.

Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate
Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically, failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: a reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum
Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum
Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum
These are order of magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).

Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.
Typically β ≈ 12%–15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures
Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate
Continuous process: T ≈ 1 year
LNG production: T > 4 years
‒ Low PFD is more difficult to achieve
Production constraints may make it difficult to reduce T.

What architecture is typically needed
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.
SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).
SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design
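To make this comparison concrete, the sketch below evaluates the simplified 1oo1 and 1oo2 approximations used earlier and maps the results onto the low-demand SIL bands; the β, λDU and T values are the typical figures assumed above, and the code is an illustration rather than a verification tool.

```python
def pfd_1oo1(lambda_du: float, T: float) -> float:
    return lambda_du * T / 2.0

def pfd_1oo2(lambda_du: float, T: float, beta: float) -> float:
    return (1.0 - beta) * (lambda_du * T) ** 2 / 3.0 + beta * lambda_du * T / 2.0

def sil_band(pfd: float) -> str:
    """Low-demand SIL band for an average PFD (order-of-magnitude bands)."""
    if pfd < 1e-4:
        return "SIL 4"
    if pfd < 1e-3:
        return "SIL 3"
    if pfd < 1e-2:
        return "SIL 2"
    if pfd < 1e-1:
        return "SIL 1"
    return "below SIL 1"

lam, T, beta = 0.02, 1.0, 0.10
print(sil_band(pfd_1oo1(lam, T)))        # PFD = 0.01: only just misses the SIL 2 band
print(sil_band(pfd_1oo2(lam, T, beta)))  # PFD ≈ 0.0011: SIL 2, short of SIL 3
```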

Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.
The values assumed in design set performance standards for availability and reliability in operation and maintenance.
Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate the actual λDU and compare with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures
Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair
Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable
(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions
Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility
Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets
Performance targets must be feasible by design.


PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random.


Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISOTR 12489 uses the term lsquocatalectic failurersquo to clarify the types of failures that can be expected to occur at fixed rates

329 catalectic failure sudden and complete failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 3

Note 1 to entry This term has been introduced by analogy with the catalectic verses (ie a verse with seven foots instead of eight) which stop abruptly Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item It is the contrary of failures occurring progressively and incompletely

Note 2 to entry Catalectic failures characterize simple components with constant failure rates (exponential law) they remain permanently ldquoas good as newrdquo until they fail suddenly completely and without warning Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach)

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions The required level of redundancy increases with the risk reduction required

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards

IEC 61508 provides two strategies for minimising the required hardware fault tolerance

bull Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (lsquoRoute 1Hrsquo)

bull Increasing the confidence level in the measured failure rates to at least 90 (lsquoRoute 2Hrsquo)

A confidence level of 90 effectively means that there is only a 10 chance that the true average failure rate is greater than the estimated value

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H though it requires only a confidence level of 70 However IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment

Failure rate confidence level

If all of the failures for a given type of equipment are recorded the failure rate λ can be estimated with any required level of confidence by applying a χsup2 (chi-squared) distribution The failure rate estimated with confidence level of lsquoarsquo is designated λa The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 4

The ratio of λ90 to λ70 depends only on the number of failures recorded It does not depend directly on the failure rate itself or on the population size The width of the uncertainty band becomes narrower with each recorded failure A good estimate for λ can be obtained with as few as 3 failures If 10 or more failures have been recorded the overall confidence is increased λ90 will be no more than about 20 higher than λ70

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure

The tables also show the upper and lower limits of a 90 uncertainty interval for the reported failure rates This is the band stretching from the 5 certainty level to the 95 certainty level The certainty level is not the same as the confidence level relating to a single dataset but the intent is similar

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications Some users consistently achieve failure rates at least 10 times lower than other users The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design operation and maintenance

One reason for the variability in rates is that these datasets include all failures systematic failures as well as random failures

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure The definitions for random failure vary between the different standards and references but they are generally consistent with the dictionary definitions of the word lsquorandomrsquo

lsquoMade done or happening without method or conscious decision haphazardrsquo

ISOTR 124892013[5] Annex B explains that both hardware failures and human failures can occur at random It makes it clear that not all random failures occur at a constant rate Constant failure rates are typical in electronic components before they reach the end of their useful life Random failures of mechanical components are caused by deterioration due to age and wear but the failure rates are not constant

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 5

The definitions of random failure in ISA TR840002-2015[6] IEC 61508-42010[3] and IEC 61511-12016[1] are all similar

lsquohellipfailure occurring at a random time which results from one or more of the possible degradation mechanisms in the hardwarersquo

The mathematical analysis of failure probability is based on the concept of a lsquorandom processrsquo or lsquostochastic processrsquo In this context the usage of the word random is narrower For the mathematical analysis these standards all assume that

lsquofailure rates arising from random hardware failures can be predicted with reasonable accuracyrsquo

Failures that occur at a fixed constant rate are purely random but in practice only a small proportion of random failures are purely random The following definition of a purely random process is from lsquoA Dictionary of Statistical Termsrsquo by FHC Marriott[13]

lsquoThe simplest example of a stationary process where in discrete time all the random variables z are mutually independent In continuous time the process is sometimes referred to as ldquowhite noiserdquo relating to the energy characteristics of certain physical phenomenarsquo

The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random

By contrast the standards note that systematic failures cannot be characterised by a fixed rate though to some extent the probability of systematic faults existing in a system may be estimated The standards are reasonably consistent in their definition of systematic failure

lsquohellipfailure related in a deterministic way to a certain cause or pre-existing fault Systematic failures can be eliminated after being detected while random hardware failures cannotrsquo

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random Very few failures are purely random

Failures of mechanical components may seem to be random but the causes are usually well known and understood and are partially deterministic The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed

Degradation of mechanical components is not a purely random process Degradation can be monitored

While it might be theoretically possible to prevent failure from degradation it is simply not practicable to prevent all failures Inspection and maintenance can never be perfect It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance overhaul and renewal They are characterised by a constant failure rate even though that rate is not fixed

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 6

The reasons for the wide variability in reported failure rates seem to be clear

bull Only a small proportion of failures occur at a fixed rate that cannot be changed

bull The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

bull The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

bull The rates of systematic failures vary widely from user to user depending on the effectiveness of the quality management practices

bull No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided the common cause failures of redundant devices are modelled assuming a common cause factor β This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time

Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors With fault tolerance in final elements the PFD is strongly dominated by common cause failures

As an example consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 002 failures per annum (approximately 1 failure in 50 years or around 23 failures per 106 hours) [8] [11] [12] The average PFD of the valve subsystem may be approximated by

119875119875119875119875119875119875119860119860119860119860119860119860 asymp (1 minus 120573120573)(120582120582119863119863119863119863119879119879)2

3+ 120573120573

1205821205821198631198631198631198631198791198792

The last term in this equation represents the contribution of common cause failures It will dominate the PFD unless the following is true

120573120573 ≪ 120582120582119863119863119863119863119879119879

If the test interval T is 1 year and λDU = 002 pa then the common cause failure term will be greater than the first term unless β lt 2 and such a low value is difficult to achieve

For the common cause failure term to be negligible the test interval T would usually have to be significantly longer than 1 year andor the β-factor would have to be much less than 2

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10 Typical values achieved are in the range 12 to 15

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7

Clearly the PFD for undetected failures is directly proportional to β λDU and T

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George EP Box

lsquoEssentially all models are wrong but some are usefulrsquo

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β Similarly we can conclude that although the models used for calculating PFD are wrong they are still useful

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant and the band of uncertainty spans more than one order of magnitude The statistics suggest that the failure rates of sensors are also not constant

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant It is misleading to report calculated PFD with more than one significant figure of precision

Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
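
A quick numerical check of the 1oo1 case makes the point. The sketch below uses the simple approximation PFDavg ≈ λDU T/2 with the typical valve failure rate quoted above; the candidate test intervals are illustrative assumptions.

    lam_du = 0.02                    # dangerous undetected failures per annum (typical valve)
    for T in (1.0, 0.5, 0.25):       # candidate proof test intervals, years
        pfd_avg = lam_du * T / 2     # 1oo1 approximation
        print(T, pfd_avg)
    # T = 1.0  -> 0.010, right at the SIL 1 / SIL 2 boundary: marginal
    # T = 0.25 -> 0.0025, comfortably SIL 2, if such frequent testing is practical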

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and they are all under the control of the designers and the operations and maintenance team.
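
The sketch below shows how such targets might be derived for a SIL 3 function with 1oo2 valves. It re-uses the 1oo2 approximation from earlier in this paper; the assumed values (β reduced to 5%, T = 1 year, minimum RRF of 1000) are illustrative, not prescriptive.

    def pfd_1oo2(beta, lam_du, T):
        # 1oo2 approximation used earlier in this paper
        return (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

    print(pfd_1oo2(0.10, 0.02, 1.0))   # ~1.1e-3: misses the minimum SIL 3 RRF of 1000
    print(pfd_1oo2(0.05, 0.02, 1.0))   # ~6.3e-4: feasible once beta is reduced to 5%

    # Rearranging the dominant (common cause) term gives the operational benchmark for lam_du:
    beta, T, pfd_target = 0.05, 1.0, 1e-3
    lam_du_benchmark = 2 * pfd_target / (beta * T)   # ~0.04 pa as an upper bound in operation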

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
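
One simple way to follow up such targets is to compare the number of failures actually recorded against the number expected from the benchmark rate. The sketch below does this with a Poisson model; the fleet size, observation period and counts are invented purely for illustration, and it assumes scipy is available.

    from scipy.stats import poisson

    lam_target   = 0.02        # benchmark dangerous undetected failure rate, per device-year
    device_years = 40 * 3      # e.g. 40 valves observed for 3 years (illustrative)
    observed     = 5           # dangerous undetected failures actually recorded

    expected = lam_target * device_years            # ~2.4 failures expected
    p_at_least = poisson.sf(observed - 1, expected) # chance of >= 5 failures if the target held
    print(expected, p_at_least)                     # ~0.10: the excess warrants investigation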


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990


Purely random failure
Only 'catalectic' failures have constant failure rates
ISO/TR 12489 §3.2.9: catalectic failure – sudden and complete failure
Note 1 to entry: […] a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.
Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times

Not limited to 'catalectic' failure
Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable

Random or systematic failures
Pressure transmitters:
• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)
Most failures are partially systematic and partially random

Quasi-random hardware failures
Most hardware failures are not purely random
The failure causes are well known and understood
But many failures cannot be prevented in practice
• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially
e.g. 10% chance that a device picked at random has failed

PFD(t) = ∫₀ᵗ λDU e^(−λDU τ) dτ = 1 − e^(−λDU t)

[Chart: PFD plotted against λt]

The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially

SIFs always require PFDavg < 0.1; in this region the accumulation is linear

PFD ≈ λDU t

With a proof test interval T, the average is simply PFDavg ≈ λDU T / 2

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2

β is the fraction of failures that share a common cause

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)(λDU T)²/3 + β λDU T/2

β is the fraction of failures that share a common cause
Common cause failure always strongly dominates PFD
The PFD depends equally on λDU, β and T

Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements
The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain
Failure rates can be controlled and reduced

Managing and reducing failures
OREDA reports failure rates that are feasible in practice
Typically failure rates vary over at least an order of magnitude
The scope for reduction in failure is clear
Reduction in rate by a factor of 10 may be feasible

Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum

Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum

Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum

These are order of magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude)

Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%

Typically β ≈ 12% - 15% in practice (Ref SINTEF Report A26922)
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T

What architecture is typically needed
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT

SIL 2 needs 1oo2 final elements unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y)

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa
• either λDU, β and/or T must be minimised
• needs close attention in design

Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

Monitoring performance is essential
Operators must monitor the failure rates and modes
• Calculate actual λDU, compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design

Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure

Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
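
As a minimal sketch of this assumption, the Python snippet below evaluates the exponential accumulation of undetected failures for a single component and the average PFD over a proof test interval; the rate and interval are illustrative values only.

    import math

    lam_du = 0.02    # dangerous undetected failures per year (illustrative)
    T = 1.0          # proof test interval, years

    def pfd(t, lam=lam_du):
        # probability that an undetected dangerous failure has accumulated by time t
        return 1.0 - math.exp(-lam * t)

    # exact average over the test interval; for lam*T << 1 this approaches lam*T/2
    pfd_avg = 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)
    print(pfd(T), pfd_avg, lam_du * T / 2)   # ~0.0198, ~0.00993, 0.01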

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to an increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
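
A minimal sketch of this estimate is shown below, using one common chi-squared formulation for the upper confidence bound on a constant failure rate (n failures observed over a total time in service). It assumes scipy is available; the failure count and service hours are invented for illustration.

    from scipy.stats import chi2

    def rate_upper_bound(failures, time_in_service, confidence):
        # one-sided upper confidence bound for a constant failure rate
        return chi2.ppf(confidence, 2 * (failures + 1)) / (2.0 * time_in_service)

    hours = 2.6e6      # total device-hours in service (illustrative)
    lam70 = rate_upper_bound(10, hours, 0.70)
    lam90 = rate_upper_bound(10, hours, 0.90)
    print(lam70, lam90, lam90 / lam70)   # the ratio depends only on the failure count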

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general, the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard.'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis, these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.


[7] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Method Handbookrsquo SINTEF Trondheim Norway Report A24442 2013

[8] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Data Handbookrsquo SINTEF Trondheim Norway Report A24443 2013

[9] S Hauge and M A Lundteigen lsquoGuidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phasersquo SINTEF Trondheim Norway Report A8788 2008

[10] S Hauge et al lsquoCommon Cause Failures in Safety Instrumented Systemsrsquo SINTEF Trondheim Norway Report A26922 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook Vol 1 6th ed SINTEF Technology and Society Department of Safety Research Trondheim Norway 2015

[12] Safety Equipment Reliability Handbook 4th ed exidacom LLC Sellersville PA 2015

[13] FHC Marriott lsquoA Dictionary of Statistical Termsrsquo 5th Edition International Statistical Institute Longman Scientific and Technical 1990


Definition in IEC 61511 and IEC 61508
IEC 61511 §3.2.59 and IEC 61508-4 §3.6.5: random hardware failure – 'failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'.
Note 1 to entry: There are many degradation mechanisms occurring at different rates in different components, and since manufacturing tolerances cause components to fail due to these mechanisms after different times in operation, failures of a total equipment comprising many components occur at predictable rates but at unpredictable (i.e. random) times.

Not limited to 'catalectic' failure. Cannot be characterised by a fixed constant failure rate – though the rates are measurable and may be predictable.

Random or systematic failures? Pressure transmitters:

• Blocked tubing
• Corroded diaphragm
• Sudden electronic component failure
• Calibration drift due to vibration
• Overheated transducer
• Tubing leak
• Isolation valve closed
• High impedance joint
• Water ingress, partial short circuit
• Supply voltage outside limits
• Age or wear related deterioration (rate not constant)

Most failures are partially systematic and partially random.

Quasi-random hardware failures
Most hardware failures are not purely random. The failure causes are well known and understood, but many failures cannot be prevented in practice:

• due to lack of maintenance resources and access
• treated as quasi-random

Failure rates can be measured:
• may be reasonably constant for a given operator
• but a wide variation between different operators
• not a fixed constant rate
• can be reduced by deliberate effort

Failure probability depends on failure rate
If undetected device failures occur at a constant average rate, then failures accumulate exponentially, e.g. a 10% chance that a device picked at random has failed:

PFD(t) = ∫₀ᵗ λDU e^(−λDU·τ) dτ = 1 − e^(−λDU·t)

(The slide plots PFD against λt.)

The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially. SIFs always require PFDavg < 0.1; in this region the accumulation is approximately linear:

PFD ≈ λDU·t

With a proof test interval T the average is simply PFDavg ≈ λDU·T / 2.
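As a quick sanity check (a worked example, not on the slide, using the typical values quoted later in the deck: λDU ≈ 0.02 pa for an actuated valve and T = 1 y): PFDavg ≈ 0.02 × 1 / 2 = 0.01, i.e. RRF = 1/PFDavg ≈ 100, which sits right at the SIL 1 / SIL 2 boundary for a single valve.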

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements. With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause. Common cause failure always strongly dominates the PFD. The PFD depends equally on λDU, β and T.
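For illustration (values assumed from the typical figures used elsewhere in the deck: β = 10%, λDU = 0.02 pa, T = 1 y), the two terms evaluate to (1 − 0.1)·(0.02 × 1)²/3 ≈ 1.2 × 10⁻⁴ and 0.1 × 0.02 × 1 / 2 = 1.0 × 10⁻³, so the common cause term is roughly ten times larger and PFDavg ≈ 1.1 × 10⁻³.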

Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:

• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum.

These are order-of-magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be set in orders of magnitude).

Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% to 15% in practice (ref. SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application
Batch process: T < 0.1 years – results in low PFD; final elements might not dominate.

Continuous process: T ≈ 1 year.

LNG production: T > 4 years – low PFD is more difficult to achieve.

Production constraints may make it difficult to reduce T.

What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design

Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate the actual λDU and compare it with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment – preventing systematic failures
• Volume of operating experience provides evidence that:
  – systematic capability is adequate
  – inspection and maintenance is effective
  – failures are monitored, analysed and understood
  – failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3.)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.


PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random.


Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently 'as good as new' until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
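As an illustration of this assumption (a sketch, not part of the paper; the failure rate and proof test interval below are assumed values), the exponential accumulation and its average over a proof test interval can be evaluated directly in Python:

import math

lam_du = 0.02   # assumed dangerous undetected failure rate, per year
T = 1.0         # assumed proof test interval, years

def pfd(t, lam):
    # Probability that an unrevealed dangerous failure has occurred by time t
    return 1.0 - math.exp(-lam * t)

# Exact time-average of PFD(t) over one proof test interval [0, T]
pfd_avg_exact = 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)

# Linear approximation, valid while lambda*T is small (PFDavg well below 0.1)
pfd_avg_linear = lam_du * T / 2.0

print(pfd(T, lam_du), pfd_avg_exact, pfd_avg_linear)

With these values the exact average (about 0.0099) and the λDU·T/2 approximation (0.01) agree to within about 1%.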

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
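A minimal sketch of that estimate (not from the paper): it assumes the time-terminated form of the χ² bound, with 2(n + 1) degrees of freedom, in the style of IEC 60605-6 [2]; the cumulative service time below is an invented figure for illustration.

from scipy.stats import chi2

def lambda_upper(n_failures, cumulative_hours, confidence):
    # One-sided upper confidence bound on a constant failure rate:
    # lambda_a = chi2_ppf(a, 2*(n + 1)) / (2 * T_total)
    dof = 2 * (n_failures + 1)
    return chi2.ppf(confidence, dof) / (2.0 * cumulative_hours)

T_total = 2.6e6   # cumulative device-hours in service (illustrative)
for n in (3, 10):
    l70 = lambda_upper(n, T_total, 0.70)
    l90 = lambda_upper(n, T_total, 0.90)
    print(n, f"lambda70={l70:.2e}/h", f"lambda90={l90:.2e}/h", f"ratio={l90 / l70:.2f}")

Because the cumulative service time cancels in the ratio, the last value printed depends only on the number of failures, which is the point made above.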

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard.'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and they are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are then characterised by a constant failure rate, even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices

• No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause, which will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in the final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
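A short sketch (not from the paper) that evaluates the two terms of the approximation above and solves for the break-even β at which the common cause term starts to dominate; it uses the illustrative values quoted in the text.

lam_du = 0.02   # dangerous undetected failure rate, per year (illustrative)
T = 1.0         # proof test interval, years

def pfd_1oo2_terms(beta, lam, T):
    # Returns (independent term, common cause term) of the 1oo2 approximation
    independent = (1.0 - beta) * (lam * T) ** 2 / 3.0
    common_cause = beta * lam * T / 2.0
    return independent, common_cause

for beta in (0.02, 0.10, 0.15):
    ind, ccf = pfd_1oo2_terms(beta, lam_du, T)
    print(f"beta={beta:.0%}: independent={ind:.1e}, common cause={ccf:.1e}, PFDavg~{ind + ccf:.1e}")

# Setting the two terms equal gives beta = 2*lam*T / (3 + 2*lam*T)
beta_breakeven = 2 * lam_du * T / (3 + 2 * lam_du * T)
print(f"common cause term dominates for beta above about {beta_breakeven:.1%}")

With λDU = 0.02 pa and T = 1 y the break-even value is about 1.3%, consistent with the observation that the common cause term dominates unless β is below roughly 2%.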


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E. P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T.


If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
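To make that kind of feasibility screening concrete, here is a brief sketch (an assumption-laden example, not the paper's method): it applies the simplified formulas above and the usual low-demand PFD bands, and it deliberately ignores the separate hardware fault tolerance requirements discussed earlier.

def pfd_1oo1(lam_du, T):
    return lam_du * T / 2.0

def pfd_1oo2(lam_du, T, beta):
    return (1.0 - beta) * (lam_du * T) ** 2 / 3.0 + beta * lam_du * T / 2.0

def sil_band(pfd):
    # Low-demand SIL bands by average PFD (anything below 1e-4 reported as 'SIL 4')
    if pfd < 1e-4:
        return "SIL 4"
    if pfd < 1e-3:
        return "SIL 3"
    if pfd < 1e-2:
        return "SIL 2"
    if pfd < 1e-1:
        return "SIL 1"
    return "below SIL 1"

lam_du, T, beta = 0.02, 1.0, 0.10   # typical values discussed in the text
for name, value in (("1oo1", pfd_1oo1(lam_du, T)), ("1oo2", pfd_1oo2(lam_du, T, beta))):
    print(name, f"PFDavg={value:.1e}", sil_band(value))

With these inputs the single valve lands right on the SIL 1 / SIL 2 boundary and the 1oo2 pair sits in the SIL 2 band, matching the qualitative conclusions above.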

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
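One simple way to set such a target (a sketch under assumptions, not the method given in the report): treat the failure rate assumed in the design as a Poisson mean and flag the observed count when it would be improbable at that rate. The population, period and counts below are invented for illustration.

from scipy.stats import poisson

lam_du_design = 0.02   # dangerous undetected failure rate assumed in design, per device-year
n_devices = 40         # illustrative population of similar valves
years = 3.0            # observation period

expected = lam_du_design * n_devices * years   # expected number of failures (= 2.4 here)
observed = 6                                   # failures actually recorded (illustrative)

# Probability of seeing at least this many failures if the design rate were being met
p_value = poisson.sf(observed - 1, expected)
print(f"expected {expected:.1f}, observed {observed}, P(>= observed) = {p_value:.3f}")
if p_value < 0.05:
    print("Observed failures exceed the design benchmark - investigate and take remedial action")

Any such threshold is a judgement call; the point is only that the design assumptions give an explicit, quantitative benchmark to compare against.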


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise the initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both λDU and MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures.


The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and they must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions: Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990


Random or systematic failuresPressure transmitters

bull Blocked tubingbull Corroded diaphragmbull Sudden electronic component failurebull Calibration drift due to vibrationbull Overheated transducerbull Tubing leakbull Isolation valve closedbull High impedance jointbull Water ingress partial short circuitbull Supply voltage outside limitsbull Age or wear related deterioration (rate not constant)Most failures are partially systematic and partially random

Quasi-random hardware failuresMost hardware failures are not purely randomThe failure causes are well known and understoodBut many failures cannot be prevented in practice

bull due to lack of maintenance resources and accessbull treated as quasi-random

Failure rates can be measuredbull may be reasonably constant for a given operatorbull but a wide variation between different operatorsbull not a fixed constant ratebull can be reduced by deliberate effort

Failure probability depends on failure rateIf undetected device failures occur at a constant average rate then failures accumulate exponentially

eg 10 chance that a device picked at random has failed

119875119875119875119875119875119875 119905119905 = 0

119905119905λDU 119890119890minusλ119863119863119863119863120591120591119889119889119889119889 = 1 minus 119890119890minusλ119863119863119863119863119905119905

λt

The basic assumption in failure probabilityIf undetected device failures occur at a constant average rate then failures accumulate exponentially

SIFs always require PFDavg lt 01 in this region the accumulation is linear

PFD asymp λDUt

With a proof test interval T the average is simplyPFDavg asymp λDUT 2

PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common cause120573120573120582120582119863119863119863119863119879119879

2

PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common causeCommon cause failure always strongly dominates PFDThe PFD depends equally on 120582120582119863119863119863119863 120573120573 and 119879119879

120573120573120582120582119863119863119863119863119879119879

2

Dependence on failure rateOur estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elementsThe failure rate

bull can be measuredbull may be predictable bull is not a fixed constant rate

Typical failure rates are easy to obtainFailure rates can be controlled and reduced

Managing and reducing failuresOREDA reports failure rates that are feasible in practiceTypically failure rates vary over at least an order of magnitudeThe scope for reduction in failure is clearReduction in rate by a factor of 10 may be feasible

Typical feasible values for 120582120582119863119863119863119863Sensors typically have MTBF asymp 300 years120582120582119863119863119863119863 asymp 0003 per annum

Actuated valve assemblies typically have MTBF asymp 50 years120582120582119863119863119863119863 asymp 002 per annum

Contactors or relays typically have MTBF asymp 200 years120582120582119863119863119863119863 asymp 0005 per annum

These are order of magnitude estimates ndashbut are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude)

Typical feasible values for 120573120573IEC 61508-6 Annex D suggests a range of 1 to 10

Typically 120573120573 asymp 12 - 15 in practice (Ref SINTEF Report A26922)

bull difficult to reduce below 5 with similar devicesbull strongly dominates PFD of voted architectures

Minimise 120573120573 through bull independence and diversity in design and maintenance bull preventing systematic failures

Typical values for 119879119879 depend on application Batch process 119879119879 lt 01 years

‒ Results in low PFD final elements might not dominate

Continuous process 119879119879 asymp 1 year

LNG production 119879119879 gt 4 years‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce 119879119879

What architecture is typically neededFor typical values 120573120573 asymp 10 120582120582119863119863119863119863 asymp 002 pa 119879119879 asymp 1 ySIL 1 is always easy to achieve without HFT

SIL 2 needs 1oo2 final elements unless overall 120582120582119863119863119863119863 can be reduced to lt 002 pa (MTBFDU gt 50 y)

SIL 3 would need 1oo2 final elements andoverall 120582120582119863119863119863119863 lt 002 pa

bull either 120582120582119863119863119863119863 120573120573 andor 119879119879 must be minimisedbull needs close attention in design

Design assumptions set performance benchmarksProbability of SIF failure on demand is proportional to

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

120573120573 120582120582119863119863119863119863 119879119879 and MTTR

Monitoring performance is essential
Operators must monitor the failure rates and modes:

• Calculate actual λDU and compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:

• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements

• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design

Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 ‘Route 2H’ method, which relies on increased confidence levels in failure rates.

Instead of requiring increased confidence level, IEC 61511-1 requires ‘credible and traceable reliability data’ in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility’s maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than

purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term ‘catalectic failure’ to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure

Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently “as good as new” until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
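
For clarity, the standard derivation behind this accumulation (added here as a worked equation, consistent with the constant-rate assumption above) can be written as:

```latex
% Constant undetected dangerous failure rate \lambda_{DU}; proof-test interval T
PFD(t) = 1 - e^{-\lambda_{DU} t} \approx \lambda_{DU}\, t \qquad (\lambda_{DU} t \ll 1)

PFD_{avg} = \frac{1}{T}\int_{0}^{T}\left(1 - e^{-\lambda_{DU} t}\right) dt
          = 1 - \frac{1 - e^{-\lambda_{DU} T}}{\lambda_{DU} T}
          \approx \frac{\lambda_{DU} T}{2}
```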

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (‘Route 1H’)

• Increasing the confidence level in the measured failure rates to at least 90% (‘Route 2H’)

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of ‘a’ is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.

The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
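
As an illustrative sketch of how λ70 and λ90 can be estimated from recorded failures, the common chi-squared upper-bound formula for time-truncated data can be applied as follows (scipy and the example observation time are assumptions, not taken from the paper):

```python
# Sketch: one-sided upper confidence limits for a constant failure rate,
# estimated from n recorded failures over a total time in service T
# (time-truncated data): lambda_a = chi2.ppf(a, 2n + 2) / (2 * T)
from scipy.stats import chi2

def lambda_upper(n_failures, total_time_years, confidence):
    return chi2.ppf(confidence, 2 * n_failures + 2) / (2 * total_time_years)

for n in (3, 10):
    T = 500.0                      # e.g. 100 valves observed for 5 years
    l70 = lambda_upper(n, T, 0.70)
    l90 = lambda_upper(n, T, 0.90)
    print(f"n = {n:2d}: lambda70 = {l70:.4f}/yr, lambda90 = {l90:.4f}/yr, "
          f"ratio = {l90 / l70:.2f}")

# The ratio lambda90/lambda70 depends only on n: roughly 1.4 with 3 recorded
# failures, falling to about 1.2 once 10 or more failures have been recorded.
```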

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word ‘random’:

‘Made, done or happening without method or conscious decision; haphazard’

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.

The definitions of random failure in ISA TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

‘…failure occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware’

The mathematical analysis of failure probability is based on the concept of a ‘random process’ or ‘stochastic process’. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

‘failure rates arising from random hardware failures can be predicted with reasonable accuracy’

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from ‘A Dictionary of Statistical Terms’ by F.H.C. Marriott [13]:

‘The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as “white noise”, relating to the energy characteristics of certain physical phenomena’

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

‘…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot’

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.

The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed

• The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices

• No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
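
The dominance of the common cause term is easy to check numerically. The following sketch uses the example values above; the range of β values is assumed here purely for illustration:

```python
# Sketch: compare the independent and common-cause terms of the 1oo2 PFD
# approximation for the example values above (lambda_DU = 0.02 pa, T = 1 y).
lam_du, T = 0.02, 1.0

for beta in (0.01, 0.02, 0.05, 0.10):
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    print(f"beta = {beta:4.0%}: independent = {independent:.1e}, "
          f"common cause = {common_cause:.1e}")

# The common-cause term overtakes the independent term once beta exceeds
# roughly (2/3) * lambda_DU * T (about 1.3% here), so with realistic beta
# values of 10% or more it dominates the PFD.
```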

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.

Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

‘Essentially, all models are wrong, but some are useful’

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
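
For reference, the low-demand SIL bands of IEC 61511 can be expressed as order-of-magnitude ranges of PFDavg. The helper below is an illustrative sketch of that categorisation (the function itself is not from the paper):

```python
import math

def sil_band(pfd_avg):
    # IEC 61511 low-demand bands: SIL n corresponds to 10^-(n+1) <= PFDavg < 10^-n
    if not 1e-5 <= pfd_avg < 1e-1:
        return "outside SIL 1-4 range"
    return f"SIL {-math.floor(math.log10(pfd_avg)) - 1}"

print(sil_band(1.1e-3))   # SIL 2 (RRF around 900)
print(sil_band(9.0e-4))   # SIL 3 (just inside the SIL 3 band)
```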

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval

between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, ‘Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase’ [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
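
One simple way to turn the design assumption into an operational benchmark is to compare the observed failure count with the count expected from the assumed λDU and the accumulated service time. The sketch below is illustrative only: it is not the SINTEF procedure, and the example numbers and the scipy dependency are assumptions:

```python
# Sketch: compare observed dangerous-undetected failures against the number
# expected from the failure rate assumed in the PFD calculation.
from scipy.stats import poisson

def benchmark_check(lam_du_assumed, n_devices, years, observed_failures):
    expected = lam_du_assumed * n_devices * years          # expected count
    # Probability of seeing at least this many failures if the assumption holds
    p_value = poisson.sf(observed_failures - 1, expected)
    return expected, p_value

expected, p = benchmark_check(lam_du_assumed=0.02, n_devices=40, years=3,
                              observed_failures=6)
print(f"expected = {expected:.1f}, observed = 6, P(>=6 | assumption) = {p:.2f}")
# A low probability (e.g. < 0.05) suggests the assumed rate is being exceeded
# and the causes of the failures need to be analysed.
```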

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
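
To put the out-of-service time in perspective, a small illustrative calculation (example values assumed here, not taken from the paper) shows how quickly bypass hours consume a SIL 2 PFD budget:

```python
# Sketch: PFD contribution of time spent bypassed or under repair.
# Example values only: a final element unavailable for 48 hours in a year.
hours_out_of_service = 48.0
hours_per_year = 8760.0

pfd_unavailability = hours_out_of_service / hours_per_year
print(f"PFD contribution from downtime alone: {pfd_unavailability:.4f}")
# About 0.0055: more than half of the 0.01 PFD budget for a SIL 2 function,
# before any random or common cause failures are counted.
```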

Preventing systematic failures

Systematic failures cannot be characterised by failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that

appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing ‘random’ failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.

Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.

References

[1] ‘Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements’, IEC 61511-1:2016

[2] ‘Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity’, IEC 60605-6:2007

[3] ‘Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations’, IEC 61508-4:2010

[4] ‘Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3’, IEC 61508-6:2010

[5] ‘Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems’, ISO/TR 12489

[6] ‘Safety Integrity Level (SIL) Verification of Safety Instrumented Functions’, ISA-TR84.00.02-2015

[7] ‘Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook’, SINTEF, Trondheim, Norway, Report A24442, 2013

[8] ‘Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook’, SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, ‘Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase’, SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., ‘Common Cause Failures in Safety Instrumented Systems’, SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F. H. C. Marriott, ‘A Dictionary of Statistical Terms’, 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990

Quasi-random hardware failuresMost hardware failures are not purely randomThe failure causes are well known and understoodBut many failures cannot be prevented in practice

bull due to lack of maintenance resources and accessbull treated as quasi-random

Failure rates can be measuredbull may be reasonably constant for a given operatorbull but a wide variation between different operatorsbull not a fixed constant ratebull can be reduced by deliberate effort

Failure probability depends on failure rateIf undetected device failures occur at a constant average rate then failures accumulate exponentially

eg 10 chance that a device picked at random has failed

119875119875119875119875119875119875 119905119905 = 0

119905119905λDU 119890119890minusλ119863119863119863119863120591120591119889119889119889119889 = 1 minus 119890119890minusλ119863119863119863119863119905119905

λt

The basic assumption in failure probabilityIf undetected device failures occur at a constant average rate then failures accumulate exponentially

SIFs always require PFDavg lt 01 in this region the accumulation is linear

PFD asymp λDUt

With a proof test interval T the average is simplyPFDavg asymp λDUT 2

PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common cause120573120573120582120582119863119863119863119863119879119879

2

PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common causeCommon cause failure always strongly dominates PFDThe PFD depends equally on 120582120582119863119863119863119863 120573120573 and 119879119879

120573120573120582120582119863119863119863119863119879119879

2

Dependence on failure rateOur estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elementsThe failure rate

bull can be measuredbull may be predictable bull is not a fixed constant rate

Typical failure rates are easy to obtainFailure rates can be controlled and reduced

Managing and reducing failuresOREDA reports failure rates that are feasible in practiceTypically failure rates vary over at least an order of magnitudeThe scope for reduction in failure is clearReduction in rate by a factor of 10 may be feasible

Typical feasible values for 120582120582119863119863119863119863Sensors typically have MTBF asymp 300 years120582120582119863119863119863119863 asymp 0003 per annum

Actuated valve assemblies typically have MTBF asymp 50 years120582120582119863119863119863119863 asymp 002 per annum

Contactors or relays typically have MTBF asymp 200 years120582120582119863119863119863119863 asymp 0005 per annum

These are order of magnitude estimates ndashbut are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude)

Typical feasible values for 120573120573IEC 61508-6 Annex D suggests a range of 1 to 10

Typically 120573120573 asymp 12 - 15 in practice (Ref SINTEF Report A26922)

bull difficult to reduce below 5 with similar devicesbull strongly dominates PFD of voted architectures

Minimise 120573120573 through bull independence and diversity in design and maintenance bull preventing systematic failures

Typical values for 119879119879 depend on application Batch process 119879119879 lt 01 years

‒ Results in low PFD final elements might not dominate

Continuous process 119879119879 asymp 1 year

LNG production 119879119879 gt 4 years‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce 119879119879

What architecture is typically neededFor typical values 120573120573 asymp 10 120582120582119863119863119863119863 asymp 002 pa 119879119879 asymp 1 ySIL 1 is always easy to achieve without HFT

SIL 2 needs 1oo2 final elements unless overall 120582120582119863119863119863119863 can be reduced to lt 002 pa (MTBFDU gt 50 y)

SIL 3 would need 1oo2 final elements andoverall 120582120582119863119863119863119863 lt 002 pa

bull either 120582120582119863119863119863119863 120573120573 andor 119879119879 must be minimisedbull needs close attention in design

Design assumptions set performance benchmarksProbability of SIF failure on demand is proportional to

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

120573120573 120582120582119863119863119863119863 119879119879 and MTTR

Monitoring performance is essentialOperators must monitor the failure rates and modes

bull Calculate actual 120582120582119863119863119863119863 compare with design assumptionbull Analyse discrepancies between expected and actual behaviourbull Root cause analysis prevent the preventable failures

Operators must also monitor MTTR against targetsbull Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performanceFailure rates can be reduced if maintainers have ready access to the equipment for

bull inspectionbull testingbull maintenance and repair

Accessibility and testability must be considered and specified as design requirements

bull additional costbull requirement depends on target RRF PFD and 120582120582119863119863119863119863bull may only be necessary for SIL 2 and SIL 3

Suitability of componentsCertification and prior use are also importantbull Evidence of quality ie systematic capabilitybull Evidence of suitability for service and environment

‒ preventing systematic failuresbull Volume of operating experience provides evidence that

‒ systematic capability is adequate‒ inspection and maintenance is effective‒ failures are monitored analysed and understood‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 sect1153 and IEC 61508-2 sect74433)

Emphasis on prevention instead of calculationPrevent systematic failures through quality

bull systematic capability proven in usebull deliberate quality planning how much quality is enoughbull must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce 120640120640119915119915119915119915bull avoid systematic failures and common cause failures by designbull enable deterioration to be detected (diagnostics inspection test)bull enable equipment to be repaired or renewed before it failsbull ensure testability and accessibility

Monitor and control failure performance in operationbull reduce failure rates that exceed benchmarks

39

Credible reliability data

Performance targets must be feasible by design

equiv feasible performance targets

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 1

PREVENT PREVENTABLE FAILURES

M Generowicz FS Senior Expert (TUumlV Rheinland 18312) Engineering Manager A Hertel Principal Engineer

IampE Systems Pty Ltd

Perth Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions The new requirements are based on the IEC 61508-2 lsquoRoute 2Hrsquo method which relies on increased confidence levels in failure rates

Instead of requiring increased confidence level IEC 61511-1 requires lsquocredible and traceable reliability datarsquo in the quantification of failure probability It is not immediately obvious what this means in practice

Failure probability calculations are based on the assumption that failure rates are fixed and constant That assumption is invalid but the calculations are still useful

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications The variation depends largely on the service operating environment and maintenance practices It is clear that

bull Failures are almost never purely random and as a result

bull Failure rates are not fixed and constant

At best the risk reduction achieved by these safety functions can be estimated only within an order of magnitude Nevertheless even with such imprecise results the calculations are still useful

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions This is important in setting operational reliability benchmarks

Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices

Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 2

purely random Therefore safety function reliability performance can be improved through four key strategies

1 Eliminating systematic and common-cause failures throughout the design development and implementation processes and throughout operation and maintenance practices

2 Designing the equipment to allow access to enable sufficiently frequent inspection testing and maintenance and to enable suitable test coverage

3 Using risk-based inspection and condition-based maintenance techniques to

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4 Detailed analysis of all failures to

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction Functional safety maintains safety integrity of assets in two ways

Systematic safety integrity deals with preventable failures These are failures resulting from errors and shortcomings in the design manufacture installation operation maintenance and modification of the safeguarding systems

Hardware safety integrity deals with controlling random hardware failures These are the failures that occur at a reasonably constant rate and are completely independent of each other They are not preventable and cannot be avoided or eliminated but the probability of these failures occurring (and the resulting risk reduction) can be calculated

Calculation methods

IEC 61508-62010 Annex B[4] provides basic guidance on evaluating probabilities of failure State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO 12489[5]

Several other useful references are available on this subject including ISA-TR840002-2015[6] and the SINTEF PDS Method Handbook[7] These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved The calculations are all based on these assumptions

bull Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

bull Failures occurring within a population of devices are assumed to be independent events

ISOTR 12489 uses the term lsquocatalectic failurersquo to clarify the types of failures that can be expected to occur at fixed rates

329 catalectic failure sudden and complete failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 3

Note 1 to entry This term has been introduced by analogy with the catalectic verses (ie a verse with seven foots instead of eight) which stop abruptly Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item It is the contrary of failures occurring progressively and incompletely

Note 2 to entry Catalectic failures characterize simple components with constant failure rates (exponential law) they remain permanently ldquoas good as newrdquo until they fail suddenly completely and without warning Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach)

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions The required level of redundancy increases with the risk reduction required

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards

IEC 61508 provides two strategies for minimising the required hardware fault tolerance

bull Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (lsquoRoute 1Hrsquo)

bull Increasing the confidence level in the measured failure rates to at least 90 (lsquoRoute 2Hrsquo)

A confidence level of 90 effectively means that there is only a 10 chance that the true average failure rate is greater than the estimated value

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H though it requires only a confidence level of 70 However IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment

Failure rate confidence level

If all of the failures for a given type of equipment are recorded the failure rate λ can be estimated with any required level of confidence by applying a χsup2 (chi-squared) distribution The failure rate estimated with confidence level of lsquoarsquo is designated λa The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 4

The ratio of λ90 to λ70 depends only on the number of failures recorded It does not depend directly on the failure rate itself or on the population size The width of the uncertainty band becomes narrower with each recorded failure A good estimate for λ can be obtained with as few as 3 failures If 10 or more failures have been recorded the overall confidence is increased λ90 will be no more than about 20 higher than λ70

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure

The tables also show the upper and lower limits of a 90 uncertainty interval for the reported failure rates This is the band stretching from the 5 certainty level to the 95 certainty level The certainty level is not the same as the confidence level relating to a single dataset but the intent is similar

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications Some users consistently achieve failure rates at least 10 times lower than other users The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design operation and maintenance

One reason for the variability in rates is that these datasets include all failures systematic failures as well as random failures

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.

The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random, to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.

The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
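
The relative size of the two terms is easy to check numerically. The short sketch below simply evaluates the 1oo2 approximation above with the example figures from the text (λDU = 0.02 per annum, T = 1 year, β = 10%); the function name is illustrative only.

```python
# Minimal sketch of the 1oo2 PFD approximation discussed above.
def pfd_avg_1oo2(lambda_du: float, test_interval_years: float, beta: float) -> float:
    independent = (1 - beta) * (lambda_du * test_interval_years) ** 2 / 3
    common_cause = beta * lambda_du * test_interval_years / 2
    return independent + common_cause

# With the example figures from the text, the common cause term (1.0e-3) is
# roughly ten times larger than the independent term (1.2e-4).
print(pfd_avg_1oo2(lambda_du=0.02, test_interval_years=1.0, beta=0.10))
```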

Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
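
Since only the order of magnitude is meaningful, the calculated PFD maps directly onto the low-demand SIL bands of IEC 61511. A helper such as the following (a hypothetical function, not part of any standard or library) makes that categorisation explicit; the band edges are the standard decades.

```python
# Sketch: map a calculated average PFD (low-demand mode) to the feasible SIL band.
def feasible_sil(pfd_avg: float) -> int:
    bands = [(1e-2, 1e-1, 1), (1e-3, 1e-2, 2), (1e-4, 1e-3, 3), (1e-5, 1e-4, 4)]
    for lower, upper, sil in bands:
        if lower <= pfd_avg < upper:
            return sil
    raise ValueError("PFDavg is outside the SIL 1 to SIL 4 low-demand bands")

print(feasible_sil(1.1e-3))   # 2: a risk reduction factor of several hundred is feasible
```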

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
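
One simple way to operationalise this comparison is to convert the benchmark failure rate into an expected number of failures for the installed population and elapsed time, then test whether the observed count is credibly consistent with it. The sketch below is an illustration only, assuming a constant benchmark rate and independent failures; the threshold and figures are hypothetical.

```python
# Sketch: compare observed failures against the design benchmark rate.
from scipy.stats import poisson

def exceeds_benchmark(observed: int, target_rate_per_year: float,
                      population: int, years_in_service: float,
                      alpha: float = 0.05) -> bool:
    """True if the observed failure count is improbably high for the benchmark."""
    expected = target_rate_per_year * population * years_in_service
    # Probability of seeing at least this many failures if the benchmark rate held
    p_value = poisson.sf(observed - 1, expected)
    return p_value < alpha

# Hypothetical example: 40 valves, 4 years in service, benchmark 0.02 pa, 7 failures
print(exceeds_benchmark(7, 0.02, 40, 4.0))   # True: analyse causes and take action
```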

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing lsquorandomrsquo failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.

Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.

References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990


Failure probability depends on failure rate

If undetected device failures occur at a constant average rate then failures accumulate exponentially.

e.g. 10% chance that a device picked at random has failed

PFD(t) = ∫₀ᵗ λDU·e^(−λDU·τ) dτ = 1 − e^(−λDU·t)

The basic assumption in failure probability

If undetected device failures occur at a constant average rate then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is linear:

PFD ≈ λDU·t

With a proof test interval T the average is simply PFDavg ≈ λDU·T/2

PFD for 1oo2 final elements

In the process sector SIF PFD is dominated by the PFD of final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share common cause. Common cause failure always strongly dominates PFD. The PFD depends equally on λDU, β and T.

Dependence on failure rate

Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:

• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures

OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failure is clear. Reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU

Sensors typically have MTBF ≈ 300 years, λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years, λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years, λDU ≈ 0.005 per annum.

These are order of magnitude estimates, but are sufficient to show feasibility of PFD and RRF targets (risk targets can only be in orders of magnitude).

Typical feasible values for β

IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% - 15% in practice (Ref SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application

Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed

For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements unless overall λDU can be reduced to < 0.02 pa (MTBF for DU failures > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design

Design assumptions set performance benchmarks

Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential

Operators must monitor the failure rates and modes:
• calculate actual λDU, compare with design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
‒ preventing systematic failures
• volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure

Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
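
The exponential accumulation, and the linear approximation that follows from it, can be checked with a few lines of arithmetic. The sketch below simply evaluates PFD(t) = 1 − e^(−λDU·t) and its average over a proof test interval, using the illustrative figures quoted elsewhere in the paper (λDU = 0.02 per annum, T = 1 year).

```python
# Sketch: PFD accumulation for a constant undetected dangerous failure rate.
import math

lambda_du = 0.02   # per annum (illustrative figure used elsewhere in the paper)
T = 1.0            # proof test interval in years

pfd_at_T = 1 - math.exp(-lambda_du * T)                         # just before the proof test
pfd_avg = 1 - (1 - math.exp(-lambda_du * T)) / (lambda_du * T)  # exact average over [0, T]

# For small lambda_du * T the familiar approximations hold:
print(pfd_at_T, lambda_du * T)       # ~0.0198 vs 0.02
print(pfd_avg, lambda_du * T / 2)    # ~0.0099 vs 0.01
```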

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90 effectively means that there is only a 10 chance that the true average failure rate is greater than the estimated value

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H though it requires only a confidence level of 70 However IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment

Failure rate confidence level

If all of the failures for a given type of equipment are recorded the failure rate λ can be estimated with any required level of confidence by applying a χsup2 (chi-squared) distribution The failure rate estimated with confidence level of lsquoarsquo is designated λa The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 4

The ratio of λ90 to λ70 depends only on the number of failures recorded It does not depend directly on the failure rate itself or on the population size The width of the uncertainty band becomes narrower with each recorded failure A good estimate for λ can be obtained with as few as 3 failures If 10 or more failures have been recorded the overall confidence is increased λ90 will be no more than about 20 higher than λ70

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure

The tables also show the upper and lower limits of a 90 uncertainty interval for the reported failure rates This is the band stretching from the 5 certainty level to the 95 certainty level The certainty level is not the same as the confidence level relating to a single dataset but the intent is similar

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications Some users consistently achieve failure rates at least 10 times lower than other users The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design operation and maintenance

One reason for the variability in rates is that these datasets include all failures systematic failures as well as random failures

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure The definitions for random failure vary between the different standards and references but they are generally consistent with the dictionary definitions of the word lsquorandomrsquo

lsquoMade done or happening without method or conscious decision haphazardrsquo

ISOTR 124892013[5] Annex B explains that both hardware failures and human failures can occur at random It makes it clear that not all random failures occur at a constant rate Constant failure rates are typical in electronic components before they reach the end of their useful life Random failures of mechanical components are caused by deterioration due to age and wear but the failure rates are not constant

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 5

The definitions of random failure in ISA TR840002-2015[6] IEC 61508-42010[3] and IEC 61511-12016[1] are all similar

lsquohellipfailure occurring at a random time which results from one or more of the possible degradation mechanisms in the hardwarersquo

The mathematical analysis of failure probability is based on the concept of a lsquorandom processrsquo or lsquostochastic processrsquo In this context the usage of the word random is narrower For the mathematical analysis these standards all assume that

lsquofailure rates arising from random hardware failures can be predicted with reasonable accuracyrsquo

Failures that occur at a fixed constant rate are purely random but in practice only a small proportion of random failures are purely random The following definition of a purely random process is from lsquoA Dictionary of Statistical Termsrsquo by FHC Marriott[13]

lsquoThe simplest example of a stationary process where in discrete time all the random variables z are mutually independent In continuous time the process is sometimes referred to as ldquowhite noiserdquo relating to the energy characteristics of certain physical phenomenarsquo

The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random

By contrast the standards note that systematic failures cannot be characterised by a fixed rate though to some extent the probability of systematic faults existing in a system may be estimated The standards are reasonably consistent in their definition of systematic failure

lsquohellipfailure related in a deterministic way to a certain cause or pre-existing fault Systematic failures can be eliminated after being detected while random hardware failures cannotrsquo

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random Very few failures are purely random

Failures of mechanical components may seem to be random but the causes are usually well known and understood and are partially deterministic The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed

Degradation of mechanical components is not a purely random process Degradation can be monitored

While it might be theoretically possible to prevent failure from degradation it is simply not practicable to prevent all failures Inspection and maintenance can never be perfect It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance overhaul and renewal They are characterised by a constant failure rate even though that rate is not fixed

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 6

The reasons for the wide variability in reported failure rates seem to be clear

bull Only a small proportion of failures occur at a fixed rate that cannot be changed

bull The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

bull The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

bull The rates of systematic failures vary widely from user to user depending on the effectiveness of the quality management practices

bull No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided the common cause failures of redundant devices are modelled assuming a common cause factor β This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time

Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors With fault tolerance in final elements the PFD is strongly dominated by common cause failures

As an example consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 002 failures per annum (approximately 1 failure in 50 years or around 23 failures per 106 hours) [8] [11] [12] The average PFD of the valve subsystem may be approximated by

119875119875119875119875119875119875119860119860119860119860119860119860 asymp (1 minus 120573120573)(120582120582119863119863119863119863119879119879)2

3+ 120573120573

1205821205821198631198631198631198631198791198792

The last term in this equation represents the contribution of common cause failures It will dominate the PFD unless the following is true

120573120573 ≪ 120582120582119863119863119863119863119879119879

If the test interval T is 1 year and λDU = 002 pa then the common cause failure term will be greater than the first term unless β lt 2 and such a low value is difficult to achieve

For the common cause failure term to be negligible the test interval T would usually have to be significantly longer than 1 year andor the β-factor would have to be much less than 2

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10 Typical values achieved are in the range 12 to 15

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7

Clearly the PFD for undetected failures is directly proportional to β λDU and T

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George EP Box

lsquoEssentially all models are wrong but some are usefulrsquo

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β Similarly we can conclude that although the models used for calculating PFD are wrong they are still useful

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant and the band of uncertainty spans more than one order of magnitude The statistics suggest that the failure rates of sensors are also not constant

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant It is misleading to report calculated PFD with more than one significant figure of precision

Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements

The example value of 002 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves For contactors or circuit breakers it is feasible to achieve failure rates lower than 001 pa

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (ie 1oo1 architecture) using either a valve or contactor

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture but the PFD may be marginal particularly if a valve is used as the final element If actuated valves are used attention will need to be given to minimising λDU Alternatively the PFD may be reduced by reducing the interval

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 8

between proof tests T If these parameters cannot be minimised it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance If the final elements are actuated valves then the β-factor andor λDU will need to be minimised to achieve even the minimum risk reduction of 1000

This will result in design requirements that improve independence (reducing β) and facilitate inspection testing and maintenance (reducing λDU and improving test coverage) If reducing β and λDU is not sufficient it may also be necessary to shorten the test interval T It is usually impractical to implement automatic continuous diagnostics on final elements

The end result of the PFD calculations is a set of operational performance targets for β λDU T and MTTR These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability Evidence is required to show that appropriate techniques and measure have been applied to prevent systematic failures This must include showing that the systems are suitable for the intended service and environment The level of effectiveness in these techniques and measures must be shown to be sufficient for the intended SIL

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design Evidence is also needed to show that operation inspection testing and maintenance of the systems will be effective in achieving the target failure performance

Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable

Measuring performance against benchmarks

IEC 61511-1 sect5253 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design sect1629 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour

The SINTEF Report A8788 Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase[9] provides useful guidance on the analysis of failures recorded during operation of a plant It suggests setting target values for the expected number of failures based on the failure rates assumed in the design If the actual measured failure rates are higher than the target it is necessary to analyse the causes of the failures Compensating measures to reduce the number of future failures must be considered The need for increased frequency of inspection and testing should also be considered but it is not sufficient to rely on increased frequency of testing alone

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 9

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment If the hardware failure rates are relatively low and the population of devices is small there may be too few failures to allow a rate to be measured

Detailed inspection testing and condition monitoring can then be used to provide leading indicators of incipient failure Most of the modes of failure should be predictable It should be feasible to prevent failures through condition based maintenance with overhaul renewal or replacement when required

A FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration The likelihood of failure depends on the condition of the components

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate access for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
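The effect of out-of-service time is easy to quantify. The back-of-envelope sketch below uses an assumed figure of 8 hours of bypass per year, purely for illustration.

```python
# Back-of-envelope sketch (assumed figures): unavailability added by taking a
# safety function out of service during normal operation.
HOURS_PER_YEAR = 8760.0

def bypass_unavailability(hours_out_per_year):
    """Fraction of the year during which the function provides no risk reduction."""
    return hours_out_per_year / HOURS_PER_YEAR

print(bypass_unavailability(8.0))   # ~9e-4: most of a SIL 3 budget (PFDavg < 1e-3) on its own
```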

Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the 'random' failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and at fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude; the rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990

  • Preventing preventable failures
  • The need for credible failure rates
  • 'Wrong but useful'
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • λ90 compared with λ70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported λ
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISO/TR 12489
  • Guidance from ISO/TR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for λDU
  • Typical feasible values for β
  • Typical values for T depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paper.pdf
    • PREVENT PREVENTABLE FAILURES

The basic assumption in failure probability
If undetected device failures occur at a constant average rate, then failures accumulate exponentially.

SIFs always require PFDavg < 0.1; in this region the accumulation is linear:

PFD ≈ λDU·t

With a proof test interval T, the average is simply PFDavg ≈ λDU·T/2
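A quick worked example of the 1oo1 approximation above, using the typical values quoted elsewhere in this presentation (illustrative only):

```python
# Worked example of the 1oo1 approximation, with typical values.
lambda_du = 0.02          # dangerous undetected failures per annum (actuated valve)
T = 1.0                   # proof test interval, years
pfd_avg = lambda_du * T / 2.0
print(pfd_avg, 1.0 / pfd_avg)   # 0.01 -> RRF of about 100, the SIL 2 boundary
```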

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.

PFD for 1oo2 final elements
In the process sector, SIF PFD is dominated by the PFD of final elements.

With 1oo2 valves the PFD is approximately:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

β is the fraction of failures that share a common cause.
Common cause failure always strongly dominates PFD.
The PFD depends equally on λDU, β and T.

Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements.
The failure rate:
• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures
OREDA reports failure rates that are feasible in practice.
Typically failure rates vary over at least an order of magnitude.
The scope for reduction in failures is clear: reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years, λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years, λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years, λDU ≈ 0.005 per annum.

These are order-of-magnitude estimates – but are sufficient to show feasibility of PFD and RRF targets.

(risk targets can only be in orders of magnitude)
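The quick conversion below simply restates these order-of-magnitude figures as rates; the device list mirrors the slide above.

```python
# Converting the order-of-magnitude MTBF figures above into failure rates.
for device, mtbf_years in (("sensor", 300), ("actuated valve assembly", 50), ("contactor or relay", 200)):
    lambda_du = 1.0 / mtbf_years          # dangerous undetected failures per annum
    print(f"{device}: λDU ≈ {lambda_du:.3f} per annum")
```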

Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% – 15% in practice (Ref SINTEF Report A26922)
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application
Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements, unless overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design

Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential
Operators must monitor the failure rates and modes:
• calculate actual λDU and compare with the design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
‒ preventing systematic failures
• volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.

Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and the maintenance practices. It is clear that:

• failures are almost never purely random, and as a result

• failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- identify and then control conditions that induce early failures

- actively prevent common-cause failures

4. Detailed analysis of all failures to:

- monitor failure rates

- determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
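The sketch below illustrates the two statements above: the exponential accumulation of undetected failures for a single component, and the common small-PFD simplification of summing subsystem PFDs for a series system. The subsystem values are illustrative assumptions, not figures from the paper.

```python
import math

def pfd_undetected(lambda_du, t_years):
    """Probability that a single component has failed undetected by time t."""
    return 1.0 - math.exp(-lambda_du * t_years)

# For PFD << 1 the linear approximation lambda*t is adequate:
print(pfd_undetected(0.02, 1.0), 0.02 * 1.0)        # 0.0198 vs 0.0200

# Series system, small-PFD simplification: sum the subsystem PFDs.
pfd_sensors, pfd_logic_solver, pfd_final_elements = 1.5e-3, 1.0e-4, 1.0e-2
print(pfd_sensors + pfd_logic_solver + pfd_final_elements)   # final elements dominate
```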

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H').

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H').

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with confidence level 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.

The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
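A sketch of that chi-squared estimate is shown below, assuming time-censored data and the common convention of 2n + 2 degrees of freedom for the upper bound. Conventions differ between references, so this reproduces the trend described above rather than the exact figures.

```python
# Sketch of the chi-squared confidence estimate (time-censored data assumed):
#   lambda_a = chi2.ppf(a, 2*n + 2) / (2 * tau), for n failures in tau hours of service.
from scipy.stats import chi2

def rate_upper_bound(n_failures, service_hours, confidence):
    return chi2.ppf(confidence, 2 * n_failures + 2) / (2.0 * service_hours)

for n in (3, 10):
    tau = 1.0e6                               # cumulative service hours (example value)
    ratio = rate_upper_bound(n, tau, 0.90) / rate_upper_bound(n, tau, 0.70)
    print(n, round(ratio, 2))                 # the ratio depends only on n, not on tau
```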

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard.'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy.'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and so are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process: degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU for an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
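The dominance argument is easy to check numerically; the sketch below evaluates both terms of the approximation for a few assumed values of β.

```python
# Numerical check of the dominance argument, using the typical values quoted above.
lambda_du, T = 0.02, 1.0
independent = (lambda_du * T) ** 2 / 3.0     # ~1.3e-4 (ignoring the small (1 - beta) factor)
for beta in (0.01, 0.02, 0.10):
    common_cause = beta * lambda_du * T / 2.0
    print(f"beta = {beta:.0%}: common cause term {common_cause:.1e} "
          f"vs independent term {independent:.1e}")
```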

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.

Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes a pertinent quote from George E. P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote appears in the context of a discussion of the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report a calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
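The trade-off for a 1oo1 final element can be sketched with the simple λDU·T/2 approximation used earlier; the values below are illustrative.

```python
# Exploring SIL 2 feasibility with a single (1oo1) final element, PFDavg ≈ λDU·T/2.
def pfd_1oo1(lambda_du, test_interval_years):
    return lambda_du * test_interval_years / 2.0

print(pfd_1oo1(0.02, 1.0))   # 0.010 - only just inside the SIL 2 band, no margin
print(pfd_1oo1(0.01, 1.0))   # 0.005 - achievable if lambda_DU can be halved
print(pfd_1oo1(0.02, 0.5))   # 0.005 - or if the proof test interval is halved
```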

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection testing and maintenance (reducing λDU and improving test coverage) If reducing β and λDU is not sufficient it may also be necessary to shorten the test interval T It is usually impractical to implement automatic continuous diagnostics on final elements

The end result of the PFD calculations is a set of operational performance targets for β λDU T and MTTR These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability Evidence is required to show that appropriate techniques and measure have been applied to prevent systematic failures This must include showing that the systems are suitable for the intended service and environment The level of effectiveness in these techniques and measures must be shown to be sufficient for the intended SIL

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design Evidence is also needed to show that operation inspection testing and maintenance of the systems will be effective in achieving the target failure performance

Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable

Measuring performance against benchmarks

IEC 61511-1 sect5253 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design sect1629 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour

The SINTEF Report A8788 Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase[9] provides useful guidance on the analysis of failures recorded during operation of a plant It suggests setting target values for the expected number of failures based on the failure rates assumed in the design If the actual measured failure rates are higher than the target it is necessary to analyse the causes of the failures Compensating measures to reduce the number of future failures must be considered The need for increased frequency of inspection and testing should also be considered but it is not sufficient to rely on increased frequency of testing alone

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 9

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment If the hardware failure rates are relatively low and the population of devices is small there may be too few failures to allow a rate to be measured

Detailed inspection testing and condition monitoring can then be used to provide leading indicators of incipient failure Most of the modes of failure should be predictable It should be feasible to prevent failures through condition based maintenance with overhaul renewal or replacement when required

A FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration The likelihood of failure depends on the condition of the components

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost The equipment is not designed to facilitate accessibility for testing or for maintenance

For example on LNG compression trains access to the equipment is often constrained Some critical final elements can only be taken out of service at intervals of 5 years or more Even if the designers have provided facilities to enable on-line condition monitoring and testing the opportunities for corrective maintenance are severely constrained by the need to maintain production Deteriorating equipment must remain in service until the next planned shutdown

It is common practice to install dutystandby pairs of pumps and motors where it is critical to maintain production The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose It facilitates on-line testing of sensors It is not common practice to provide dutystandby pairing for safety function final elements but it is possible Dutystandby service can be achieved at by using 2 x 1oo1 or 2oo4 architectures The justification for the additional cost depends on the value of process downtime that can be avoided

If dutystandby pairing is not provided for critical final elements then accessibility for on-line inspection testing and maintenance must be considered in the design A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration MTTR)

Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation

Preventing systematic failures

Systematic failures cannot be characterised by failure rate The probability of systematic faults existing within a system cannot be quantified with precision But by definition systematic failures can be eliminated after being detected The implication of this is that systematic failures can be prevented

Due diligence must be demonstrated in preventing systematic failures as far as is practicable in proportion to the target level of risk reduction Plant owners need to satisfy themselves that

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 10

appropriate processes techniques methods and measures have been applied with sufficient effectiveness to eliminate systematic failures The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures and must measure and monitor the effectiveness of those steps

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures The standards describe processes techniques methods and measures to prevent avoid and detect systematic faults and resulting failures

Activities techniques measures and procedures can be selected to detect or to prevent faults and failures IEC 61511-1 sect623 requires planning of activities criteria techniques measures and procedures throughout the safety system lifecycle The rationale needs to be recorded

Preventing lsquorandomrsquo failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random Most failures that are usually classed as random are actually preventable to some extent This includes all common cause failures

Conclusions Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates The precision in the estimates is limited by the uncertainty in the failure rates OREDA statistics clearly show that the failure rates vary over at least an order of magnitude The rates are not fixed and constant The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation inspection maintenance and renewal

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures Most failures fall somewhere in the middle between the two extremes of purely random and purely deterministic Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure In practice failures are not completely preventable because access to the equipment and resources are limited Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant These assumptions enable the PFD and RRF to be estimated within an order of magnitude given reasonably effective quality control in design manufacture operation and maintenance The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 11

Strategies for improving risk reduction

1 The first priority in functional safety is to eliminate prevent avoid or control systematic failures throughout the entire system lifecycle Failures are minimised by applying conventional quality management and project management practices This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function It includes the achievement of an appropriate level of systematic capability

The level of attention to detail and the effectiveness of the processes techniques and measures must be in proportion to the target level of risk reduction SIL 3 functions need much stricter quality control than SIL 1 functions

2 The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics accessibility inspection testing maintenance and renewal

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction The requirements also depend on the cost of downtime To achieve SIL 3 safety functions will always need to be designed to enable ready access for inspection testing and maintenance

The planning for inspection and testing should be in proportion to the target level of risk reduction Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment

3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions Common cause failures dominate the PFD in all voted architectures of sensors and of final elements

4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future

If the measured failure rates are higher than the target benchmark then the reasons need to be understood and remedial action taken Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 12

References

[1] lsquoFunctional safety ndash Safety instrumented systems for the process industry sector ndash Part 1 Framework definitions system hardware and application programming requirementsrsquo IEC 61511-12016

[2] lsquoEquipment reliability testing Part 6 Tests for the validity and estimation of the constant failure rate and constant failure intensityrsquo IEC 60605-62007

[3] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems ndash Part 4 Definitions and abbreviationsrsquo IEC 61508-42010

[4] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems - Guidelines on the application of IEC 61508-2 and IEC 61508-3rsquo IEC 61508-62010

[5] lsquoPetroleum petrochemical and natural gas industries mdash Reliability modelling and calculation of safety systemsrsquo ISOTR 12489

[6] lsquoSafety Integrity Level (SIL) Verification of Safety Instrumented Functionsrsquo ISA-TR840002-2015

[7] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Method Handbookrsquo SINTEF Trondheim Norway Report A24442 2013

[8] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Data Handbookrsquo SINTEF Trondheim Norway Report A24443 2013

[9] S Hauge and M A Lundteigen lsquoGuidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phasersquo SINTEF Trondheim Norway Report A8788 2008

[10] S Hauge et al lsquoCommon Cause Failures in Safety Instrumented Systemsrsquo SINTEF Trondheim Norway Report A26922 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook Vol 1 6th ed SINTEF Technology and Society Department of Safety Research Trondheim Norway 2015

[12] Safety Equipment Reliability Handbook 4th ed exidacom LLC Sellersville PA 2015

[13] FHC Marriott lsquoA Dictionary of Statistical Termsrsquo 5th Edition International Statistical Institute Longman Scientific and Technical 1990

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • l90 compared with l70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported l
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for 120582 119863119880
  • Typical feasible values for 120573
  • Typical values for 119879 depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paperpdf
    • PREVENT PREVENTABLE FAILURES
Page 26: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common cause120573120573120582120582119863119863119863119863119879119879

2

PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common causeCommon cause failure always strongly dominates PFDThe PFD depends equally on 120582120582119863119863119863119863 120573120573 and 119879119879

120573120573120582120582119863119863119863119863119879119879

2

Dependence on failure rateOur estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elementsThe failure rate

bull can be measuredbull may be predictable bull is not a fixed constant rate

Typical failure rates are easy to obtainFailure rates can be controlled and reduced

Managing and reducing failuresOREDA reports failure rates that are feasible in practiceTypically failure rates vary over at least an order of magnitudeThe scope for reduction in failure is clearReduction in rate by a factor of 10 may be feasible

Typical feasible values for 120582120582119863119863119863119863Sensors typically have MTBF asymp 300 years120582120582119863119863119863119863 asymp 0003 per annum

Actuated valve assemblies typically have MTBF asymp 50 years120582120582119863119863119863119863 asymp 002 per annum

Contactors or relays typically have MTBF asymp 200 years120582120582119863119863119863119863 asymp 0005 per annum

These are order of magnitude estimates ndashbut are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude)

Typical feasible values for β

IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% – 15% in practice (Ref: SINTEF Report A26922):
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application

Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed

For typical values (β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y), SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements unless the overall λDU can be reduced to < 0.02 pa (MTBF_DU > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design

(A rough feasibility check is sketched below.)
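The feasibility check on this slide can be reproduced in a few lines. The sketch below is illustrative only: the function names and example values are assumptions taken from the typical figures quoted here, the 1oo1 and 1oo2 expressions are the simple approximations used in this paper, and the bands are the low-demand PFD ranges of IEC 61511.

```python
def pfd_1oo1(lam_du, T):
    # Average PFD of a single final element (simple lam_du*T/2 approximation)
    return lam_du * T / 2

def pfd_1oo2(lam_du, T, beta):
    # Average PFD of a 1oo2 pair, including the common cause contribution
    return (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

def sil_band(pfd):
    # Low-demand SIL bands (order of magnitude only)
    if pfd >= 0.1:
        return "below SIL 1"
    if pfd >= 0.01:
        return "SIL 1"
    if pfd >= 0.001:
        return "SIL 2"
    return "SIL 3"

lam_du, T, beta = 0.02, 1.0, 0.10          # typical values quoted above
print(sil_band(pfd_1oo1(lam_du, T)))       # PFD = 0.01   -> SIL 1 only
print(sil_band(pfd_1oo2(lam_du, T, beta))) # PFD ~ 1.1e-3 -> SIL 2, short of SIL 3
```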

Design assumptions set performance benchmarks

Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential

Operators must monitor the failure rates and modes:
• calculate the actual λDU and compare it with the design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
‒ preventing systematic failures
• volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.

Presented at the 13th International TÜV Rheinland Symposium, May 15-16 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

bull Failures are almost never purely random and as a result

bull Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure

Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
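As an illustration of this model (a minimal sketch, not from the paper; the function names are assumptions), the code below computes the instantaneous and test-interval-averaged PFD of a single component from a constant rate λDU, using the exponential accumulation described above.

```python
import math

def pfd_at_time(lam_du, t):
    """Probability that an undetected dangerous fault has accumulated by time t
    (years) since the last proof test, for a constant rate lam_du per year."""
    return 1.0 - math.exp(-lam_du * t)

def pfd_average(lam_du, T):
    """Average of pfd_at_time over one proof-test interval T; for small
    lam_du*T this approaches the familiar approximation lam_du*T/2."""
    return 1.0 - (1.0 - math.exp(-lam_du * T)) / (lam_du * T)

# Example: lam_du = 0.02 per annum, proof tested every year
print(pfd_at_time(0.02, 1.0))   # ~0.0198 just before the proof test
print(pfd_average(0.02, 1.0))   # ~0.0099, close to 0.02 * 1 / 2 = 0.01
```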

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.

The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
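A hedged sketch of this estimate is shown below, assuming time-terminated data and the usual χ² upper bound λa = χ²(a, 2n+2)/(2T); the function name is illustrative and scipy is assumed to be available.

```python
from scipy.stats import chi2

def lambda_upper_bound(n_failures, time_in_service_years, confidence):
    """Upper confidence bound on a constant failure rate, per year, from
    n_failures observed over a cumulative time in service (time-terminated)."""
    dof = 2 * n_failures + 2
    return chi2.ppf(confidence, dof) / (2.0 * time_in_service_years)

# The ratio lambda90 / lambda70 depends only on the number of failures:
for n in (3, 10, 30):
    ratio = chi2.ppf(0.90, 2 * n + 2) / chi2.ppf(0.70, 2 * n + 2)
    print(n, round(ratio, 2))   # roughly 1.40, 1.24 and 1.14
```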

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.

The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.

The reasons for the wide variability in reported failure rates seem to be clear:

bull Only a small proportion of failures occur at a fixed rate that cannot be changed

bull The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

bull The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

bull The rates of systematic failures vary widely from user to user depending on the effectiveness of the quality management practices

bull No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
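The dominance of the common cause term can be checked numerically. The sketch below simply evaluates the two terms of the approximation above for the example values in the text; the β values tried are illustrative.

```python
lam_du = 0.02   # dangerous undetected failure rate, per annum
T = 1.0         # proof test interval, years

for beta in (0.02, 0.05, 0.10, 0.15):
    independent = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause = beta * lam_du * T / 2
    print(f"beta={beta:.0%}: independent={independent:.1e}, "
          f"common cause={common_cause:.1e}")

# Even at beta = 2% the common cause term (2.0e-4) exceeds the independent
# term (1.3e-4); at the beta values seen in practice it dominates outright.
```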

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.

Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E. P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. with 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
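One simple way to apply this guidance is to compare the observed number of failures with the number expected from the benchmark rate, and to treat an improbably high count as a trigger for root cause analysis. The sketch below is a hypothetical example (the device count, period and benchmark rate are invented) using a Poisson model consistent with the constant-rate assumption.

```python
from scipy.stats import poisson

def benchmark_check(n_observed, n_devices, years, lam_benchmark):
    """Expected failure count from the design benchmark rate, and the
    probability of seeing at least n_observed failures if that rate held."""
    expected = lam_benchmark * n_devices * years
    p_at_least = poisson.sf(n_observed - 1, expected)
    return expected, p_at_least

# Hypothetical: 40 valves over 2 years, benchmark 0.02 failures per valve-year
expected, p = benchmark_check(n_observed=5, n_devices=40, years=2,
                              lam_benchmark=0.02)
print(expected, p)   # ~1.6 expected; P(>=5) ~ 0.02, so investigate the causes
```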

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise the initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.
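The MTTR contribution can be made concrete with a rough calculation. This sketch assumes the usual approximation that detected (or bypassed) failures contribute λDD × MTTR to the average unavailability; the rates and repair times are invented for illustration.

```python
HOURS_PER_YEAR = 8760

def pfd_detected(lam_dd_per_year, mttr_hours):
    # Fraction of time the function is unavailable while awaiting repair
    return lam_dd_per_year * mttr_hours / HOURS_PER_YEAR

print(pfd_detected(0.1, 72))    # ~8.2e-4: a 3-day repair on a device with a 10-year MTBF of detected failures
print(pfd_detected(0.1, 720))   # ~8.2e-3: waiting a month erodes even a SIL 2 target
```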

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing lsquorandomrsquo failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.

Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.

References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F. H. C. Marriott, 'A Dictionary of Statistical Terms', 5th edition, International Statistical Institute, Longman Scientific and Technical, 1990

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • l90 compared with l70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported l
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for 120582 119863119880
  • Typical feasible values for 120573
  • Typical values for 119879 depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paperpdf
    • PREVENT PREVENTABLE FAILURES
Page 27: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

PFD for 1oo2 final elementsIn the process sector SIF PFD is dominated by the PFD of final elements

With 1oo2 valves the PFD is approximately

119875119875119875119875119875119875119860119860119860119860119860119860 asymp 1 minus 120573120573 120582120582119863119863119863119863119879119879 2

3+

120573120573 is the fraction of failures that share common causeCommon cause failure always strongly dominates PFDThe PFD depends equally on 120582120582119863119863119863119863 120573120573 and 119879119879

120573120573120582120582119863119863119863119863119879119879

2

Dependence on failure rateOur estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elementsThe failure rate

bull can be measuredbull may be predictable bull is not a fixed constant rate

Typical failure rates are easy to obtainFailure rates can be controlled and reduced

Managing and reducing failuresOREDA reports failure rates that are feasible in practiceTypically failure rates vary over at least an order of magnitudeThe scope for reduction in failure is clearReduction in rate by a factor of 10 may be feasible

Typical feasible values for 120582120582119863119863119863119863Sensors typically have MTBF asymp 300 years120582120582119863119863119863119863 asymp 0003 per annum

Actuated valve assemblies typically have MTBF asymp 50 years120582120582119863119863119863119863 asymp 002 per annum

Contactors or relays typically have MTBF asymp 200 years120582120582119863119863119863119863 asymp 0005 per annum

These are order of magnitude estimates ndashbut are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude)

Typical feasible values for 120573120573IEC 61508-6 Annex D suggests a range of 1 to 10

Typically 120573120573 asymp 12 - 15 in practice (Ref SINTEF Report A26922)

bull difficult to reduce below 5 with similar devicesbull strongly dominates PFD of voted architectures

Minimise 120573120573 through bull independence and diversity in design and maintenance bull preventing systematic failures

Typical values for 119879119879 depend on application Batch process 119879119879 lt 01 years

‒ Results in low PFD final elements might not dominate

Continuous process 119879119879 asymp 1 year

LNG production 119879119879 gt 4 years‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce 119879119879

What architecture is typically neededFor typical values 120573120573 asymp 10 120582120582119863119863119863119863 asymp 002 pa 119879119879 asymp 1 ySIL 1 is always easy to achieve without HFT

SIL 2 needs 1oo2 final elements unless overall 120582120582119863119863119863119863 can be reduced to lt 002 pa (MTBFDU gt 50 y)

SIL 3 would need 1oo2 final elements andoverall 120582120582119863119863119863119863 lt 002 pa

bull either 120582120582119863119863119863119863 120573120573 andor 119879119879 must be minimisedbull needs close attention in design

Design assumptions set performance benchmarksProbability of SIF failure on demand is proportional to

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

120573120573 120582120582119863119863119863119863 119879119879 and MTTR

Monitoring performance is essentialOperators must monitor the failure rates and modes

bull Calculate actual 120582120582119863119863119863119863 compare with design assumptionbull Analyse discrepancies between expected and actual behaviourbull Root cause analysis prevent the preventable failures

Operators must also monitor MTTR against targetsbull Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performanceFailure rates can be reduced if maintainers have ready access to the equipment for

bull inspectionbull testingbull maintenance and repair

Accessibility and testability must be considered and specified as design requirements

bull additional costbull requirement depends on target RRF PFD and 120582120582119863119863119863119863bull may only be necessary for SIL 2 and SIL 3

Suitability of componentsCertification and prior use are also importantbull Evidence of quality ie systematic capabilitybull Evidence of suitability for service and environment

‒ preventing systematic failuresbull Volume of operating experience provides evidence that

‒ systematic capability is adequate‒ inspection and maintenance is effective‒ failures are monitored analysed and understood‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 sect1153 and IEC 61508-2 sect74433)

Emphasis on prevention instead of calculationPrevent systematic failures through quality

bull systematic capability proven in usebull deliberate quality planning how much quality is enoughbull must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce 120640120640119915119915119915119915bull avoid systematic failures and common cause failures by designbull enable deterioration to be detected (diagnostics inspection test)bull enable equipment to be repaired or renewed before it failsbull ensure testability and accessibility

Monitor and control failure performance in operationbull reduce failure rates that exceed benchmarks

39

Credible reliability data

Performance targets must be feasible by design

equiv feasible performance targets

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 1

PREVENT PREVENTABLE FAILURES

M Generowicz FS Senior Expert (TUumlV Rheinland 18312) Engineering Manager A Hertel Principal Engineer

IampE Systems Pty Ltd

Perth Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions The new requirements are based on the IEC 61508-2 lsquoRoute 2Hrsquo method which relies on increased confidence levels in failure rates

Instead of requiring increased confidence level IEC 61511-1 requires lsquocredible and traceable reliability datarsquo in the quantification of failure probability It is not immediately obvious what this means in practice

Failure probability calculations are based on the assumption that failure rates are fixed and constant That assumption is invalid but the calculations are still useful

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications The variation depends largely on the service operating environment and maintenance practices It is clear that

bull Failures are almost never purely random and as a result

bull Failure rates are not fixed and constant

At best the risk reduction achieved by these safety functions can be estimated only within an order of magnitude Nevertheless even with such imprecise results the calculations are still useful

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions This is important in setting operational reliability benchmarks

Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices

Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 2

purely random Therefore safety function reliability performance can be improved through four key strategies

1 Eliminating systematic and common-cause failures throughout the design development and implementation processes and throughout operation and maintenance practices

2 Designing the equipment to allow access to enable sufficiently frequent inspection testing and maintenance and to enable suitable test coverage

3 Using risk-based inspection and condition-based maintenance techniques to

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4 Detailed analysis of all failures to

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction Functional safety maintains safety integrity of assets in two ways

Systematic safety integrity deals with preventable failures These are failures resulting from errors and shortcomings in the design manufacture installation operation maintenance and modification of the safeguarding systems

Hardware safety integrity deals with controlling random hardware failures These are the failures that occur at a reasonably constant rate and are completely independent of each other They are not preventable and cannot be avoided or eliminated but the probability of these failures occurring (and the resulting risk reduction) can be calculated

Calculation methods

IEC 61508-62010 Annex B[4] provides basic guidance on evaluating probabilities of failure State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO 12489[5]

Several other useful references are available on this subject including ISA-TR840002-2015[6] and the SINTEF PDS Method Handbook[7] These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved The calculations are all based on these assumptions

bull Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

bull Failures occurring within a population of devices are assumed to be independent events

ISOTR 12489 uses the term lsquocatalectic failurersquo to clarify the types of failures that can be expected to occur at fixed rates

329 catalectic failure sudden and complete failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 3

Note 1 to entry This term has been introduced by analogy with the catalectic verses (ie a verse with seven foots instead of eight) which stop abruptly Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item It is the contrary of failures occurring progressively and incompletely

Note 2 to entry Catalectic failures characterize simple components with constant failure rates (exponential law) they remain permanently ldquoas good as newrdquo until they fail suddenly completely and without warning Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach)

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions The required level of redundancy increases with the risk reduction required

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards

IEC 61508 provides two strategies for minimising the required hardware fault tolerance

bull Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (lsquoRoute 1Hrsquo)

bull Increasing the confidence level in the measured failure rates to at least 90 (lsquoRoute 2Hrsquo)

A confidence level of 90 effectively means that there is only a 10 chance that the true average failure rate is greater than the estimated value

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H though it requires only a confidence level of 70 However IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment

Failure rate confidence level

If all of the failures for a given type of equipment are recorded the failure rate λ can be estimated with any required level of confidence by applying a χsup2 (chi-squared) distribution The failure rate estimated with confidence level of lsquoarsquo is designated λa The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 4

The ratio of λ90 to λ70 depends only on the number of failures recorded It does not depend directly on the failure rate itself or on the population size The width of the uncertainty band becomes narrower with each recorded failure A good estimate for λ can be obtained with as few as 3 failures If 10 or more failures have been recorded the overall confidence is increased λ90 will be no more than about 20 higher than λ70

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure

The tables also show the upper and lower limits of a 90 uncertainty interval for the reported failure rates This is the band stretching from the 5 certainty level to the 95 certainty level The certainty level is not the same as the confidence level relating to a single dataset but the intent is similar

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
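The following sketch simply evaluates the approximation above, with the detected-failure contribution λDD × MTTR added as just described; the parameter values are illustrative assumptions only:

```python
# Sketch: average PFD of a 1oo2 final-element subsystem using the
# approximation quoted above, plus the detected-failure term (lambda_DD * MTTR).
def pfd_1oo2(lambda_du, test_interval, beta, lambda_dd=0.0, mttr_years=0.0):
    independent = (1.0 - beta) * (lambda_du * test_interval) ** 2 / 3.0
    common_cause = beta * lambda_du * test_interval / 2.0
    detected = lambda_dd * mttr_years       # detected failures awaiting repair
    return independent + common_cause + detected

lambda_du = 0.02   # dangerous undetected failures per annum (typical valve assembly)
T = 1.0            # proof-test interval in years
for beta in (0.02, 0.05, 0.10, 0.15):
    print(f"beta = {beta:.0%}: PFDavg ≈ {pfd_1oo2(lambda_du, T, beta):.1e}")
```

For these values the common cause term β·λDU·T/2 supplies most of the calculated PFD even at β = 2%, consistent with the point made above.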

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
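As an illustration of how such targets translate into order-of-magnitude feasibility checks (the band boundaries are the standard low-demand SIL ranges; the architectures and parameter values are assumptions for the example):

```python
# Sketch: order-of-magnitude feasibility check for candidate final-element architectures.
def pfd_1oo1(lambda_du, test_interval):
    return lambda_du * test_interval / 2.0

def pfd_1oo2(lambda_du, test_interval, beta):
    return ((1.0 - beta) * (lambda_du * test_interval) ** 2 / 3.0
            + beta * lambda_du * test_interval / 2.0)

def sil_band(pfd):
    bands = [(1e-2, 1e-1, "SIL 1"), (1e-3, 1e-2, "SIL 2"),
             (1e-4, 1e-3, "SIL 3"), (1e-5, 1e-4, "SIL 4")]
    for lower, upper, label in bands:
        if lower <= pfd < upper:
            return label
    return "outside the low-demand SIL bands"

lam, T = 0.02, 1.0    # assumed lambda_DU (per annum) and proof-test interval (years)
print("1oo1 valve:            ", sil_band(pfd_1oo1(lam, T)))          # 1e-2, marginal
print("1oo2 valves, beta = 10%:", sil_band(pfd_1oo2(lam, T, 0.10)))   # ~1.1e-3 -> SIL 2
print("1oo2 valves, beta = 5%: ", sil_band(pfd_1oo2(lam, T, 0.05)))   # ~6.3e-4 -> SIL 3
```

The calculated values only say which SIL range is feasible; the assumed β, λDU, T and MTTR then become the benchmarks to be achieved in operation.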

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
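A minimal sketch of that kind of comparison, assuming a simple Poisson check rather than the specific procedure in SINTEF Report A8788, and using invented figures:

```python
# Sketch: compare failures recorded during operation against the number
# expected from the failure rate assumed in the design.
from scipy.stats import poisson

lambda_assumed = 0.02   # assumed dangerous undetected failure rate, per device-year
population = 40         # similar devices in service (illustrative)
years = 2.0             # follow-up period
observed = 4            # failures actually recorded in that period

expected = lambda_assumed * population * years
# Probability of seeing at least 'observed' failures if the assumed rate holds
p_value = poisson.sf(observed - 1, expected)
print(f"expected ≈ {expected:.1f}, observed = {observed}, P(N >= {observed}) = {p_value:.2f}")
# A small probability (say below 0.05) would suggest the assumed rate is being
# exceeded and the causes of the recorded failures need to be analysed.
```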


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.
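Even with no recorded failures, the χ² relation used earlier still yields an upper confidence bound on the rate; with zero failures it reduces to −ln(1 − a)/t. A brief sketch with an assumed 150 trouble-free device-years:

```python
import math

def lambda_upper_no_failures(time_in_service: float, confidence: float) -> float:
    # chi2.ppf(a, 2) == -2 * ln(1 - a), so the bound simplifies to -ln(1 - a) / t
    return -math.log(1.0 - confidence) / time_in_service

t = 150.0  # device-years accumulated with no recorded failures (illustrative)
print(f"lambda70 <= {lambda_upper_no_failures(t, 0.70):.4f} per year")
print(f"lambda90 <= {lambda_upper_no_failures(t, 0.90):.4f} per year")
```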

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990


Dependence on failure rate
Our estimate of PFD is only as good as our estimate of the undetected failure rate λDU of the final elements. The failure rate:

• can be measured
• may be predictable
• is not a fixed constant rate

Typical failure rates are easy to obtain. Failure rates can be controlled and reduced.

Managing and reducing failures
OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: a reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU
Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum

Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum

Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum

These are order-of-magnitude estimates, but are sufficient to show the feasibility of PFD and RRF targets

(risk targets can only be expressed in orders of magnitude)

Typical feasible values for β
IEC 61508-6 Annex D suggests a range of 1% to 10%

Typically β ≈ 12% to 15% in practice (Ref: SINTEF Report A26922)
• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application
Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T

What architecture is typically needed?
For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:
SIL 1 is always easy to achieve without HFT

SIL 2 needs 1oo2 final elements, unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y)

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa
• either λDU, β and/or T must be minimised
• needs close attention in design

Design assumptions set performance benchmarks
Probability of SIF failure on demand is proportional to β, λDU, T and MTTR

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

Monitoring performance is essential
Operators must monitor the failure rates and modes:
• Calculate the actual λDU and compare it with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance
Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components
Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation
Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design


PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
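For a single component proof-tested at interval T, the standard low-demand relationship behind this statement can be written as follows (shown for completeness; this is the usual approximation rather than a quotation from the cited references):

```latex
F(t) = 1 - e^{-\lambda_{DU}\,t} \approx \lambda_{DU}\,t ,
\qquad
\mathrm{PFD}_{avg} = \frac{1}{T}\int_{0}^{T} F(t)\,dt \approx \frac{\lambda_{DU}\,T}{2}
```

The per-component values are then combined according to the voting architecture of the subsystem.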

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards

IEC 61508 provides two strategies for minimising the required hardware fault tolerance

bull Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (lsquoRoute 1Hrsquo)

bull Increasing the confidence level in the measured failure rates to at least 90 (lsquoRoute 2Hrsquo)

A confidence level of 90 effectively means that there is only a 10 chance that the true average failure rate is greater than the estimated value

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H though it requires only a confidence level of 70 However IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment

Failure rate confidence level

If all of the failures for a given type of equipment are recorded the failure rate λ can be estimated with any required level of confidence by applying a χsup2 (chi-squared) distribution The failure rate estimated with confidence level of lsquoarsquo is designated λa The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 4

The ratio of λ90 to λ70 depends only on the number of failures recorded It does not depend directly on the failure rate itself or on the population size The width of the uncertainty band becomes narrower with each recorded failure A good estimate for λ can be obtained with as few as 3 failures If 10 or more failures have been recorded the overall confidence is increased λ90 will be no more than about 20 higher than λ70

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure

The tables also show the upper and lower limits of a 90 uncertainty interval for the reported failure rates This is the band stretching from the 5 certainty level to the 95 certainty level The certainty level is not the same as the confidence level relating to a single dataset but the intent is similar

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications Some users consistently achieve failure rates at least 10 times lower than other users The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design operation and maintenance

One reason for the variability in rates is that these datasets include all failures systematic failures as well as random failures

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure The definitions for random failure vary between the different standards and references but they are generally consistent with the dictionary definitions of the word lsquorandomrsquo

lsquoMade done or happening without method or conscious decision haphazardrsquo

ISOTR 124892013[5] Annex B explains that both hardware failures and human failures can occur at random It makes it clear that not all random failures occur at a constant rate Constant failure rates are typical in electronic components before they reach the end of their useful life Random failures of mechanical components are caused by deterioration due to age and wear but the failure rates are not constant

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 5

The definitions of random failure in ISA TR840002-2015[6] IEC 61508-42010[3] and IEC 61511-12016[1] are all similar

lsquohellipfailure occurring at a random time which results from one or more of the possible degradation mechanisms in the hardwarersquo

The mathematical analysis of failure probability is based on the concept of a lsquorandom processrsquo or lsquostochastic processrsquo In this context the usage of the word random is narrower For the mathematical analysis these standards all assume that

lsquofailure rates arising from random hardware failures can be predicted with reasonable accuracyrsquo

Failures that occur at a fixed constant rate are purely random but in practice only a small proportion of random failures are purely random The following definition of a purely random process is from lsquoA Dictionary of Statistical Termsrsquo by FHC Marriott[13]

lsquoThe simplest example of a stationary process where in discrete time all the random variables z are mutually independent In continuous time the process is sometimes referred to as ldquowhite noiserdquo relating to the energy characteristics of certain physical phenomenarsquo

The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random

By contrast the standards note that systematic failures cannot be characterised by a fixed rate though to some extent the probability of systematic faults existing in a system may be estimated The standards are reasonably consistent in their definition of systematic failure

lsquohellipfailure related in a deterministic way to a certain cause or pre-existing fault Systematic failures can be eliminated after being detected while random hardware failures cannotrsquo

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random Very few failures are purely random

Failures of mechanical components may seem to be random but the causes are usually well known and understood and are partially deterministic The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed

Degradation of mechanical components is not a purely random process Degradation can be monitored

While it might be theoretically possible to prevent failure from degradation it is simply not practicable to prevent all failures Inspection and maintenance can never be perfect It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance overhaul and renewal They are characterised by a constant failure rate even though that rate is not fixed

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 6

The reasons for the wide variability in reported failure rates seem to be clear

bull Only a small proportion of failures occur at a fixed rate that cannot be changed

bull The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

bull The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

bull The rates of systematic failures vary widely from user to user depending on the effectiveness of the quality management practices

bull No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided the common cause failures of redundant devices are modelled assuming a common cause factor β This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time

Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors With fault tolerance in final elements the PFD is strongly dominated by common cause failures

As an example consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 002 failures per annum (approximately 1 failure in 50 years or around 23 failures per 106 hours) [8] [11] [12] The average PFD of the valve subsystem may be approximated by

119875119875119875119875119875119875119860119860119860119860119860119860 asymp (1 minus 120573120573)(120582120582119863119863119863119863119879119879)2

3+ 120573120573

1205821205821198631198631198631198631198791198792

The last term in this equation represents the contribution of common cause failures It will dominate the PFD unless the following is true

120573120573 ≪ 120582120582119863119863119863119863119879119879

If the test interval T is 1 year and λDU = 002 pa then the common cause failure term will be greater than the first term unless β lt 2 and such a low value is difficult to achieve

For the common cause failure term to be negligible the test interval T would usually have to be significantly longer than 1 year andor the β-factor would have to be much less than 2

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10 Typical values achieved are in the range 12 to 15

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7

Clearly the PFD for undetected failures is directly proportional to β λDU and T

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George EP Box

lsquoEssentially all models are wrong but some are usefulrsquo

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β Similarly we can conclude that although the models used for calculating PFD are wrong they are still useful

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant and the band of uncertainty spans more than one order of magnitude The statistics suggest that the failure rates of sensors are also not constant

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant It is misleading to report calculated PFD with more than one significant figure of precision

Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements

The example value of 002 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves For contactors or circuit breakers it is feasible to achieve failure rates lower than 001 pa

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (ie 1oo1 architecture) using either a valve or contactor

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture but the PFD may be marginal particularly if a valve is used as the final element If actuated valves are used attention will need to be given to minimising λDU Alternatively the PFD may be reduced by reducing the interval

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 8

between proof tests T If these parameters cannot be minimised it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance If the final elements are actuated valves then the β-factor andor λDU will need to be minimised to achieve even the minimum risk reduction of 1000

This will result in design requirements that improve independence (reducing β) and facilitate inspection testing and maintenance (reducing λDU and improving test coverage) If reducing β and λDU is not sufficient it may also be necessary to shorten the test interval T It is usually impractical to implement automatic continuous diagnostics on final elements

The end result of the PFD calculations is a set of operational performance targets for β λDU T and MTTR These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability Evidence is required to show that appropriate techniques and measure have been applied to prevent systematic failures This must include showing that the systems are suitable for the intended service and environment The level of effectiveness in these techniques and measures must be shown to be sufficient for the intended SIL

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design Evidence is also needed to show that operation inspection testing and maintenance of the systems will be effective in achieving the target failure performance

Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable

Measuring performance against benchmarks

IEC 61511-1 sect5253 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design sect1629 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour

The SINTEF Report A8788 Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase[9] provides useful guidance on the analysis of failures recorded during operation of a plant It suggests setting target values for the expected number of failures based on the failure rates assumed in the design If the actual measured failure rates are higher than the target it is necessary to analyse the causes of the failures Compensating measures to reduce the number of future failures must be considered The need for increased frequency of inspection and testing should also be considered but it is not sufficient to rely on increased frequency of testing alone

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 9

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment If the hardware failure rates are relatively low and the population of devices is small there may be too few failures to allow a rate to be measured

Detailed inspection testing and condition monitoring can then be used to provide leading indicators of incipient failure Most of the modes of failure should be predictable It should be feasible to prevent failures through condition based maintenance with overhaul renewal or replacement when required

A FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration The likelihood of failure depends on the condition of the components

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
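As a rough illustration of that proportionality (the numbers are assumed for the example, not taken from this paper): if a safety function is bypassed for inspection, testing and repair for a total of about 175 hours in a year, the unavailability contributed by the bypass time alone is roughly 175 / 8760 ≈ 0.02. That contribution already exceeds the SIL 2 limit of PFDavg < 0.01 no matter how low λDU is, which is why bypass time and MTTR need to be controlled as tightly as the failure rate itself.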

Designing the system to facilitate inspection, testing and maintenance enables both λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990

Managing and reducing failures

OREDA reports failure rates that are feasible in practice. Typically failure rates vary over at least an order of magnitude. The scope for reduction in failures is clear: a reduction in rate by a factor of 10 may be feasible.

Typical feasible values for λDU

Sensors typically have MTBF ≈ 300 years: λDU ≈ 0.003 per annum.

Actuated valve assemblies typically have MTBF ≈ 50 years: λDU ≈ 0.02 per annum.

Contactors or relays typically have MTBF ≈ 200 years: λDU ≈ 0.005 per annum.

These are order-of-magnitude estimates – but they are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude)
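A minimal sketch showing how these order-of-magnitude figures translate into a feasibility check for a simple 1oo1 function, assuming the usual approximation PFDavg ≈ λDU · T / 2 and a one-year proof-test interval (the MTBF figures are the indicative values above, not measured data):

def lambda_du_from_mtbf(mtbf_years):
    # Order-of-magnitude conversion under the constant-rate assumption
    return 1.0 / mtbf_years

def pfd_1oo1(lambda_du_pa, test_interval_years):
    # Simplified average PFD for a single (1oo1) element
    return lambda_du_pa * test_interval_years / 2.0

for name, mtbf_years in [("sensor", 300), ("actuated valve", 50), ("contactor", 200)]:
    lam = lambda_du_from_mtbf(mtbf_years)
    pfd = pfd_1oo1(lam, 1.0)
    print(f"{name}: lambda_DU ~ {lam:.3f}/yr, PFDavg ~ {pfd:.4f}, RRF ~ {1.0 / pfd:.0f}")

On these assumptions a single valve gives an RRF of only about 100, which is why the following slides show SIL 2 and SIL 3 needing either lower λDU, shorter test intervals or redundant final elements.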

Typical feasible values for β

IEC 61508-6 Annex D suggests a range of 1% to 10%.

Typically β ≈ 12% – 15% in practice (ref. SINTEF Report A26922):

• difficult to reduce below 5% with similar devices
• strongly dominates PFD of voted architectures

Minimise β through:
• independence and diversity in design and maintenance
• preventing systematic failures

Typical values for T depend on application

Batch process: T < 0.1 years
‒ results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T.

What architecture is typically needed

For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:

SIL 1 is always easy to achieve without HFT.

SIL 2 needs 1oo2 final elements unless overall λDU can be reduced to < 0.02 pa (MTBF for DU failures > 50 y).

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 pa:
• either λDU, β and/or T must be minimised
• needs close attention in design

Design assumptions set performance benchmarks

Probability of SIF failure on demand is proportional to β, λDU, T and MTTR.

The values assumed in design set performance standards for availability and reliability in operation and maintenance.

Achieving these performance standards depends on the design and on operation and maintenance practices.

Monitoring performance is essential

Operators must monitor the failure rates and modes:
• calculate actual λDU and compare with the design assumption
• analyse discrepancies between expected and actual behaviour
• root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• evidence of quality, i.e. systematic capability
• evidence of suitability for service and environment
‒ preventing systematic failures
• volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design.


PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland, 18312), Engineering Manager; A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• failures are almost never purely random, and as a result

• failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
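A brief sketch of the underlying arithmetic, using the standard constant-rate expressions (these follow from the exponential model described above; they are stated here for illustration rather than quoted from the standards): for a single component with dangerous undetected failure rate λDU and proof-test interval T,

PFD(t) = 1 − e^(−λDU · t) ≈ λDU · t (for λDU · t ≪ 1), and PFDavg ≈ λDU · T / 2.

For two independent components that must both fail before the subsystem fails (1oo2), the instantaneous probabilities multiply; averaging over the test interval gives the (λDU · T)² / 3 term that appears in the 1oo2 approximation later in this paper, with common cause failures added separately through the β-factor.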

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
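A short sketch of how λ70 and λ90 can be computed from recorded failures and accumulated service time, using the standard χ² relationship for an upper confidence bound (illustrative only; the cumulative time figure is an assumption):

from scipy.stats import chi2

def lambda_upper(failures, cumulative_hours, confidence):
    # Upper confidence bound on the failure rate from n failures in T device-hours
    dof = 2 * failures + 2
    return chi2.ppf(confidence, dof) / (2.0 * cumulative_hours)

T = 5.0e6  # assumed accumulated service time in device-hours
for n in (3, 10):
    l70 = lambda_upper(n, T, 0.70)
    l90 = lambda_upper(n, T, 0.90)
    print(f"{n} failures: lambda_70 = {l70:.2e}/h, lambda_90 = {l90:.2e}/h, ratio = {l90 / l70:.2f}")

With 3 recorded failures the ratio λ90/λ70 comes out at about 1.4; with 10 failures it falls to about 1.2, consistent with the statement above that λ90 is then no more than about 20% higher than λ70.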

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard.'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed, constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed, constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β) · (λDU · T)² / 3 + β · λDU · T / 2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU · T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
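Putting the typical values quoted in this paper into the approximation above makes the dominance of the common cause term concrete (a sketch only):

beta = 0.10        # common cause fraction, 10%
lambda_du = 0.02   # dangerous undetected failures per annum
T = 1.0            # proof-test interval in years

independent_term = (1 - beta) * (lambda_du * T) ** 2 / 3   # about 1.2e-4
common_cause_term = beta * lambda_du * T / 2               # about 1.0e-3

pfd_avg = independent_term + common_cause_term
print(independent_term, common_cause_term, pfd_avg)

With these figures the common cause term is roughly eight times larger than the independent term, and the overall PFDavg of about 1.1 × 10⁻³ corresponds to an RRF of roughly 900.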

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.


Clearly the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
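For example, with the typical figures used above (an illustration with assumed values, not a design rule): a single valve with λDU = 0.02 pa and a one-year proof-test interval gives PFDavg ≈ λDU · T / 2 = 0.01, i.e. a risk reduction factor of only about 100, which is right at the bottom of the SIL 2 band. Halving either λDU or the proof-test interval roughly doubles that margin; if neither is practicable, the 1oo2 arrangement is the remaining option.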

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable

Measuring performance against benchmarks

IEC 61511-1 sect5253 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design sect1629 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour

The SINTEF Report A8788 Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase[9] provides useful guidance on the analysis of failures recorded during operation of a plant It suggests setting target values for the expected number of failures based on the failure rates assumed in the design If the actual measured failure rates are higher than the target it is necessary to analyse the causes of the failures Compensating measures to reduce the number of future failures must be considered The need for increased frequency of inspection and testing should also be considered but it is not sufficient to rely on increased frequency of testing alone

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 9

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment If the hardware failure rates are relatively low and the population of devices is small there may be too few failures to allow a rate to be measured

Detailed inspection testing and condition monitoring can then be used to provide leading indicators of incipient failure Most of the modes of failure should be predictable It should be feasible to prevent failures through condition based maintenance with overhaul renewal or replacement when required

A FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration The likelihood of failure depends on the condition of the components

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost The equipment is not designed to facilitate accessibility for testing or for maintenance

For example on LNG compression trains access to the equipment is often constrained Some critical final elements can only be taken out of service at intervals of 5 years or more Even if the designers have provided facilities to enable on-line condition monitoring and testing the opportunities for corrective maintenance are severely constrained by the need to maintain production Deteriorating equipment must remain in service until the next planned shutdown

It is common practice to install dutystandby pairs of pumps and motors where it is critical to maintain production The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose It facilitates on-line testing of sensors It is not common practice to provide dutystandby pairing for safety function final elements but it is possible Dutystandby service can be achieved at by using 2 x 1oo1 or 2oo4 architectures The justification for the additional cost depends on the value of process downtime that can be avoided

If dutystandby pairing is not provided for critical final elements then accessibility for on-line inspection testing and maintenance must be considered in the design A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration MTTR)

Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation

Preventing systematic failures

Systematic failures cannot be characterised by failure rate The probability of systematic faults existing within a system cannot be quantified with precision But by definition systematic failures can be eliminated after being detected The implication of this is that systematic failures can be prevented

Due diligence must be demonstrated in preventing systematic failures as far as is practicable in proportion to the target level of risk reduction Plant owners need to satisfy themselves that

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 10

appropriate processes techniques methods and measures have been applied with sufficient effectiveness to eliminate systematic failures The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures and must measure and monitor the effectiveness of those steps

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures The standards describe processes techniques methods and measures to prevent avoid and detect systematic faults and resulting failures

Activities techniques measures and procedures can be selected to detect or to prevent faults and failures IEC 61511-1 sect623 requires planning of activities criteria techniques measures and procedures throughout the safety system lifecycle The rationale needs to be recorded

Preventing lsquorandomrsquo failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random Most failures that are usually classed as random are actually preventable to some extent This includes all common cause failures

Conclusions Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates The precision in the estimates is limited by the uncertainty in the failure rates OREDA statistics clearly show that the failure rates vary over at least an order of magnitude The rates are not fixed and constant The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation inspection maintenance and renewal

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures Most failures fall somewhere in the middle between the two extremes of purely random and purely deterministic Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure In practice failures are not completely preventable because access to the equipment and resources are limited Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant These assumptions enable the PFD and RRF to be estimated within an order of magnitude given reasonably effective quality control in design manufacture operation and maintenance The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 11

Strategies for improving risk reduction

1 The first priority in functional safety is to eliminate prevent avoid or control systematic failures throughout the entire system lifecycle Failures are minimised by applying conventional quality management and project management practices This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function It includes the achievement of an appropriate level of systematic capability

The level of attention to detail and the effectiveness of the processes techniques and measures must be in proportion to the target level of risk reduction SIL 3 functions need much stricter quality control than SIL 1 functions

2 The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics accessibility inspection testing maintenance and renewal

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction The requirements also depend on the cost of downtime To achieve SIL 3 safety functions will always need to be designed to enable ready access for inspection testing and maintenance

The planning for inspection and testing should be in proportion to the target level of risk reduction Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment

3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions Common cause failures dominate the PFD in all voted architectures of sensors and of final elements

4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future

If the measured failure rates are higher than the target benchmark then the reasons need to be understood and remedial action taken Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 12

References

[1] lsquoFunctional safety ndash Safety instrumented systems for the process industry sector ndash Part 1 Framework definitions system hardware and application programming requirementsrsquo IEC 61511-12016

[2] lsquoEquipment reliability testing Part 6 Tests for the validity and estimation of the constant failure rate and constant failure intensityrsquo IEC 60605-62007

[3] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems ndash Part 4 Definitions and abbreviationsrsquo IEC 61508-42010

[4] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems - Guidelines on the application of IEC 61508-2 and IEC 61508-3rsquo IEC 61508-62010

[5] lsquoPetroleum petrochemical and natural gas industries mdash Reliability modelling and calculation of safety systemsrsquo ISOTR 12489

[6] lsquoSafety Integrity Level (SIL) Verification of Safety Instrumented Functionsrsquo ISA-TR840002-2015

[7] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Method Handbookrsquo SINTEF Trondheim Norway Report A24442 2013

[8] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Data Handbookrsquo SINTEF Trondheim Norway Report A24443 2013

[9] S Hauge and M A Lundteigen lsquoGuidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phasersquo SINTEF Trondheim Norway Report A8788 2008

[10] S Hauge et al lsquoCommon Cause Failures in Safety Instrumented Systemsrsquo SINTEF Trondheim Norway Report A26922 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook Vol 1 6th ed SINTEF Technology and Society Department of Safety Research Trondheim Norway 2015

[12] Safety Equipment Reliability Handbook 4th ed exidacom LLC Sellersville PA 2015

[13] FHC Marriott lsquoA Dictionary of Statistical Termsrsquo 5th Edition International Statistical Institute Longman Scientific and Technical 1990

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • l90 compared with l70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported l
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for 120582 119863119880
  • Typical feasible values for 120573
  • Typical values for 119879 depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paperpdf
    • PREVENT PREVENTABLE FAILURES
Page 30: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

Typical feasible values for 120582120582119863119863119863119863Sensors typically have MTBF asymp 300 years120582120582119863119863119863119863 asymp 0003 per annum

Actuated valve assemblies typically have MTBF asymp 50 years120582120582119863119863119863119863 asymp 002 per annum

Contactors or relays typically have MTBF asymp 200 years120582120582119863119863119863119863 asymp 0005 per annum

These are order of magnitude estimates ndashbut are sufficient to show feasibility of PFD and RRF targets

(risk targets can only be in orders of magnitude)

Typical feasible values for 120573120573IEC 61508-6 Annex D suggests a range of 1 to 10

Typically 120573120573 asymp 12 - 15 in practice (Ref SINTEF Report A26922)

bull difficult to reduce below 5 with similar devicesbull strongly dominates PFD of voted architectures

Minimise 120573120573 through bull independence and diversity in design and maintenance bull preventing systematic failures

Typical values for 119879119879 depend on application Batch process 119879119879 lt 01 years

‒ Results in low PFD final elements might not dominate

Continuous process 119879119879 asymp 1 year

LNG production 119879119879 gt 4 years‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce 119879119879

What architecture is typically neededFor typical values 120573120573 asymp 10 120582120582119863119863119863119863 asymp 002 pa 119879119879 asymp 1 ySIL 1 is always easy to achieve without HFT

SIL 2 needs 1oo2 final elements unless overall 120582120582119863119863119863119863 can be reduced to lt 002 pa (MTBFDU gt 50 y)

SIL 3 would need 1oo2 final elements andoverall 120582120582119863119863119863119863 lt 002 pa

bull either 120582120582119863119863119863119863 120573120573 andor 119879119879 must be minimisedbull needs close attention in design

Design assumptions set performance benchmarksProbability of SIF failure on demand is proportional to

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

120573120573 120582120582119863119863119863119863 119879119879 and MTTR

Monitoring performance is essentialOperators must monitor the failure rates and modes

bull Calculate actual 120582120582119863119863119863119863 compare with design assumptionbull Analyse discrepancies between expected and actual behaviourbull Root cause analysis prevent the preventable failures

Operators must also monitor MTTR against targetsbull Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performanceFailure rates can be reduced if maintainers have ready access to the equipment for

bull inspectionbull testingbull maintenance and repair

Accessibility and testability must be considered and specified as design requirements

bull additional costbull requirement depends on target RRF PFD and 120582120582119863119863119863119863bull may only be necessary for SIL 2 and SIL 3

Suitability of componentsCertification and prior use are also importantbull Evidence of quality ie systematic capabilitybull Evidence of suitability for service and environment

‒ preventing systematic failuresbull Volume of operating experience provides evidence that

‒ systematic capability is adequate‒ inspection and maintenance is effective‒ failures are monitored analysed and understood‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 sect1153 and IEC 61508-2 sect74433)

Emphasis on prevention instead of calculationPrevent systematic failures through quality

bull systematic capability proven in usebull deliberate quality planning how much quality is enoughbull must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce 120640120640119915119915119915119915bull avoid systematic failures and common cause failures by designbull enable deterioration to be detected (diagnostics inspection test)bull enable equipment to be repaired or renewed before it failsbull ensure testability and accessibility

Monitor and control failure performance in operationbull reduce failure rates that exceed benchmarks

39

Credible reliability data

Performance targets must be feasible by design

equiv feasible performance targets

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 1

PREVENT PREVENTABLE FAILURES

M Generowicz FS Senior Expert (TUumlV Rheinland 18312) Engineering Manager A Hertel Principal Engineer

IampE Systems Pty Ltd

Perth Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that the failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure

Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components
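As a brief illustration of these assumptions (standard approximations, not additional material from the cited references): for a single component with a constant dangerous undetected failure rate λDU, the probability of an undetected failure having accumulated by time t after a proof test is PFD(t) = 1 − e^(−λDU·t) ≈ λDU·t when λDU·t ≪ 1, so the average over a proof-test interval T is PFDavg ≈ λDU·T/2. For independent components in series, the system PFD is then approximately the sum of the individual component PFDs.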

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.

The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
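As an illustration only (not part of the original paper), the confidence bounds can be evaluated with a few lines of Python. The sketch assumes a time-terminated estimate using 2n + 2 degrees of freedom; the function name lambda_upper and the example numbers are hypothetical:

from scipy.stats import chi2

def lambda_upper(n_failures, t_total_hours, confidence):
    # upper one-sided bound on the failure rate (per hour) at the given confidence level
    return chi2.ppf(confidence, 2 * n_failures + 2) / (2.0 * t_total_hours)

n, t = 3, 2.0e6                     # e.g. 3 failures over 2 million device-hours (illustrative)
l70 = lambda_upper(n, t, 0.70)
l90 = lambda_upper(n, t, 0.90)
print(l70, l90, l90 / l70)          # the ratio λ90/λ70 depends only on n, not on t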

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of the equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA[11] show that the failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook[12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in the rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices

• No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDAVG ≈ (1 − β)(λDU T)²/3 + β λDU T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
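A minimal numerical sketch of this approximation, using the typical values quoted above (the Python below is illustrative only, not a project calculation):

lam_du = 0.02                        # dangerous undetected failures per year
T = 1.0                              # proof-test interval in years
beta = 0.12                          # common cause factor of 12 %, within the typical range

independent_term = (1 - beta) * (lam_du * T) ** 2 / 3
common_cause_term = beta * lam_du * T / 2
pfd_avg = independent_term + common_cause_term
print(independent_term, common_cause_term, pfd_avg, 1 / pfd_avg)
# roughly 1.2e-4, 1.2e-3 and 1.3e-3, i.e. RRF of about 760: the common cause term dominates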


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)
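For example (illustrative figures only, not values from the references): with λDD = 0.1 pa of detected dangerous failures and an MTTR of 36 hours (about 0.004 years), the contribution to the PFD is approximately λDD × MTTR ≈ 4 × 10⁻⁴.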

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
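A rough feasibility check (using the standard 1oo1 approximation; the figures are illustrative only): with PFDavg ≈ λDU·T/2, a valve with λDU = 0.02 pa tested yearly gives PFDavg ≈ 0.01 (RRF ≈ 100), only just inside the SIL 2 range, whereas a contactor with λDU = 0.01 pa gives PFDavg ≈ 0.005 (RRF ≈ 200).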

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and they are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
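A minimal sketch of this kind of comparison (a deliberate simplification, not the procedure from SINTEF A8788 itself; all numbers and names below are hypothetical):

assumed_lambda_du = 0.02                  # per device per year, from the design calculation
n_devices, years_in_service = 40, 4.0
observed_failures = 5                     # dangerous undetected failures found by proof tests

expected_failures = assumed_lambda_du * n_devices * years_in_service   # about 3.2 here
if observed_failures > expected_failures:
    print("Failure rate exceeds the design benchmark: analyse causes and consider compensating measures")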


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
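For example (a hypothetical condition indicator, not taken from the paper): a monitored parameter such as valve closing time from successive partial-stroke tests can be trended against an alert limit derived from the FMEDA study, so that deterioration is acted on before a dangerous failure occurs:

closing_times_s = [2.1, 2.2, 2.4, 2.9, 3.5]   # successive test results in seconds (illustrative)
alert_limit_s = 3.0                            # limit assumed from the FMEDA / vendor data

if closing_times_s[-1] > alert_limit_s:
    print("Deterioration detected: schedule overhaul or replacement before the next demand")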

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise the initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 1

PREVENT PREVENTABLE FAILURES

M Generowicz FS Senior Expert (TUumlV Rheinland 18312) Engineering Manager A Hertel Principal Engineer

IampE Systems Pty Ltd

Perth Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions The new requirements are based on the IEC 61508-2 lsquoRoute 2Hrsquo method which relies on increased confidence levels in failure rates

Instead of requiring increased confidence level IEC 61511-1 requires lsquocredible and traceable reliability datarsquo in the quantification of failure probability It is not immediately obvious what this means in practice

Failure probability calculations are based on the assumption that failure rates are fixed and constant That assumption is invalid but the calculations are still useful

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications The variation depends largely on the service operating environment and maintenance practices It is clear that

bull Failures are almost never purely random and as a result

bull Failure rates are not fixed and constant

At best the risk reduction achieved by these safety functions can be estimated only within an order of magnitude Nevertheless even with such imprecise results the calculations are still useful

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions This is important in setting operational reliability benchmarks

Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices

Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 2

purely random Therefore safety function reliability performance can be improved through four key strategies

1 Eliminating systematic and common-cause failures throughout the design development and implementation processes and throughout operation and maintenance practices

2 Designing the equipment to allow access to enable sufficiently frequent inspection testing and maintenance and to enable suitable test coverage

3 Using risk-based inspection and condition-based maintenance techniques to

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4 Detailed analysis of all failures to

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction Functional safety maintains safety integrity of assets in two ways

Systematic safety integrity deals with preventable failures These are failures resulting from errors and shortcomings in the design manufacture installation operation maintenance and modification of the safeguarding systems

Hardware safety integrity deals with controlling random hardware failures These are the failures that occur at a reasonably constant rate and are completely independent of each other They are not preventable and cannot be avoided or eliminated but the probability of these failures occurring (and the resulting risk reduction) can be calculated

Calculation methods

IEC 61508-62010 Annex B[4] provides basic guidance on evaluating probabilities of failure State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO 12489[5]

Several other useful references are available on this subject including ISA-TR840002-2015[6] and the SINTEF PDS Method Handbook[7] These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved The calculations are all based on these assumptions

bull Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

bull Failures occurring within a population of devices are assumed to be independent events

ISOTR 12489 uses the term lsquocatalectic failurersquo to clarify the types of failures that can be expected to occur at fixed rates

329 catalectic failure sudden and complete failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 3

Note 1 to entry This term has been introduced by analogy with the catalectic verses (ie a verse with seven foots instead of eight) which stop abruptly Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item It is the contrary of failures occurring progressively and incompletely

Note 2 to entry Catalectic failures characterize simple components with constant failure rates (exponential law) they remain permanently ldquoas good as newrdquo until they fail suddenly completely and without warning Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach)

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions The required level of redundancy increases with the risk reduction required

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards

IEC 61508 provides two strategies for minimising the required hardware fault tolerance

bull Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (lsquoRoute 1Hrsquo)

bull Increasing the confidence level in the measured failure rates to at least 90 (lsquoRoute 2Hrsquo)

A confidence level of 90 effectively means that there is only a 10 chance that the true average failure rate is greater than the estimated value

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H though it requires only a confidence level of 70 However IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment

Failure rate confidence level

If all of the failures for a given type of equipment are recorded the failure rate λ can be estimated with any required level of confidence by applying a χsup2 (chi-squared) distribution The failure rate estimated with confidence level of lsquoarsquo is designated λa The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 4

The ratio of λ90 to λ70 depends only on the number of failures recorded It does not depend directly on the failure rate itself or on the population size The width of the uncertainty band becomes narrower with each recorded failure A good estimate for λ can be obtained with as few as 3 failures If 10 or more failures have been recorded the overall confidence is increased λ90 will be no more than about 20 higher than λ70

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure

The tables also show the upper and lower limits of a 90 uncertainty interval for the reported failure rates This is the band stretching from the 5 certainty level to the 95 certainty level The certainty level is not the same as the confidence level relating to a single dataset but the intent is similar

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications Some users consistently achieve failure rates at least 10 times lower than other users The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design operation and maintenance

One reason for the variability in rates is that these datasets include all failures systematic failures as well as random failures

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure The definitions for random failure vary between the different standards and references but they are generally consistent with the dictionary definitions of the word lsquorandomrsquo

lsquoMade done or happening without method or conscious decision haphazardrsquo

ISOTR 124892013[5] Annex B explains that both hardware failures and human failures can occur at random It makes it clear that not all random failures occur at a constant rate Constant failure rates are typical in electronic components before they reach the end of their useful life Random failures of mechanical components are caused by deterioration due to age and wear but the failure rates are not constant

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 5

The definitions of random failure in ISA TR840002-2015[6] IEC 61508-42010[3] and IEC 61511-12016[1] are all similar

lsquohellipfailure occurring at a random time which results from one or more of the possible degradation mechanisms in the hardwarersquo

The mathematical analysis of failure probability is based on the concept of a lsquorandom processrsquo or lsquostochastic processrsquo In this context the usage of the word random is narrower For the mathematical analysis these standards all assume that

lsquofailure rates arising from random hardware failures can be predicted with reasonable accuracyrsquo

Failures that occur at a fixed constant rate are purely random but in practice only a small proportion of random failures are purely random The following definition of a purely random process is from lsquoA Dictionary of Statistical Termsrsquo by FHC Marriott[13]

lsquoThe simplest example of a stationary process where in discrete time all the random variables z are mutually independent In continuous time the process is sometimes referred to as ldquowhite noiserdquo relating to the energy characteristics of certain physical phenomenarsquo

The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random

By contrast the standards note that systematic failures cannot be characterised by a fixed rate though to some extent the probability of systematic faults existing in a system may be estimated The standards are reasonably consistent in their definition of systematic failure

lsquohellipfailure related in a deterministic way to a certain cause or pre-existing fault Systematic failures can be eliminated after being detected while random hardware failures cannotrsquo

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random Very few failures are purely random

Failures of mechanical components may seem to be random but the causes are usually well known and understood and are partially deterministic The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed

Degradation of mechanical components is not a purely random process Degradation can be monitored

While it might be theoretically possible to prevent failure from degradation it is simply not practicable to prevent all failures Inspection and maintenance can never be perfect It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance overhaul and renewal They are characterised by a constant failure rate even though that rate is not fixed

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 6

The reasons for the wide variability in reported failure rates seem to be clear

bull Only a small proportion of failures occur at a fixed rate that cannot be changed

bull The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

bull The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

bull The rates of systematic failures vary widely from user to user depending on the effectiveness of the quality management practices

bull No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided the common cause failures of redundant devices are modelled assuming a common cause factor β This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time

Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors With fault tolerance in final elements the PFD is strongly dominated by common cause failures

As an example consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 002 failures per annum (approximately 1 failure in 50 years or around 23 failures per 106 hours) [8] [11] [12] The average PFD of the valve subsystem may be approximated by

119875119875119875119875119875119875119860119860119860119860119860119860 asymp (1 minus 120573120573)(120582120582119863119863119863119863119879119879)2

3+ 120573120573

1205821205821198631198631198631198631198791198792

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU T

If the test interval T is 1 year and λDU = 0.02 p.a., then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year, and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
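
The dominance of the common cause term can be illustrated with a minimal Python sketch of the 1oo2 approximation above, using the example values quoted in this section (λDU = 0.02 p.a., T = 1 year, β = 10%). The function name and the illustrative figures are this sketch's own, not values from the referenced handbooks.

    def pfd_avg_1oo2(lambda_du, test_interval, beta):
        """Approximate average PFD of a 1oo2 final element subsystem.

        lambda_du     : dangerous undetected failure rate (per year)
        test_interval : proof test interval T (years)
        beta          : common cause factor (fraction, e.g. 0.10)
        """
        independent = (1.0 - beta) * (lambda_du * test_interval) ** 2 / 3.0
        common_cause = beta * lambda_du * test_interval / 2.0
        return independent, common_cause

    ind, ccf = pfd_avg_1oo2(lambda_du=0.02, test_interval=1.0, beta=0.10)
    print(f"independent term  : {ind:.1e}")   # roughly 1.2e-04
    print(f"common cause term : {ccf:.1e}")   # roughly 1.0e-03, about 90% of the total
    print(f"PFDavg            : {ind + ccf:.1e}")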


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures, λDD, and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
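
As a small illustration of reporting at this level of precision, the sketch below rounds a calculated PFD to one significant figure and maps it to the corresponding low demand SIL band; the helper names are illustrative, and the band limits are the standard low demand ranges.

    import math

    def round_to_1_sig_fig(x):
        # keep only one significant figure, e.g. 1.12e-3 -> 1e-3
        exponent = math.floor(math.log10(abs(x)))
        return round(x, -exponent)

    def sil_band_low_demand(pfd_avg):
        # low demand bands: SIL n corresponds to 10^-(n+1) <= PFDavg < 10^-n
        if pfd_avg >= 0.1:
            return "below SIL 1"
        for sil in (1, 2, 3, 4):
            if 10.0 ** -(sil + 1) <= pfd_avg < 10.0 ** -sil:
                return f"SIL {sil}"
        return "better than SIL 4"

    pfd = 1.12e-3
    rrf = 1.0 / pfd
    print(round_to_1_sig_fig(pfd), round_to_1_sig_fig(rrf), sil_band_low_demand(pfd))
    # -> 0.001 900.0 SIL 2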

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 p.a. quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 p.a.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and they are all under the control of the designers and the operations and maintenance team.
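
These trade-offs can be explored with a short sketch. The candidate values below (a 1oo1 contactor, a 1oo1 valve, and 1oo2 valves with progressively tighter β, λDU and T) are illustrative assumptions, not recommendations from the standards.

    def pfd_1oo1(lambda_du, T):
        return lambda_du * T / 2.0

    def pfd_1oo2(lambda_du, T, beta):
        return (1.0 - beta) * (lambda_du * T) ** 2 / 3.0 + beta * lambda_du * T / 2.0

    candidates = {
        "1oo1 contactor, T = 1 y":                      pfd_1oo1(0.01, 1.0),
        "1oo1 valve, T = 1 y":                          pfd_1oo1(0.02, 1.0),
        "1oo2 valves, beta = 10%, T = 1 y":             pfd_1oo2(0.02, 1.0, 0.10),
        "1oo2 valves, beta = 5%, reduced lambda and T": pfd_1oo2(0.01, 0.5, 0.05),
    }
    for name, pfd in candidates.items():
        # the 1oo1 valve sits at the SIL 1/SIL 2 boundary; only the last
        # candidate reaches an RRF above 1000 (the minimum for SIL 3)
        print(f"{name:46s} PFDavg ~ {pfd:.0e}  RRF ~ {1 / pfd:.0f}")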

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
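
A minimal sketch of this kind of follow-up check is shown below: the target is the expected number of failures implied by the failure rate assumed in the design, as suggested by the SINTEF guideline, and the observed count comes from proof test records. The population size, observation period and counts are illustrative only.

    # Design assumption and observed field data (illustrative values)
    lambda_du_design = 0.02      # assumed DU failure rate, failures per device-year
    population = 120             # number of similar devices in service
    years_in_service = 3.0       # observation period
    observed_du_failures = 11    # DU failures found by proof testing in that period

    device_years = population * years_in_service
    expected_failures = lambda_du_design * device_years
    measured_rate = observed_du_failures / device_years

    print(f"expected DU failures : {expected_failures:.1f}")
    print(f"observed DU failures : {observed_du_failures}")
    print(f"measured rate        : {measured_rate:.3f} per device-year")

    if observed_du_failures > expected_failures:
        print("Above target: analyse the causes, consider compensating measures,"
              " and review the inspection and test intervals.")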


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components, and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
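
As a sketch of how such leading indicators might be tracked, the snippet below compares measured condition parameters against alert limits. The parameters, limits and readings are purely illustrative assumptions, not values taken from any FMEDA.

    # Illustrative leading indicators for an actuated valve assembly
    alert_limits = {
        "partial_stroke_travel_time_s": 8.0,    # alert if the stroke is slower than this
        "stem_friction_increase_pct": 20.0,     # alert if friction has risen more than this
        "seat_leakage_pct_of_limit": 80.0,      # alert if approaching the leakage limit
    }

    latest_readings = {
        "partial_stroke_travel_time_s": 9.2,
        "stem_friction_increase_pct": 12.0,
        "seat_leakage_pct_of_limit": 85.0,
    }

    for parameter, limit in alert_limits.items():
        reading = latest_readings[parameter]
        if reading > limit:
            print(f"{parameter}: {reading} exceeds alert limit {limit}"
                  " - plan corrective maintenance before failure")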

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise the initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
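
The contribution of repair and bypass time can be sketched in the same way as the other terms: for detected failures the unavailability is approximately λDD multiplied by MTTR, and planned bypass time adds directly to the average unavailability. The figures below are illustrative assumptions.

    HOURS_PER_YEAR = 8760.0

    lambda_dd = 0.05              # detected dangerous failures per year (illustrative)
    mttr_hours = 72.0             # mean time to restoration once a failure is detected
    bypass_hours_per_year = 12.0  # planned on-line test and maintenance bypass time

    pfd_detected = lambda_dd * (mttr_hours / HOURS_PER_YEAR)
    unavailability_bypass = bypass_hours_per_year / HOURS_PER_YEAR

    print(f"detected-failure contribution : {pfd_detected:.1e}")
    print(f"bypass contribution           : {unavailability_bypass:.1e}")
    # Halving MTTR or bypass time halves these contributions directly.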

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and if inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures, and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed, based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016.

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007.

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010.

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010.

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489.

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015.

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013.

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013.

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008.

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015.

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990.

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • λ90 compared with λ70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported λ
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for λDU
  • Typical feasible values for β
  • Typical values for T depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paper.pdf
    • PREVENT PREVENTABLE FAILURES

Typical values for T depend on application

Batch process: T < 0.1 years
‒ Results in low PFD; final elements might not dominate

Continuous process: T ≈ 1 year

LNG production: T > 4 years
‒ Low PFD is more difficult to achieve

Production constraints may make it difficult to reduce T

What architecture is typically needed

For typical values β ≈ 10%, λDU ≈ 0.02 p.a., T ≈ 1 y:

SIL 1 is always easy to achieve without HFT

SIL 2 needs 1oo2 final elements unless overall λDU can be reduced to < 0.02 p.a. (MTBFDU > 50 y)

SIL 3 would need 1oo2 final elements and overall λDU < 0.02 p.a.
• either λDU, β and/or T must be minimised
• needs close attention in design
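
A short sketch, reusing the 1oo2 approximation from the paper with the typical values above, shows how strongly the proof test interval drives the achievable PFD; the figures are illustrative only.

    def pfd_1oo2(lambda_du, T, beta):
        return (1.0 - beta) * (lambda_du * T) ** 2 / 3.0 + beta * lambda_du * T / 2.0

    for T, application in [(0.1, "batch"), (1.0, "continuous"), (4.0, "LNG, shutdown-limited")]:
        pfd = pfd_1oo2(lambda_du=0.02, T=T, beta=0.10)
        # T = 0.1 y gives roughly 1e-4, T = 1 y roughly 1e-3, T = 4 y roughly 6e-3
        print(f"T = {T:>4} y ({application:22s}) PFDavg ~ {pfd:.0e}")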

Design assumptions set performance benchmarks

Probability of SIF failure on demand is proportional to β, λDU, T and MTTR

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

Monitoring performance is essential

Operators must monitor the failure rates and modes
• Calculate actual λDU, compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design




What architecture is typically needed?

For typical values β ≈ 10%, λDU ≈ 0.02 pa, T ≈ 1 y:

SIL 1 is always easy to achieve without HFT

SIL 2 needs 1oo2 final elements unless the overall λDU can be reduced to < 0.02 pa (MTBFDU > 50 y)

SIL 3 would need 1oo2 final elements and an overall λDU < 0.02 pa

• either λDU, β and/or T must be minimised
• needs close attention in design

Design assumptions set performance benchmarks

Probability of SIF failure on demand is proportional to β, λDU, T and MTTR

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

Monitoring performance is essential

Operators must monitor the failure rates and modes:
• Calculate the actual λDU and compare it with the design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design

Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results, the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given service, operating environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random.


Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
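As a minimal sketch of these assumptions (the function names and example values below are illustrative, not taken from the references), the fraction of a population carrying an undetected dangerous fault grows as 1 − e^(−λDU·t) between proof tests, and the average PFD over a test interval T follows directly:

```python
import math

def pfd_instantaneous(lambda_du: float, t: float) -> float:
    """Probability that an undetected dangerous fault is present at time t
    after the last proof test, assuming a constant failure rate (exponential law)."""
    return 1.0 - math.exp(-lambda_du * t)

def pfd_average(lambda_du: float, test_interval: float) -> float:
    """Average PFD over the proof-test interval; for small lambda_du * T this is
    close to the familiar approximation lambda_du * T / 2."""
    return 1.0 - (1.0 - math.exp(-lambda_du * test_interval)) / (lambda_du * test_interval)

lambda_du = 0.02   # dangerous undetected failures per annum (illustrative)
T = 1.0            # proof-test interval in years
print(f"PFD at end of interval: {pfd_instantaneous(lambda_du, T):.4f}")
print(f"Average PFD:            {pfd_average(lambda_du, T):.4f}")
print(f"Approximation λDU·T/2:  {lambda_du * T / 2:.4f}")
```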

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
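This estimation is straightforward to script. The sketch below uses one common chi-squared form for an upper bound on the rate with time-truncated data, λa = χ²(a; 2n + 2) / (2·Ttotal); the choice of formula and the accumulated service time are assumptions made for this illustration rather than values given in the paper.

```python
from scipy.stats import chi2

def lambda_upper(n_failures: int, total_time_hours: float, confidence: float) -> float:
    """Upper confidence bound on the failure rate (failures per hour) for
    n_failures observed over total_time_hours of accumulated service."""
    return chi2.ppf(confidence, 2 * n_failures + 2) / (2.0 * total_time_hours)

total_time = 5.0e6  # accumulated device-hours in service (illustrative)
for n in (1, 3, 10, 30):
    l70 = lambda_upper(n, total_time, 0.70)
    l90 = lambda_upper(n, total_time, 0.90)
    print(f"n={n:3d}  λ70={l70:.2e}/h  λ90={l90:.2e}/h  ratio={l90 / l70:.2f}")
```

Running the loop shows the λ90/λ70 ratio falling towards 1 as more failures are recorded, which is the narrowing of the uncertainty band described above.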

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information, gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
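As a rough illustration of what such an interval conveys (this is not OREDA's actual estimation procedure, and the reported rates below are invented for the example), a lognormal spread can be fitted to a set of user-reported rates spanning an order of magnitude and the 5% and 95% points read off:

```python
import numpy as np

# Hypothetical failure rates reported by different users (failures per 10^6 hours)
reported_rates = np.array([0.4, 0.7, 1.1, 1.8, 2.3, 3.5, 6.0, 9.5])

mean_rate = reported_rates.mean()
# Fit a lognormal spread and read off the 5 % and 95 % points (z = ±1.645)
log_mu = np.log(reported_rates).mean()
log_sigma = np.log(reported_rates).std(ddof=1)
lower_5, upper_95 = np.exp(log_mu + np.array([-1.645, 1.645]) * log_sigma)

print(f"mean rate: {mean_rate:.1f} per 10^6 h")
print(f"90% uncertainty interval: {lower_5:.1f} to {upper_95:.1f} per 10^6 h")
```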

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices

• No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
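To make the dominance of the common cause term concrete, the short sketch below evaluates the two terms of the approximation above for the typical values already quoted (λDU = 0.02 pa, T = 1 y) and a few values of β; the script is simply an evaluation of that formula, with illustrative numbers.

```python
def pfd_1oo2_terms(lambda_du: float, beta: float, test_interval: float):
    """Split the approximate average PFD of a 1oo2 subsystem into its
    independent-failure term and its common cause term."""
    independent = (1.0 - beta) * (lambda_du * test_interval) ** 2 / 3.0
    common_cause = beta * lambda_du * test_interval / 2.0
    return independent, common_cause

lambda_du, T = 0.02, 1.0          # per annum, years (typical values above)
for beta in (0.02, 0.10, 0.15):   # 2 %, 10 %, 15 %
    ind, ccf = pfd_1oo2_terms(lambda_du, beta, T)
    total = ind + ccf
    print(f"β={beta:.0%}  independent={ind:.1e}  common cause={ccf:.1e}  "
          f"PFDavg={total:.1e}  RRF≈{1 / total:,.0f}")
```

With β at 10% the common cause term is nearly ten times the independent term, so the achievable risk reduction is set almost entirely by β.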


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful'

The quote is in the context of a discussion regarding the different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that the failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
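For reference, the small helper below maps an order-of-magnitude PFD estimate onto the IEC 61511 low-demand SIL bands; the function name and the example value are illustrative only.

```python
def sil_band_low_demand(pfd_avg: float) -> str:
    """Return the low-demand SIL band for an average PFD
    (SIL n covers 10^-(n+1) <= PFD < 10^-n)."""
    if 1e-2 <= pfd_avg < 1e-1:
        return "SIL 1 (RRF 10 to 100)"
    if 1e-3 <= pfd_avg < 1e-2:
        return "SIL 2 (RRF 100 to 1,000)"
    if 1e-4 <= pfd_avg < 1e-3:
        return "SIL 3 (RRF 1,000 to 10,000)"
    if 1e-5 <= pfd_avg < 1e-4:
        return "SIL 4 (RRF 10,000 to 100,000)"
    return "outside the SIL 1 to SIL 4 low-demand bands"

print(sil_band_low_demand(1.1e-3))   # e.g. the 1oo2 example above -> SIL 2
```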

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T.


If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.
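A rough feasibility screen along these lines can be scripted directly from the approximations used earlier; the target RRF and the parameter values below are illustrative assumptions, not recommended figures.

```python
def pfd_1oo1(lambda_du, test_interval):
    """Approximate average PFD of a single (1oo1) final element."""
    return lambda_du * test_interval / 2.0

def pfd_1oo2(lambda_du, beta, test_interval):
    """Approximate average PFD of 1oo2 final elements, including common cause."""
    return ((1.0 - beta) * (lambda_du * test_interval) ** 2 / 3.0
            + beta * lambda_du * test_interval / 2.0)

target_rrf = 1000                       # minimum RRF for a SIL 3 function
lambda_du, beta, T = 0.02, 0.10, 1.0    # illustrative starting values

for name, pfd in (("1oo1", pfd_1oo1(lambda_du, T)),
                  ("1oo2", pfd_1oo2(lambda_du, beta, T))):
    rrf = 1.0 / pfd
    verdict = "meets" if rrf >= target_rrf else "does not meet"
    print(f"{name}: PFDavg={pfd:.1e}, RRF={rrf:,.0f} -> {verdict} RRF {target_rrf}")
```

With these starting values neither architecture reaches an RRF of 1,000, which is why β, λDU and/or T have to be driven down for SIL 3.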

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
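One way such a follow-up check might be sketched is shown below: the observed number of dangerous undetected failures is compared with the number expected from the design failure rate under a simple Poisson assumption. The statistical test, the threshold and all parameter values are assumptions made for this sketch, not recommendations from the SINTEF report.

```python
from scipy.stats import poisson

def follow_up_check(lambda_design: float, device_years: float,
                    observed_failures: int, alpha: float = 0.05) -> None:
    """Compare observed failures with the number expected from the design rate.
    Flag the rate if a count this high would occur less than alpha of the time
    under a Poisson model at the design rate."""
    expected = lambda_design * device_years
    p_value = poisson.sf(observed_failures - 1, expected)  # P(X >= observed)
    print(f"expected {expected:.1f}, observed {observed_failures}, p = {p_value:.3f}")
    if p_value < alpha:
        print("-> failure rate appears higher than the design benchmark;"
              " investigate causes and consider compensating measures")
    else:
        print("-> consistent with the design benchmark")

# Example: 40 valves in service for 2 years, design λDU = 0.02 pa, 4 DU failures found
follow_up_check(lambda_design=0.02, device_years=80.0, observed_failures=4)
```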


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise the initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
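The sensitivity to restoration and bypass time is easy to see numerically. The sketch below evaluates the PFD contribution λDD × MTTR (expressed as a fraction of the year) for two restoration times; the λDD value and the times are illustrative assumptions.

```python
HOURS_PER_YEAR = 8760.0

def pfd_detected(lambda_dd: float, mttr_hours: float) -> float:
    """PFD contribution from dangerous detected failures: the function is
    unavailable for roughly MTTR hours after each detected failure."""
    return lambda_dd * (mttr_hours / HOURS_PER_YEAR)

# Illustrative values: λDD = 0.05 detected failures pa, MTTR of 8 h versus 168 h
for mttr in (8.0, 168.0):
    print(f"MTTR {mttr:5.0f} h -> PFD contribution {pfd_detected(0.05, mttr):.1e}")
```

A week of unavailability per detected failure contributes roughly 1e-3 to the PFD, which on its own would consume essentially the whole SIL 3 budget.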

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures.


The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • l90 compared with l70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported l
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for 120582 119863119880
  • Typical feasible values for 120573
  • Typical values for 119879 depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paperpdf
    • PREVENT PREVENTABLE FAILURES
Page 34: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

Design assumptions set performance benchmarksProbability of SIF failure on demand is proportional to

The values assumed in design set performance standards for availability and reliability in operation and maintenance

Achieving these performance standards depends on the design and on operation and maintenance practices

120573120573 120582120582119863119863119863119863 119879119879 and MTTR

Monitoring performance is essentialOperators must monitor the failure rates and modes

bull Calculate actual 120582120582119863119863119863119863 compare with design assumptionbull Analyse discrepancies between expected and actual behaviourbull Root cause analysis prevent the preventable failures

Operators must also monitor MTTR against targetsbull Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performanceFailure rates can be reduced if maintainers have ready access to the equipment for

bull inspectionbull testingbull maintenance and repair

Accessibility and testability must be considered and specified as design requirements

bull additional costbull requirement depends on target RRF PFD and 120582120582119863119863119863119863bull may only be necessary for SIL 2 and SIL 3

Suitability of componentsCertification and prior use are also importantbull Evidence of quality ie systematic capabilitybull Evidence of suitability for service and environment

‒ preventing systematic failuresbull Volume of operating experience provides evidence that

‒ systematic capability is adequate‒ inspection and maintenance is effective‒ failures are monitored analysed and understood‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 sect1153 and IEC 61508-2 sect74433)

Emphasis on prevention instead of calculationPrevent systematic failures through quality

bull systematic capability proven in usebull deliberate quality planning how much quality is enoughbull must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce 120640120640119915119915119915119915bull avoid systematic failures and common cause failures by designbull enable deterioration to be detected (diagnostics inspection test)bull enable equipment to be repaired or renewed before it failsbull ensure testability and accessibility

Monitor and control failure performance in operationbull reduce failure rates that exceed benchmarks

39

Credible reliability data

Performance targets must be feasible by design

equiv feasible performance targets

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 1

PREVENT PREVENTABLE FAILURES

M Generowicz FS Senior Expert (TUumlV Rheinland 18312) Engineering Manager A Hertel Principal Engineer

IampE Systems Pty Ltd

Perth Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions The new requirements are based on the IEC 61508-2 lsquoRoute 2Hrsquo method which relies on increased confidence levels in failure rates

Instead of requiring increased confidence level IEC 61511-1 requires lsquocredible and traceable reliability datarsquo in the quantification of failure probability It is not immediately obvious what this means in practice

Failure probability calculations are based on the assumption that failure rates are fixed and constant That assumption is invalid but the calculations are still useful

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications The variation depends largely on the service operating environment and maintenance practices It is clear that

bull Failures are almost never purely random and as a result

bull Failure rates are not fixed and constant

At best the risk reduction achieved by these safety functions can be estimated only within an order of magnitude Nevertheless even with such imprecise results the calculations are still useful

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions This is important in setting operational reliability benchmarks

Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices

Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 2

purely random Therefore safety function reliability performance can be improved through four key strategies

1 Eliminating systematic and common-cause failures throughout the design development and implementation processes and throughout operation and maintenance practices

2 Designing the equipment to allow access to enable sufficiently frequent inspection testing and maintenance and to enable suitable test coverage

3 Using risk-based inspection and condition-based maintenance techniques to

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4 Detailed analysis of all failures to

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction Functional safety maintains safety integrity of assets in two ways

Systematic safety integrity deals with preventable failures These are failures resulting from errors and shortcomings in the design manufacture installation operation maintenance and modification of the safeguarding systems

Hardware safety integrity deals with controlling random hardware failures These are the failures that occur at a reasonably constant rate and are completely independent of each other They are not preventable and cannot be avoided or eliminated but the probability of these failures occurring (and the resulting risk reduction) can be calculated

Calculation methods

IEC 61508-62010 Annex B[4] provides basic guidance on evaluating probabilities of failure State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO 12489[5]

Several other useful references are available on this subject including ISA-TR840002-2015[6] and the SINTEF PDS Method Handbook[7] These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved The calculations are all based on these assumptions

bull Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

bull Failures occurring within a population of devices are assumed to be independent events

ISOTR 12489 uses the term lsquocatalectic failurersquo to clarify the types of failures that can be expected to occur at fixed rates

329 catalectic failure sudden and complete failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 3

Note 1 to entry This term has been introduced by analogy with the catalectic verses (ie a verse with seven foots instead of eight) which stop abruptly Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item It is the contrary of failures occurring progressively and incompletely

Note 2 to entry Catalectic failures characterize simple components with constant failure rates (exponential law) they remain permanently ldquoas good as newrdquo until they fail suddenly completely and without warning Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach)

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions The required level of redundancy increases with the risk reduction required

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards

IEC 61508 provides two strategies for minimising the required hardware fault tolerance

bull Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (lsquoRoute 1Hrsquo)

bull Increasing the confidence level in the measured failure rates to at least 90 (lsquoRoute 2Hrsquo)

A confidence level of 90 effectively means that there is only a 10 chance that the true average failure rate is greater than the estimated value

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H though it requires only a confidence level of 70 However IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment

Failure rate confidence level

If all of the failures for a given type of equipment are recorded the failure rate λ can be estimated with any required level of confidence by applying a χsup2 (chi-squared) distribution The failure rate estimated with confidence level of lsquoarsquo is designated λa The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 4

The ratio of λ90 to λ70 depends only on the number of failures recorded It does not depend directly on the failure rate itself or on the population size The width of the uncertainty band becomes narrower with each recorded failure A good estimate for λ can be obtained with as few as 3 failures If 10 or more failures have been recorded the overall confidence is increased λ90 will be no more than about 20 higher than λ70

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure

The tables also show the upper and lower limits of a 90 uncertainty interval for the reported failure rates This is the band stretching from the 5 certainty level to the 95 certainty level The certainty level is not the same as the confidence level relating to a single dataset but the intent is similar

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications Some users consistently achieve failure rates at least 10 times lower than other users The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design operation and maintenance

One reason for the variability in rates is that these datasets include all failures systematic failures as well as random failures

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure The definitions for random failure vary between the different standards and references but they are generally consistent with the dictionary definitions of the word lsquorandomrsquo

lsquoMade done or happening without method or conscious decision haphazardrsquo

ISOTR 124892013[5] Annex B explains that both hardware failures and human failures can occur at random It makes it clear that not all random failures occur at a constant rate Constant failure rates are typical in electronic components before they reach the end of their useful life Random failures of mechanical components are caused by deterioration due to age and wear but the failure rates are not constant

The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy.'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent, and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure, related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.

The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices

• No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 p.a., then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.
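
A short numerical check of the equation above (a sketch only; the β value of 10% is assumed purely for illustration) shows how thoroughly the common cause term dominates with a one-year test interval:

def pfd_avg_1oo2(beta, lambda_du, test_interval):
    # Simplified average PFD of a 1oo2 subsystem, undetected dangerous failures only
    independent = (1.0 - beta) * (lambda_du * test_interval) ** 2 / 3.0
    common_cause = beta * lambda_du * test_interval / 2.0
    return independent, common_cause

independent, common_cause = pfd_avg_1oo2(beta=0.10, lambda_du=0.02, test_interval=1.0)
print(independent)                 # independent term, about 1.2e-4
print(common_cause)                # common cause term, 1.0e-3, roughly ten times larger
print(independent + common_cause)  # total PFDavg, about 1.1e-3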

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.

Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
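
As a rough illustration (a standard approximation with assumed numbers, not figures from the paper): the PFD contributed by detected failures is approximately λDD × MTTR, so a detected dangerous failure rate of 0.1 p.a. combined with an MTTR of 36 hours (about 0.004 years) contributes roughly 4 × 10⁻⁴ to the PFD.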

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision, because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise, but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.
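
For reference, the order-of-magnitude result is simply binned into the low-demand SIL bands as commonly tabulated from IEC 61511. A minimal sketch (the handling of values exactly on a boundary is a convention, not taken from the paper):

def sil_band(pfd_avg):
    # Low-demand bands: SIL 1: 1e-2 to 1e-1, SIL 2: 1e-3 to 1e-2,
    # SIL 3: 1e-4 to 1e-3, SIL 4: 1e-5 to 1e-4
    rrf = 1.0 / pfd_avg
    for sil, upper_limit in ((4, 1e-4), (3, 1e-3), (2, 1e-2), (1, 1e-1)):
        if pfd_avg < upper_limit:
            return sil, rrf
    return 0, rrf  # does not reach SIL 1

print(sil_band(1.1e-3))  # (2, ~900): only the order of magnitude is meaningful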

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 p.a. quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 p.a.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. with 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
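
As a rough check (using the standard single-channel approximation PFDavg ≈ λDU·T/2, which is an assumption of mine rather than an equation quoted in the paper): with λDU = 0.02 p.a. and T = 1 year, PFDavg ≈ 0.01, which sits right at the boundary between SIL 1 and SIL 2. Halving either λDU or the test interval brings the figure to about 0.005, more comfortably inside the SIL 2 band.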

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.
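
For illustration, putting typical values into the 1oo2 equation quoted earlier (these particular numbers are my own example): with λDU = 0.02 p.a., T = 1 year and β = 12%, the common cause term alone is 0.12 × 0.02 / 2 ≈ 1.2 × 10⁻³, so the risk reduction factor falls short of 1000. Reducing β to 5% (with λDU unchanged) gives a total PFD of roughly 6 × 10⁻⁴, a risk reduction factor of about 1600, which just clears the SIL 3 threshold.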

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
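
One simple way to set and check such a target (a sketch assuming a Poisson model and hypothetical numbers, in the spirit of that guidance rather than reproducing it):

from scipy.stats import poisson

def failure_benchmark(lambda_target, n_devices, years, observed_failures):
    # Expected failures from the design-assumed rate, and the probability of seeing
    # at least the observed count if that target rate were really being achieved
    expected = lambda_target * n_devices * years
    p_value = poisson.sf(observed_failures - 1, expected)
    return expected, p_value

# Hypothetical fleet: 40 valves, target lambda_DU of 0.02 p.a., 3 years, 5 DU failures found
print(failure_benchmark(0.02, 40, 3, 5))  # expected ~2.4 failures; p ~0.10, so analyse the causes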

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.

Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes, and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.

References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M.A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990


Monitoring performance is essential

Operators must monitor the failure rates and modes:
• Calculate actual λDU, compare with design assumption
• Analyse discrepancies between expected and actual behaviour
• Root cause analysis: prevent the preventable failures

Operators must also monitor MTTR against targets:
• Safety functions deliver zero RRF while bypassed

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
  ‒ preventing systematic failures
• Volume of operating experience provides evidence that:
  ‒ systematic capability is adequate
  ‒ inspection and maintenance is effective
  ‒ failures are monitored, analysed and understood
  ‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of risk reduction achieved have been based on failures being predominantly random in nature and on failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

• Failures occurring within a population of devices are assumed to be independent events

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure

Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.
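
As a sketch of that relationship (standard constant-rate results, not a quotation from the standards): the probability that a given component is in a dangerous undetected failed state at time t after its last proof test is PFD(t) = 1 − e^(−λDU·t) ≈ λDU·t when λDU·t is small, so the failed fraction of a large population grows almost linearly between proof tests, and its time average over a proof-test interval T is approximately λDU·T/2.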

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance, because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions, to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.


[3] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems ndash Part 4 Definitions and abbreviationsrsquo IEC 61508-42010

[4] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems - Guidelines on the application of IEC 61508-2 and IEC 61508-3rsquo IEC 61508-62010

[5] lsquoPetroleum petrochemical and natural gas industries mdash Reliability modelling and calculation of safety systemsrsquo ISOTR 12489

[6] lsquoSafety Integrity Level (SIL) Verification of Safety Instrumented Functionsrsquo ISA-TR840002-2015

[7] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Method Handbookrsquo SINTEF Trondheim Norway Report A24442 2013

[8] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Data Handbookrsquo SINTEF Trondheim Norway Report A24443 2013

[9] S Hauge and M A Lundteigen lsquoGuidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phasersquo SINTEF Trondheim Norway Report A8788 2008

[10] S Hauge et al lsquoCommon Cause Failures in Safety Instrumented Systemsrsquo SINTEF Trondheim Norway Report A26922 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook Vol 1 6th ed SINTEF Technology and Society Department of Safety Research Trondheim Norway 2015

[12] Safety Equipment Reliability Handbook 4th ed exidacom LLC Sellersville PA 2015

[13] FHC Marriott lsquoA Dictionary of Statistical Termsrsquo 5th Edition International Statistical Institute Longman Scientific and Technical 1990

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • l90 compared with l70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported l
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for 120582 119863119880
  • Typical feasible values for 120573
  • Typical values for 119879 depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paperpdf
    • PREVENT PREVENTABLE FAILURES
Page 36: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

Design for maintainability to enable performance

Failure rates can be reduced if maintainers have ready access to the equipment for:
• inspection
• testing
• maintenance and repair

Accessibility and testability must be considered and specified as design requirements:
• additional cost
• requirement depends on target RRF, PFD and λDU
• may only be necessary for SIL 2 and SIL 3

Suitability of components

Certification and prior use are also important:
• Evidence of quality, i.e. systematic capability
• Evidence of suitability for service and environment
‒ preventing systematic failures
• Volume of operating experience provides evidence that:
‒ systematic capability is adequate
‒ inspection and maintenance is effective
‒ failures are monitored, analysed and understood
‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 §11.5.3 and IEC 61508-2 §7.4.4.3.3)

Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks


Credible reliability data

Performance targets must be feasible by design

≡ feasible performance targets


PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager; A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced.

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
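To make that arithmetic concrete, the short sketch below (illustrative only; the failure rate, test interval and subsystem values are assumed examples, not values from this paper) computes the accumulated probability of an undetected fault over a proof-test interval, compares the exact average PFD with the usual λDU·T/2 approximation, and combines independent subsystem PFDs into a function PFD.

```python
import math

# Illustrative sketch only: the numbers below are assumed examples.
lam_du = 0.02 / 8760   # dangerous undetected failure rate, per hour (0.02 per annum)
T = 8760               # proof-test interval, hours (1 year)

# Probability that a component has an undetected dangerous fault by the end of the interval
pfd_at_T = 1 - math.exp(-lam_du * T)

# Average PFD over the interval: exact integral of 1 - exp(-λt), and the usual approximation
pfd_avg_exact = 1 - (1 - math.exp(-lam_du * T)) / (lam_du * T)
pfd_avg_approx = lam_du * T / 2

print(f"PFD at end of interval: {pfd_at_T:.3e}")
print(f"PFDavg (exact):         {pfd_avg_exact:.3e}")
print(f"PFDavg (λDU·T/2):       {pfd_avg_approx:.3e}")

# Combining independent subsystems in series (any one failing defeats the safety function)
pfd_sensor, pfd_logic, pfd_valve = 1e-3, 1e-4, 1e-2   # assumed subsystem PFDavg values
pfd_function = 1 - (1 - pfd_sensor) * (1 - pfd_logic) * (1 - pfd_valve)
print(f"Function PFDavg ≈ {pfd_function:.2e}  (≈ sum of subsystem PFDs when all are small)")
```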

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.


The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased; λ90 will be no more than about 20% higher than λ70.
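As an illustration of how such estimates are obtained from the χ² distribution, the sketch below computes λ70 and λ90 from a recorded number of failures and accumulated service time. The use of 2(n + 1) degrees of freedom (time-terminated observation) and the device-hours are assumptions of this sketch, not values from the paper.

```python
from scipy.stats import chi2

def lambda_at_confidence(n_failures: int, device_hours: float, confidence: float) -> float:
    """Failure rate estimate at the given one-sided confidence level.
    Uses the chi-squared relation with 2(n + 1) degrees of freedom
    (time-terminated observation) -- an assumption of this sketch."""
    dof = 2 * (n_failures + 1)
    return chi2.ppf(confidence, dof) / (2 * device_hours)

device_hours = 5.0e6          # assumed accumulated service time for the example
for n in (3, 10, 30):
    l70 = lambda_at_confidence(n, device_hours, 0.70)
    l90 = lambda_at_confidence(n, device_hours, 0.90)
    print(f"{n:2d} failures: λ70 = {l70:.2e}/h, λ90 = {l90:.2e}/h, λ90/λ70 = {l90 / l70:.2f}")
# The λ90/λ70 ratio depends only on the number of failures recorded,
# not on the accumulated hours or on the rate itself.
```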

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.
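The following sketch is illustrative only: the reported rates are invented and this is not OREDA's multi-sample estimation procedure. It simply shows how a widely spread set of reported rates translates into a mean, a standard deviation and a 90% interval from the 5th to the 95th percentile.

```python
import numpy as np

# Invented reported failure rates (failures per 10^6 hours) -- not OREDA data.
reported_rates = np.array([0.4, 0.7, 1.1, 1.8, 2.3, 3.0, 4.6, 7.9, 12.0, 21.0])

mean = reported_rates.mean()
std = reported_rates.std(ddof=1)
p5, p95 = np.percentile(reported_rates, [5, 95])

print(f"mean = {mean:.1f}, std dev = {std:.1f} (per 10^6 hours)")
print(f"90% interval: {p5:.1f} to {p95:.1f} -- a spread of more than an order of magnitude")
```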

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range of 12% to 15%.
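A minimal numeric check of the two terms in the approximation above, using the example values already quoted (λDU = 0.02 pa, T = 1 year); the two β values are assumed for illustration.

```python
# Minimal check of the two terms in the 1oo2 approximation above.
# λDU = 0.02 per annum (≈ 2.3 failures per 10^6 hours), T = 1 year; β values are assumed.
lam_du, T = 0.02, 1.0
for beta in (0.02, 0.12):
    independent_term = (1 - beta) * (lam_du * T) ** 2 / 3
    common_cause_term = beta * lam_du * T / 2
    pfd = independent_term + common_cause_term
    print(f"β = {beta:.0%}: independent ≈ {independent_term:.1e}, "
          f"common cause ≈ {common_cause_term:.1e}, PFDavg ≈ {pfd:.1e}")
# Even at β = 2% the common cause term is comparable with the independent term;
# at β = 12% it dominates the PFD of the redundant pair.
```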


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with, at best, one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture) using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.
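The sketch below compares 1oo1 and 1oo2 final elements against the low-demand risk reduction targets discussed above. It is illustrative only; the values λDU = 0.02 pa, an annual proof test and β = 10% are assumptions carried over from the earlier example.

```python
# Illustrative comparison of final-element architectures against low-demand RRF targets.
# Assumed values: λDU = 0.02 per annum, annual proof test, β = 10% for the 1oo2 pair.
lam_du, T, beta = 0.02, 1.0, 0.10

pfd_1oo1 = lam_du * T / 2
pfd_1oo2 = (1 - beta) * (lam_du * T) ** 2 / 3 + beta * lam_du * T / 2

for name, pfd in (("1oo1", pfd_1oo1), ("1oo2", pfd_1oo2)):
    print(f"{name}: PFDavg ≈ {pfd:.1e}, RRF ≈ {1 / pfd:.0f}")
# SIL 2 needs RRF ≥ 100 and SIL 3 needs RRF ≥ 1000 (low demand), so the single valve is
# marginal for SIL 2 and the 1oo2 pair falls short of SIL 3 unless β, λDU or T is reduced.
```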

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
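As a hedged illustration of that kind of follow-up check (the population, period, observed count and the Poisson test are assumptions of this sketch, not prescriptions from the SINTEF report):

```python
from scipy.stats import poisson

# Compare observed dangerous undetected failures against the design benchmark.
lam_du_target = 0.02      # assumed benchmark failure rate, per device-year
devices = 40              # assumed population of similar final elements
years = 3.0               # observation period
expected = lam_du_target * devices * years   # expected number of failures (2.4)

observed = 6              # assumed number of DU failures found by proof testing

# Probability of seeing at least this many failures if the benchmark rate were being achieved
p_at_least = 1 - poisson.cdf(observed - 1, expected)
print(f"expected {expected:.1f}, observed {observed}, P(>= observed) = {p_at_least:.3f}")
if p_at_least < 0.05:
    print("Observed failures exceed the benchmark: analyse causes and take remedial action")
```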


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
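A simple illustration of a leading indicator derived from measured condition; the monitored parameter, the threshold and the data are hypothetical and would in practice come from an FMEDA and from the plant's own test records.

```python
# Hypothetical leading indicator: valve stroke time trending upwards can signal
# actuator or valve deterioration before a dangerous failure occurs.
stroke_times_s = [2.1, 2.2, 2.1, 2.3, 2.6, 3.0, 3.4]   # assumed partial-stroke test results
baseline_s = 2.1
alarm_ratio = 1.3    # assumed threshold: flag at 30% degradation from baseline

for i, t in enumerate(stroke_times_s, start=1):
    if t / baseline_s >= alarm_ratio:
        print(f"Test {i}: stroke time {t:.1f} s exceeds {alarm_ratio:.0%} of baseline -- "
              "schedule overhaul or renewal before the next demand")
        break
else:
    print("No incipient failure indicated")
```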

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.
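To indicate the scale of that out-of-service contribution, a small sketch with assumed numbers (not taken from the paper): the unavailability added by restoration and bypass time is approximately λDD × MTTR plus the planned bypass hours as a fraction of the year.

```python
# Assumed example values -- illustrative of why MTTR and bypass time matter.
lam_dd = 0.05 / 8760      # detected dangerous failure rate, per hour (0.05 per annum)
mttr_hours = 72           # mean time to restoration once a failure is detected
planned_bypass_hours = 8  # assumed on-line test bypass time per year

unavailability_repair = lam_dd * mttr_hours
unavailability_bypass = planned_bypass_hours / 8760
print(f"From repair time: {unavailability_repair:.1e}")
print(f"From bypass time: {unavailability_bypass:.1e}")
# Both contributions add directly to the PFD achieved in operation, so accessibility
# and fast restoration protect the risk reduction assumed in the design calculations.
```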

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • l90 compared with l70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported l
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for 120582 119863119880
  • Typical feasible values for 120573
  • Typical values for 119879 depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paperpdf
    • PREVENT PREVENTABLE FAILURES
Page 37: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

Suitability of componentsCertification and prior use are also importantbull Evidence of quality ie systematic capabilitybull Evidence of suitability for service and environment

‒ preventing systematic failuresbull Volume of operating experience provides evidence that

‒ systematic capability is adequate‒ inspection and maintenance is effective‒ failures are monitored analysed and understood‒ failure rate performance benchmarks are achievable

(Refer to IEC 61511-1 sect1153 and IEC 61508-2 sect74433)

Emphasis on prevention instead of calculationPrevent systematic failures through quality

bull systematic capability proven in usebull deliberate quality planning how much quality is enoughbull must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce 120640120640119915119915119915119915bull avoid systematic failures and common cause failures by designbull enable deterioration to be detected (diagnostics inspection test)bull enable equipment to be repaired or renewed before it failsbull ensure testability and accessibility

Monitor and control failure performance in operationbull reduce failure rates that exceed benchmarks

39

Credible reliability data

Performance targets must be feasible by design

equiv feasible performance targets

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 1

PREVENT PREVENTABLE FAILURES

M Generowicz FS Senior Expert (TUumlV Rheinland 18312) Engineering Manager A Hertel Principal Engineer

IampE Systems Pty Ltd

Perth Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions The new requirements are based on the IEC 61508-2 lsquoRoute 2Hrsquo method which relies on increased confidence levels in failure rates

Instead of requiring increased confidence level IEC 61511-1 requires lsquocredible and traceable reliability datarsquo in the quantification of failure probability It is not immediately obvious what this means in practice

Failure probability calculations are based on the assumption that failure rates are fixed and constant That assumption is invalid but the calculations are still useful

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates It draws conclusions on how failure rates and failure probability can be controlled in practice

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities The calculations of risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant

Over the past several decades enough information has been collected to estimate failure rates for all commonly used components in safety functions The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications The variation depends largely on the service operating environment and maintenance practices It is clear that

bull Failures are almost never purely random and as a result

bull Failure rates are not fixed and constant

At best the risk reduction achieved by these safety functions can be estimated only within an order of magnitude Nevertheless even with such imprecise results the calculations are still useful

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions This is important in setting operational reliability benchmarks

Failure rates measured from a facilityrsquos maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices

Most failures in safety function components (including software) are predictable preventable or avoidable to some degree suggesting that many failures are mostly systematic in nature rather than

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 2

purely random Therefore safety function reliability performance can be improved through four key strategies

1 Eliminating systematic and common-cause failures throughout the design development and implementation processes and throughout operation and maintenance practices

2 Designing the equipment to allow access to enable sufficiently frequent inspection testing and maintenance and to enable suitable test coverage

3 Using risk-based inspection and condition-based maintenance techniques to

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4 Detailed analysis of all failures to

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction Functional safety maintains safety integrity of assets in two ways

Systematic safety integrity deals with preventable failures These are failures resulting from errors and shortcomings in the design manufacture installation operation maintenance and modification of the safeguarding systems

Hardware safety integrity deals with controlling random hardware failures These are the failures that occur at a reasonably constant rate and are completely independent of each other They are not preventable and cannot be avoided or eliminated but the probability of these failures occurring (and the resulting risk reduction) can be calculated

Calculation methods

IEC 61508-62010 Annex B[4] provides basic guidance on evaluating probabilities of failure State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO 12489[5]

Several other useful references are available on this subject including ISA-TR840002-2015[6] and the SINTEF PDS Method Handbook[7] These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved The calculations are all based on these assumptions

bull Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates

bull Failures occurring within a population of devices are assumed to be independent events

ISOTR 12489 uses the term lsquocatalectic failurersquo to clarify the types of failures that can be expected to occur at fixed rates

329 catalectic failure sudden and complete failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 3

Note 1 to entry This term has been introduced by analogy with the catalectic verses (ie a verse with seven foots instead of eight) which stop abruptly Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item It is the contrary of failures occurring progressively and incompletely

Note 2 to entry Catalectic failures characterize simple components with constant failure rates (exponential law) they remain permanently ldquoas good as newrdquo until they fail suddenly completely and without warning Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (eg Markovian approach)

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components then undetected failures will accumulate exponentially The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions The required level of redundancy increases with the risk reduction required

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards

IEC 61508 provides two strategies for minimising the required hardware fault tolerance

bull Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (lsquoRoute 1Hrsquo)

bull Increasing the confidence level in the measured failure rates to at least 90 (lsquoRoute 2Hrsquo)

A confidence level of 90 effectively means that there is only a 10 chance that the true average failure rate is greater than the estimated value

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H though it requires only a confidence level of 70 However IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment

Failure rate confidence level

If all of the failures for a given type of equipment are recorded the failure rate λ can be estimated with any required level of confidence by applying a χsup2 (chi-squared) distribution The failure rate estimated with confidence level of lsquoarsquo is designated λa The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 4

The ratio of λ90 to λ70 depends only on the number of failures recorded It does not depend directly on the failure rate itself or on the population size The width of the uncertainty band becomes narrower with each recorded failure A good estimate for λ can be obtained with as few as 3 failures If 10 or more failures have been recorded the overall confidence is increased λ90 will be no more than about 20 higher than λ70

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure

The tables also show the upper and lower limits of a 90 uncertainty interval for the reported failure rates This is the band stretching from the 5 certainty level to the 95 certainty level The certainty level is not the same as the confidence level relating to a single dataset but the intent is similar

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions of random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard.'

ISO/TR 12489:2013[5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical of electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.


The definitions of random failure in ISA-TR84.00.02-2015[6], IEC 61508-4:2010[3] and IEC 61511-1:2016[1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware.'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy.'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott[13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.


The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1-out-of-2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8][11][12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β)·(λDU·T)²/3 + β·λDU·T/2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU·T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.
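A minimal sketch of this comparison, using the simplified formula above with the illustrative values already quoted (β = 10%, λDU = 0.02 pa, T = 1 year), shows how strongly the common cause term dominates:

```python
# Sketch: the two terms of the simplified 1oo2 PFD formula quoted above.
# beta, lambda_du and T are the illustrative values from the text, not
# recommended values for any particular application.

def pfd_avg_1oo2(beta: float, lambda_du: float, test_interval: float):
    """Return (independent term, common cause term) of the average PFD."""
    independent = (1.0 - beta) * (lambda_du * test_interval) ** 2 / 3.0
    common_cause = beta * lambda_du * test_interval / 2.0
    return independent, common_cause

beta = 0.10          # assumed common cause fraction (10 %)
lambda_du = 0.02     # dangerous undetected failures per annum
T = 1.0              # proof test interval in years

independent, common_cause = pfd_avg_1oo2(beta, lambda_du, T)
print(f"independent term  = {independent:.1e}")    # ~1.2e-4
print(f"common cause term = {common_cause:.1e}")   # ~1.0e-3, dominates
print(f"PFDavg            = {independent + common_cause:.1e}")
```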

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.


Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E. P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and that the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of the failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report a calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants; they are all under the control of the designers and the operations and maintenance team.
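As a rough illustration of how such targets relate to architecture, the sketch below evaluates the simplified 1oo1 and 1oo2 formulas for the typical failure rates quoted above and maps the resulting PFD onto the standard low-demand SIL bands. All input values are indicative assumptions rather than recommendations.

```python
# Sketch: feasibility check of 1oo1 vs 1oo2 final-element architectures
# against the low-demand SIL bands, using the simplified formulas above.

def pfd_1oo1(lambda_du: float, T: float) -> float:
    return lambda_du * T / 2.0

def pfd_1oo2(lambda_du: float, T: float, beta: float) -> float:
    return (1.0 - beta) * (lambda_du * T) ** 2 / 3.0 + beta * lambda_du * T / 2.0

def sil_band(pfd: float) -> str:
    if pfd >= 1e-1: return "below SIL 1"
    if pfd >= 1e-2: return "SIL 1"
    if pfd >= 1e-3: return "SIL 2"
    if pfd >= 1e-4: return "SIL 3"
    return "SIL 4"

T, beta = 1.0, 0.10                      # assumed proof test interval and beta-factor
for name, lam in (("actuated valve", 0.02), ("contactor", 0.01)):
    p1, p2 = pfd_1oo1(lam, T), pfd_1oo2(lam, T, beta)
    print(f"{name:>14}: 1oo1 PFD = {p1:.1e} ({sil_band(p1)}), 1oo2 PFD = {p2:.1e} ({sil_band(p2)})")
```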

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether the reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase'[9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
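One simple way to implement such a benchmark is sketched below: the failure rate assumed in the design is converted into an expected number of failures for the installed population, and a Poisson test flags when the observed count is no longer consistent with that assumption. The population size, observation period and the 5% threshold are illustrative assumptions only.

```python
# Sketch: comparing failures observed in operation against the failure rate
# assumed in the PFD calculation, using a simple one-sided Poisson test.
from scipy.stats import poisson

lambda_assumed = 0.02          # dangerous undetected failures per device per annum (design assumption)
population = 40                # number of similar devices in service (assumed)
years_in_service = 3.0         # observation period (assumed)
observed_failures = 6          # failures recorded and confirmed by root cause analysis

expected = lambda_assumed * population * years_in_service
# Probability of seeing at least this many failures if the assumed rate were correct
p_value = poisson.sf(observed_failures - 1, expected)

print(f"expected ~ {expected:.1f} failures, observed {observed_failures}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Failure rate exceeds the design benchmark: analyse causes and consider compensating measures.")
```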


Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates for some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of the process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as the mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision of the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent them.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.


Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.


References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] 'OREDA Offshore and Onshore Reliability Data Handbook', Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] 'Safety Equipment Reliability Handbook', 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990


Emphasis on prevention instead of calculation

Prevent systematic failures through quality:
• systematic capability, proven in use
• deliberate quality planning: how much quality is enough?
• must always use stricter quality for SIL 3 functions

Design SIL 2 and SIL 3 functions to reduce λDU:
• avoid systematic failures and common cause failures by design
• enable deterioration to be detected (diagnostics, inspection, test)
• enable equipment to be repaired or renewed before it fails
• ensure testability and accessibility

Monitor and control failure performance in operation:
• reduce failure rates that exceed benchmarks

Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design

Presented at the 13th International TÜV Rheinland Symposium, May 15-16, 2018, Cologne, Germany

PREVENT PREVENTABLE FAILURES

M. Generowicz, FS Senior Expert (TÜV Rheinland 18312), Engineering Manager
A. Hertel, Principal Engineer

I&E Systems Pty Ltd

Perth, Western Australia

Abstract

IEC 61511-1 Edition 2 (2016) has reduced the requirements for hardware fault tolerance in automated safety functions. The new requirements are based on the IEC 61508-2 'Route 2H' method, which relies on increased confidence levels in failure rates.

Instead of requiring an increased confidence level, IEC 61511-1 requires 'credible and traceable reliability data' in the quantification of failure probability. It is not immediately obvious what this means in practice.

Failure probability calculations are based on the assumption that failure rates are fixed and constant. That assumption is invalid, but the calculations are still useful.

This paper discusses the reasons for the wide variability and uncertainty in measured failure rates. It draws conclusions on how failure rates and failure probability can be controlled in practice.

Summary

Automated safety functions are applied to achieve hazard risk reduction at industrial process facilities. The calculations of the risk reduction achieved have been based on failures being predominantly random in nature and failure rates being fixed and constant.

Over the past several decades, enough information has been collected to estimate failure rates for all commonly used components in safety functions. The information shows that failure rates measured for any particular type of device vary by at least an order of magnitude between different users and different applications. The variation depends largely on the service, the operating environment and maintenance practices. It is clear that:

• Failures are almost never purely random, and as a result

• Failure rates are not fixed and constant.

At best, the risk reduction achieved by these safety functions can be estimated only within an order of magnitude. Nevertheless, even with such imprecise results the calculations are still useful.

Failure rates from industry databases are useful in demonstrating the feasibility of the risk reduction targeted by safety functions. This is important in setting operational reliability benchmarks.

Failure rates measured from a facility's maintenance data are useful in demonstrating the risk reduction that a safety function can achieve for a given operating service environment and set of maintenance practices.

Most failures in safety function components (including software) are predictable, preventable or avoidable to some degree, suggesting that many failures are mostly systematic in nature rather than purely random. Therefore safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented, if failure rates need to be reduced.

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains the safety integrity of assets in two ways.

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B[4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489[5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015[6] and the SINTEF PDS Method Handbook[7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure


Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then, a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFDs of the individual components.
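A minimal sketch of this basic calculation is given below: undetected dangerous failures accumulate as 1 − exp(−λDU·t) between proof tests, the average over the test interval gives the subsystem PFD (approximately λDU·T/2 for small values), and the PFDs of independent subsystems in series are summed. The λDU values and the test interval are illustrative assumptions only.

```python
# Sketch: accumulation of undetected dangerous failures between proof tests,
# and combination of independent subsystem PFDs into an overall function PFD.
import math

def pfd_at_time(lambda_du: float, t: float) -> float:
    """Probability that a component chosen at random has already failed by time t."""
    return 1.0 - math.exp(-lambda_du * t)

def pfd_average(lambda_du: float, test_interval: float) -> float:
    """Average PFD over a proof test interval (~ lambda_du * T / 2 for small values)."""
    return 1.0 - (1.0 - math.exp(-lambda_du * test_interval)) / (lambda_du * test_interval)

T = 1.0                                                  # proof test interval in years (assumed)
subsystems = {"sensor": 0.002, "logic solver": 0.0005, "final element": 0.02}  # lambda_DU per annum (assumed)

pfd_total = sum(pfd_average(lam, T) for lam in subsystems.values())
print(f"final element PFD just before proof test ~ {pfd_at_time(0.02, T):.1e}")
print(f"overall function PFDavg ~ {pfd_total:.1e}")      # dominated by the final element
```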

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability For this reason the standards specify a minimum level of fault tolerance (ie redundancy) in the architectural design of the safety functions The required level of redundancy increases with the risk reduction required

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions Redundancy also increases the likelihood of inadvertent or spurious action which in itself may lead to increased risk of hazards

IEC 61508 provides two strategies for minimising the required hardware fault tolerance

bull Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected (lsquoRoute 1Hrsquo)

bull Increasing the confidence level in the measured failure rates to at least 90 (lsquoRoute 2Hrsquo)

A confidence level of 90 effectively means that there is only a 10 chance that the true average failure rate is greater than the estimated value

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H though it requires only a confidence level of 70 However IEC 61511 also requires documentation showing that the failure rates are credible based on field feedback from a similar operating environment

Failure rate confidence level

If all of the failures for a given type of equipment are recorded the failure rate λ can be estimated with any required level of confidence by applying a χsup2 (chi-squared) distribution The failure rate estimated with confidence level of lsquoarsquo is designated λa The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 4

The ratio of λ90 to λ70 depends only on the number of failures recorded It does not depend directly on the failure rate itself or on the population size The width of the uncertainty band becomes narrower with each recorded failure A good estimate for λ can be obtained with as few as 3 failures If 10 or more failures have been recorded the overall confidence is increased λ90 will be no more than about 20 higher than λ70

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure

The tables also show the upper and lower limits of a 90 uncertainty interval for the reported failure rates This is the band stretching from the 5 certainty level to the 95 certainty level The certainty level is not the same as the confidence level relating to a single dataset but the intent is similar

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications Some users consistently achieve failure rates at least 10 times lower than other users The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design operation and maintenance

One reason for the variability in rates is that these datasets include all failures systematic failures as well as random failures

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure The definitions for random failure vary between the different standards and references but they are generally consistent with the dictionary definitions of the word lsquorandomrsquo

lsquoMade done or happening without method or conscious decision haphazardrsquo

ISOTR 124892013[5] Annex B explains that both hardware failures and human failures can occur at random It makes it clear that not all random failures occur at a constant rate Constant failure rates are typical in electronic components before they reach the end of their useful life Random failures of mechanical components are caused by deterioration due to age and wear but the failure rates are not constant

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 5

The definitions of random failure in ISA TR840002-2015[6] IEC 61508-42010[3] and IEC 61511-12016[1] are all similar

lsquohellipfailure occurring at a random time which results from one or more of the possible degradation mechanisms in the hardwarersquo

The mathematical analysis of failure probability is based on the concept of a lsquorandom processrsquo or lsquostochastic processrsquo In this context the usage of the word random is narrower For the mathematical analysis these standards all assume that

lsquofailure rates arising from random hardware failures can be predicted with reasonable accuracyrsquo

Failures that occur at a fixed constant rate are purely random but in practice only a small proportion of random failures are purely random The following definition of a purely random process is from lsquoA Dictionary of Statistical Termsrsquo by FHC Marriott[13]

lsquoThe simplest example of a stationary process where in discrete time all the random variables z are mutually independent In continuous time the process is sometimes referred to as ldquowhite noiserdquo relating to the energy characteristics of certain physical phenomenarsquo

The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random

By contrast the standards note that systematic failures cannot be characterised by a fixed rate though to some extent the probability of systematic faults existing in a system may be estimated The standards are reasonably consistent in their definition of systematic failure

lsquohellipfailure related in a deterministic way to a certain cause or pre-existing fault Systematic failures can be eliminated after being detected while random hardware failures cannotrsquo

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random Very few failures are purely random

Failures of mechanical components may seem to be random but the causes are usually well known and understood and are partially deterministic The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed

Degradation of mechanical components is not a purely random process Degradation can be monitored

While it might be theoretically possible to prevent failure from degradation it is simply not practicable to prevent all failures Inspection and maintenance can never be perfect It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance overhaul and renewal They are characterised by a constant failure rate even though that rate is not fixed

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 6

The reasons for the wide variability in reported failure rates seem to be clear

bull Only a small proportion of failures occur at a fixed rate that cannot be changed

bull The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

bull The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

bull The rates of systematic failures vary widely from user to user depending on the effectiveness of the quality management practices

bull No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided the common cause failures of redundant devices are modelled assuming a common cause factor β This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time

Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors With fault tolerance in final elements the PFD is strongly dominated by common cause failures

As an example consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 002 failures per annum (approximately 1 failure in 50 years or around 23 failures per 106 hours) [8] [11] [12] The average PFD of the valve subsystem may be approximated by

119875119875119875119875119875119875119860119860119860119860119860119860 asymp (1 minus 120573120573)(120582120582119863119863119863119863119879119879)2

3+ 120573120573

1205821205821198631198631198631198631198791198792

The last term in this equation represents the contribution of common cause failures It will dominate the PFD unless the following is true

120573120573 ≪ 120582120582119863119863119863119863119879119879

If the test interval T is 1 year and λDU = 002 pa then the common cause failure term will be greater than the first term unless β lt 2 and such a low value is difficult to achieve

For the common cause failure term to be negligible the test interval T would usually have to be significantly longer than 1 year andor the β-factor would have to be much less than 2

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10 Typical values achieved are in the range 12 to 15

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7

Clearly the PFD for undetected failures is directly proportional to β λDU and T

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George EP Box

lsquoEssentially all models are wrong but some are usefulrsquo

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β Similarly we can conclude that although the models used for calculating PFD are wrong they are still useful

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant and the band of uncertainty spans more than one order of magnitude The statistics suggest that the failure rates of sensors are also not constant

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant It is misleading to report calculated PFD with more than one significant figure of precision

Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements

The example value of 002 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves For contactors or circuit breakers it is feasible to achieve failure rates lower than 001 pa

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (ie 1oo1 architecture) using either a valve or contactor

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture but the PFD may be marginal particularly if a valve is used as the final element If actuated valves are used attention will need to be given to minimising λDU Alternatively the PFD may be reduced by reducing the interval

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 8

between proof tests T If these parameters cannot be minimised it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance If the final elements are actuated valves then the β-factor andor λDU will need to be minimised to achieve even the minimum risk reduction of 1000

This will result in design requirements that improve independence (reducing β) and facilitate inspection testing and maintenance (reducing λDU and improving test coverage) If reducing β and λDU is not sufficient it may also be necessary to shorten the test interval T It is usually impractical to implement automatic continuous diagnostics on final elements

The end result of the PFD calculations is a set of operational performance targets for β λDU T and MTTR These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability Evidence is required to show that appropriate techniques and measure have been applied to prevent systematic failures This must include showing that the systems are suitable for the intended service and environment The level of effectiveness in these techniques and measures must be shown to be sufficient for the intended SIL

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design Evidence is also needed to show that operation inspection testing and maintenance of the systems will be effective in achieving the target failure performance

Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable

Measuring performance against benchmarks

IEC 61511-1 sect5253 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design sect1629 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour

The SINTEF Report A8788 Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase[9] provides useful guidance on the analysis of failures recorded during operation of a plant It suggests setting target values for the expected number of failures based on the failure rates assumed in the design If the actual measured failure rates are higher than the target it is necessary to analyse the causes of the failures Compensating measures to reduce the number of future failures must be considered The need for increased frequency of inspection and testing should also be considered but it is not sufficient to rely on increased frequency of testing alone

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 9

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment If the hardware failure rates are relatively low and the population of devices is small there may be too few failures to allow a rate to be measured

Detailed inspection testing and condition monitoring can then be used to provide leading indicators of incipient failure Most of the modes of failure should be predictable It should be feasible to prevent failures through condition based maintenance with overhaul renewal or replacement when required

A FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration The likelihood of failure depends on the condition of the components

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost The equipment is not designed to facilitate accessibility for testing or for maintenance

For example on LNG compression trains access to the equipment is often constrained Some critical final elements can only be taken out of service at intervals of 5 years or more Even if the designers have provided facilities to enable on-line condition monitoring and testing the opportunities for corrective maintenance are severely constrained by the need to maintain production Deteriorating equipment must remain in service until the next planned shutdown

It is common practice to install dutystandby pairs of pumps and motors where it is critical to maintain production The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose It facilitates on-line testing of sensors It is not common practice to provide dutystandby pairing for safety function final elements but it is possible Dutystandby service can be achieved at by using 2 x 1oo1 or 2oo4 architectures The justification for the additional cost depends on the value of process downtime that can be avoided

If dutystandby pairing is not provided for critical final elements then accessibility for on-line inspection testing and maintenance must be considered in the design A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration MTTR)

Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation

Preventing systematic failures

Systematic failures cannot be characterised by failure rate The probability of systematic faults existing within a system cannot be quantified with precision But by definition systematic failures can be eliminated after being detected The implication of this is that systematic failures can be prevented

Due diligence must be demonstrated in preventing systematic failures as far as is practicable in proportion to the target level of risk reduction Plant owners need to satisfy themselves that

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 10

appropriate processes techniques methods and measures have been applied with sufficient effectiveness to eliminate systematic failures The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures and must measure and monitor the effectiveness of those steps

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures The standards describe processes techniques methods and measures to prevent avoid and detect systematic faults and resulting failures

Activities techniques measures and procedures can be selected to detect or to prevent faults and failures IEC 61511-1 sect623 requires planning of activities criteria techniques measures and procedures throughout the safety system lifecycle The rationale needs to be recorded

Preventing lsquorandomrsquo failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random Most failures that are usually classed as random are actually preventable to some extent This includes all common cause failures

Conclusions Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates The precision in the estimates is limited by the uncertainty in the failure rates OREDA statistics clearly show that the failure rates vary over at least an order of magnitude The rates are not fixed and constant The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation inspection maintenance and renewal

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures Most failures fall somewhere in the middle between the two extremes of purely random and purely deterministic Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure In practice failures are not completely preventable because access to the equipment and resources are limited Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.
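As a minimal illustration of such a benchmark, the sketch below evaluates the 1oo2 final-element approximation used in this paper, PFDavg ≈ (1 − β)(λDU·T)²/3 + β·λDU·T/2, with values consistent with the ranges discussed in the paper (λDU ≈ 0.02 per annum, β ≈ 10%, T = 1 year). The result is an order-of-magnitude figure only.

    # Minimal sketch (Python): order-of-magnitude PFD and RRF benchmark for a
    # 1oo2 final-element pair, using the approximation discussed in this paper:
    #   PFDavg ~ (1 - beta) * (lambda_DU * T)**2 / 3  +  beta * lambda_DU * T / 2
    # Parameter values are illustrative figures consistent with the paper's ranges.

    def pfd_avg_1oo2(lambda_du_per_yr, test_interval_yr, beta):
        independent = (1.0 - beta) * (lambda_du_per_yr * test_interval_yr) ** 2 / 3.0
        common_cause = beta * lambda_du_per_yr * test_interval_yr / 2.0
        return independent + common_cause

    pfd = pfd_avg_1oo2(lambda_du_per_yr=0.02, test_interval_yr=1.0, beta=0.10)
    print(f"PFDavg ~ {pfd:.1e}, RRF ~ {1.0 / pfd:.0f}")
    # The common cause term (beta * lambda_DU * T / 2) dominates the result,
    # which is why beta, lambda_DU and T become the operational benchmarks.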

Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure. A simple way of making this comparison is sketched below.
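The sketch below, which assumes SciPy is available, estimates a conservative failure rate from the recorded failures using the χ² distribution (the 70% confidence approach referred to in this paper) and compares it with the benchmark assumed in the PFD calculation. The failure counts and service hours are invented for illustration only.

    # Minimal sketch (Python, assumes SciPy): compare a measured failure rate
    # against the benchmark assumed in the PFD calculation. The one-sided upper
    # confidence bound on a constant failure rate, given n failures in a total
    # in-service time t, is taken as  chi2.ppf(confidence, 2*n + 2) / (2*t).
    # All input numbers below are invented for illustration only.
    from scipy.stats import chi2

    def lambda_upper(n_failures, total_time_h, confidence=0.70):
        return chi2.ppf(confidence, 2 * n_failures + 2) / (2.0 * total_time_h)

    n_failures = 3                       # dangerous undetected failures recorded
    total_time_h = 200 * 8760.0          # e.g. 50 valves x 4 years in service
    benchmark_per_yr = 0.02              # lambda_DU assumed in the design

    measured_per_yr = lambda_upper(n_failures, total_time_h) * 8760.0
    print(f"lambda_70 ~ {measured_per_yr:.3f} /yr (benchmark {benchmark_per_yr} /yr)")
    if measured_per_yr > benchmark_per_yr:
        print("Above benchmark: analyse causes and plan remedial action.")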

References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990


Credible reliability data ≡ feasible performance targets

Performance targets must be feasible by design



3+ 120573120573

1205821205821198631198631198631198631198791198792

The last term in this equation represents the contribution of common cause failures It will dominate the PFD unless the following is true

120573120573 ≪ 120582120582119863119863119863119863119879119879

If the test interval T is 1 year and λDU = 002 pa then the common cause failure term will be greater than the first term unless β lt 2 and such a low value is difficult to achieve

For the common cause failure term to be negligible the test interval T would usually have to be significantly longer than 1 year andor the β-factor would have to be much less than 2

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10 Typical values achieved are in the range 12 to 15

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7

Clearly the PFD for undetected failures is directly proportional to β λDU and T

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George EP Box

lsquoEssentially all models are wrong but some are usefulrsquo

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β Similarly we can conclude that although the models used for calculating PFD are wrong they are still useful

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant and the band of uncertainty spans more than one order of magnitude The statistics suggest that the failure rates of sensors are also not constant

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant It is misleading to report calculated PFD with more than one significant figure of precision

Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements

The example value of 002 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves For contactors or circuit breakers it is feasible to achieve failure rates lower than 001 pa

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (ie 1oo1 architecture) using either a valve or contactor

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture but the PFD may be marginal particularly if a valve is used as the final element If actuated valves are used attention will need to be given to minimising λDU Alternatively the PFD may be reduced by reducing the interval

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 8

between proof tests T If these parameters cannot be minimised it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance If the final elements are actuated valves then the β-factor andor λDU will need to be minimised to achieve even the minimum risk reduction of 1000

This will result in design requirements that improve independence (reducing β) and facilitate inspection testing and maintenance (reducing λDU and improving test coverage) If reducing β and λDU is not sufficient it may also be necessary to shorten the test interval T It is usually impractical to implement automatic continuous diagnostics on final elements

The end result of the PFD calculations is a set of operational performance targets for β λDU T and MTTR These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability Evidence is required to show that appropriate techniques and measure have been applied to prevent systematic failures This must include showing that the systems are suitable for the intended service and environment The level of effectiveness in these techniques and measures must be shown to be sufficient for the intended SIL

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design Evidence is also needed to show that operation inspection testing and maintenance of the systems will be effective in achieving the target failure performance

Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable

Measuring performance against benchmarks

IEC 61511-1 sect5253 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design sect1629 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour

The SINTEF Report A8788 Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase[9] provides useful guidance on the analysis of failures recorded during operation of a plant It suggests setting target values for the expected number of failures based on the failure rates assumed in the design If the actual measured failure rates are higher than the target it is necessary to analyse the causes of the failures Compensating measures to reduce the number of future failures must be considered The need for increased frequency of inspection and testing should also be considered but it is not sufficient to rely on increased frequency of testing alone

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 9

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment If the hardware failure rates are relatively low and the population of devices is small there may be too few failures to allow a rate to be measured

Detailed inspection testing and condition monitoring can then be used to provide leading indicators of incipient failure Most of the modes of failure should be predictable It should be feasible to prevent failures through condition based maintenance with overhaul renewal or replacement when required

A FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration The likelihood of failure depends on the condition of the components

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost The equipment is not designed to facilitate accessibility for testing or for maintenance

For example on LNG compression trains access to the equipment is often constrained Some critical final elements can only be taken out of service at intervals of 5 years or more Even if the designers have provided facilities to enable on-line condition monitoring and testing the opportunities for corrective maintenance are severely constrained by the need to maintain production Deteriorating equipment must remain in service until the next planned shutdown

It is common practice to install dutystandby pairs of pumps and motors where it is critical to maintain production The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose It facilitates on-line testing of sensors It is not common practice to provide dutystandby pairing for safety function final elements but it is possible Dutystandby service can be achieved at by using 2 x 1oo1 or 2oo4 architectures The justification for the additional cost depends on the value of process downtime that can be avoided

If dutystandby pairing is not provided for critical final elements then accessibility for on-line inspection testing and maintenance must be considered in the design A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration MTTR)

Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation

Preventing systematic failures

Systematic failures cannot be characterised by failure rate The probability of systematic faults existing within a system cannot be quantified with precision But by definition systematic failures can be eliminated after being detected The implication of this is that systematic failures can be prevented

Due diligence must be demonstrated in preventing systematic failures as far as is practicable in proportion to the target level of risk reduction Plant owners need to satisfy themselves that

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 10

appropriate processes techniques methods and measures have been applied with sufficient effectiveness to eliminate systematic failures The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures and must measure and monitor the effectiveness of those steps

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures The standards describe processes techniques methods and measures to prevent avoid and detect systematic faults and resulting failures

Activities techniques measures and procedures can be selected to detect or to prevent faults and failures IEC 61511-1 sect623 requires planning of activities criteria techniques measures and procedures throughout the safety system lifecycle The rationale needs to be recorded

Preventing lsquorandomrsquo failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random Most failures that are usually classed as random are actually preventable to some extent This includes all common cause failures

Conclusions Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates The precision in the estimates is limited by the uncertainty in the failure rates OREDA statistics clearly show that the failure rates vary over at least an order of magnitude The rates are not fixed and constant The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation inspection maintenance and renewal

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures Most failures fall somewhere in the middle between the two extremes of purely random and purely deterministic Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure In practice failures are not completely preventable because access to the equipment and resources are limited Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant These assumptions enable the PFD and RRF to be estimated within an order of magnitude given reasonably effective quality control in design manufacture operation and maintenance The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 11

Strategies for improving risk reduction

1 The first priority in functional safety is to eliminate prevent avoid or control systematic failures throughout the entire system lifecycle Failures are minimised by applying conventional quality management and project management practices This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function It includes the achievement of an appropriate level of systematic capability

The level of attention to detail and the effectiveness of the processes techniques and measures must be in proportion to the target level of risk reduction SIL 3 functions need much stricter quality control than SIL 1 functions

2 The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics accessibility inspection testing maintenance and renewal

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction The requirements also depend on the cost of downtime To achieve SIL 3 safety functions will always need to be designed to enable ready access for inspection testing and maintenance

The planning for inspection and testing should be in proportion to the target level of risk reduction Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment

3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions Common cause failures dominate the PFD in all voted architectures of sensors and of final elements

4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future

If the measured failure rates are higher than the target benchmark then the reasons need to be understood and remedial action taken Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 12

References

[1] lsquoFunctional safety ndash Safety instrumented systems for the process industry sector ndash Part 1 Framework definitions system hardware and application programming requirementsrsquo IEC 61511-12016

[2] lsquoEquipment reliability testing Part 6 Tests for the validity and estimation of the constant failure rate and constant failure intensityrsquo IEC 60605-62007

[3] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems ndash Part 4 Definitions and abbreviationsrsquo IEC 61508-42010

[4] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems - Guidelines on the application of IEC 61508-2 and IEC 61508-3rsquo IEC 61508-62010

[5] lsquoPetroleum petrochemical and natural gas industries mdash Reliability modelling and calculation of safety systemsrsquo ISOTR 12489

[6] lsquoSafety Integrity Level (SIL) Verification of Safety Instrumented Functionsrsquo ISA-TR840002-2015

[7] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Method Handbookrsquo SINTEF Trondheim Norway Report A24442 2013

[8] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Data Handbookrsquo SINTEF Trondheim Norway Report A24443 2013

[9] S Hauge and M A Lundteigen lsquoGuidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phasersquo SINTEF Trondheim Norway Report A8788 2008

[10] S Hauge et al lsquoCommon Cause Failures in Safety Instrumented Systemsrsquo SINTEF Trondheim Norway Report A26922 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook Vol 1 6th ed SINTEF Technology and Society Department of Safety Research Trondheim Norway 2015

[12] Safety Equipment Reliability Handbook 4th ed exidacom LLC Sellersville PA 2015

[13] FHC Marriott lsquoA Dictionary of Statistical Termsrsquo 5th Edition International Statistical Institute Longman Scientific and Technical 1990

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • l90 compared with l70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported l
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for 120582 119863119880
  • Typical feasible values for 120573
  • Typical values for 119879 depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paperpdf
    • PREVENT PREVENTABLE FAILURES
Page 41: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

Most failures are not purely random. Therefore, safety function reliability performance can be improved through four key strategies:

1. Eliminating systematic and common-cause failures throughout the design, development and implementation processes, and throughout operation and maintenance practices.

2. Designing the equipment to allow access, to enable sufficiently frequent inspection, testing and maintenance, and to enable suitable test coverage.

3. Using risk-based inspection and condition-based maintenance techniques to:

- Identify and then control conditions that induce early failures

- Actively prevent common-cause failures

4. Detailed analysis of all failures to:

- Monitor failure rates

- Determine how failure recurrence may be prevented if failure rates need to be reduced

Functional safety depends on safety integrity

The fundamental objective of the functional safety standards is to ensure that automated safety systems reliably achieve specified levels of risk reduction. Functional safety maintains safety integrity of assets in two ways:

Systematic safety integrity deals with preventable failures. These are failures resulting from errors and shortcomings in the design, manufacture, installation, operation, maintenance and modification of the safeguarding systems.

Hardware safety integrity deals with controlling random hardware failures. These are the failures that occur at a reasonably constant rate and are completely independent of each other. They are not preventable and cannot be avoided or eliminated, but the probability of these failures occurring (and the resulting risk reduction) can be calculated.

Calculation methods

IEC 61508-6:2010 Annex B [4] provides basic guidance on evaluating probabilities of failure. State-of-the-art methods for reliability calculations are described in more detail in the Technical Report ISO/TR 12489 [5].

Several other useful references are available on this subject, including ISA-TR84.00.02-2015 [6] and the SINTEF PDS Method Handbook [7]. These calculation methods enable users to estimate the PFD for safety functions and the corresponding RRF achieved. The calculations are all based on these assumptions:

• Dangerous undetected failures of the devices in a safety function are characterised by fixed and constant failure rates.

• Failures occurring within a population of devices are assumed to be independent events.

ISO/TR 12489 uses the term 'catalectic failure' to clarify the types of failures that can be expected to occur at fixed rates:

3.2.9 catalectic failure: sudden and complete failure

Note 1 to entry: This term has been introduced by analogy with the catalectic verses (i.e. a verse with seven foots instead of eight) which stop abruptly. Then a catalectic failure occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual component of the system under study (e.g. Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
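
As a brief illustration of these assumptions (the function names and numerical values below are illustrative only, not taken from any of the references), the probability that a single component is carrying an undetected dangerous fault grows as 1 − exp(−λDU·t) between proof tests, its average over a proof-test interval T is close to λDU·T/2, and the PFDs of independent subsystems in series can simply be added:

```python
import math

def pfd_instantaneous(lambda_du: float, t: float) -> float:
    """Probability that a single component has an undetected dangerous fault at
    time t after its last proof test, assuming a constant failure rate."""
    return 1.0 - math.exp(-lambda_du * t)

def pfd_average(lambda_du: float, proof_test_interval: float) -> float:
    """Time-averaged PFD of a 1oo1 component over a proof-test interval T.
    For lambda_du * T << 1 this is close to lambda_du * T / 2."""
    lt = lambda_du * proof_test_interval
    return 1.0 - (1.0 - math.exp(-lt)) / lt

lambda_du = 0.02   # illustrative: 0.02 dangerous undetected failures per year
T = 1.0            # illustrative: 1-year proof-test interval
print(pfd_instantaneous(lambda_du, T))   # ~0.0198
print(pfd_average(lambda_du, T))         # ~0.0099 (close to 0.02 * 1 / 2 = 0.01)

# Independent subsystems in series (e.g. sensor, logic solver, final element):
# the system PFD is approximately the sum of the subsystem PFDs.
print(sum([1.0e-3, 1.0e-5, 9.0e-3]))     # ~1.0e-2
```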

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires only a confidence level of 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.

The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
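
As a sketch of how such an estimate can be made, assuming the conventional chi-squared bound for time-censored data, λa = χ²(a, 2n + 2)/(2T) for n failures over a cumulative time in service T, where χ²(a, k) is the a-quantile of the chi-squared distribution with k degrees of freedom (the formula variant and the numbers below are assumptions for illustration, not values from the paper):

```python
from scipy.stats import chi2

def failure_rate_with_confidence(n_failures: int, time_in_service: float,
                                 confidence: float) -> float:
    """Upper bound on a constant failure rate at the given confidence level,
    from n failures over a cumulative time in service (time-censored data):
    lambda_a = chi2(a, 2n + 2) / (2 * T)."""
    return chi2.ppf(confidence, 2 * n_failures + 2) / (2.0 * time_in_service)

T = 150.0   # illustrative: 150 device-years of accumulated service
for n in (3, 10):
    lam70 = failure_rate_with_confidence(n, T, 0.70)
    lam90 = failure_rate_with_confidence(n, T, 0.90)
    # The ratio lam90 / lam70 depends only on the number of failures recorded.
    print(n, round(lam70, 4), round(lam90, 4), round(lam90 / lam70, 2))
```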

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definitions of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard.'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.

The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate even though that rate is not fixed.

The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely, depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random, because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

PFDavg ≈ (1 − β) (λDU T)² / 3 + β λDU T / 2

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

β ≪ λDU T

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.

Clearly, the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
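
A small worked example makes the dominance of the common cause term concrete; the β of 10% and the function name below are illustrative assumptions, not figures from the paper:

```python
def pfd_avg_1oo2(lambda_du: float, T: float, beta: float):
    """Two terms of the 1oo2 approximation: independent failures and common cause."""
    independent = (1.0 - beta) * (lambda_du * T) ** 2 / 3.0
    common_cause = beta * lambda_du * T / 2.0
    return independent, common_cause

ind, ccf = pfd_avg_1oo2(lambda_du=0.02, T=1.0, beta=0.10)
print(ind, ccf, ind + ccf)
# independent term ~1.2e-4, common cause term ~1.0e-3: the common cause term
# dominates, and the subsystem PFD ends up just outside the SIL 3 band (PFD < 1e-3).
```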

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E. P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
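
As an illustration, an order-of-magnitude PFD estimate can be mapped onto the standard IEC 61511 low-demand SIL bands with a simple lookup (the function below is a sketch, not part of the paper):

```python
def sil_band_low_demand(pfd_avg: float) -> str:
    """Map an average PFD (low-demand mode) to the IEC 61511 SIL bands."""
    if 1e-2 <= pfd_avg < 1e-1:
        return "SIL 1"
    if 1e-3 <= pfd_avg < 1e-2:
        return "SIL 2"
    if 1e-4 <= pfd_avg < 1e-3:
        return "SIL 3"
    if 1e-5 <= pfd_avg < 1e-4:
        return "SIL 4"
    return "outside SIL 1 to SIL 4"

print(sil_band_low_demand(1.12e-3))   # SIL 2 (risk reduction factor ~900)
print(sil_band_low_demand(9.0e-4))    # SIL 3, but only marginally
```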

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions, the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests, T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture, because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS, and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
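
A minimal sketch of such a follow-up check, where the population size, review period and acceptance logic are illustrative assumptions rather than values from the SINTEF report:

```python
def expected_failures(lambda_assumed: float, n_devices: int, years: float) -> float:
    """Number of failures implied by the failure rate assumed in the design."""
    return lambda_assumed * n_devices * years

target = expected_failures(lambda_assumed=0.02, n_devices=40, years=2.0)
observed = 5   # failures actually recorded and analysed over the same period
print(target, observed)   # 1.6 expected vs 5 observed
if observed > target:
    print("Analyse causes, consider compensating measures, review test intervals")
```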

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 x 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
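
For example, if a final element is bypassed for a total of about 9 hours in a year for testing or repair, that unavailability alone contributes roughly 9/8760 ≈ 1 × 10⁻³ to the average PFD, which by itself would consume the entire PFD budget of a SIL 3 function.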

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing 'random' failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.

Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.

References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • l90 compared with l70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported l
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for 120582 119863119880
  • Typical feasible values for 120573
  • Typical values for 119879 depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paperpdf
    • PREVENT PREVENTABLE FAILURES
Page 42: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 3

Note 1 to entry: This term has been introduced by analogy with catalectic verses (i.e. a verse with seven feet instead of eight), which stop abruptly. A catalectic failure therefore occurs without warning and is more or less impossible to forecast by examining the item. It is the contrary of failures occurring progressively and incompletely.

Note 2 to entry: Catalectic failures characterize simple components with constant failure rates (exponential law): they remain permanently "as good as new" until they fail suddenly, completely and without warning. Most of the probabilistic models used in reliability engineering are based on catalectic failures of the individual components of the system under study (e.g. the Markovian approach).

If dangerous undetected failures occur independently at a fixed rate λDU within a population of similar components, then undetected failures will accumulate exponentially. The PFD of a component chosen at random is directly proportional to the number of failures that have accumulated in the population.

The PFD of an overall system of devices or components can be estimated by applying probability theory to combine the PFD of the individual components.
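As a sketch of the underlying arithmetic (standard constant-rate reliability theory, not expressions quoted from the standards), the probability that a single component with undetected dangerous failure rate λDU is in a failed state grows with the time since the last proof test, and its average over a proof-test interval T is approximately:

\[
PFD(t) = 1 - e^{-\lambda_{DU}\,t} \approx \lambda_{DU}\,t, \qquad PFD_{avg} \approx \frac{\lambda_{DU}\,T}{2} \quad (\lambda_{DU}\,T \ll 1)
\]

For independent elements that must all work (a series arrangement), the subsystem contributions simply add: \( PFD_{sys} \approx \sum_i PFD_{avg,i} \).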

Hardware fault tolerance

The functional safety standards IEC 61508 and IEC 61511 recognise that there is always some degree of uncertainty in the assumptions made in the calculation of failure rate and probability. For this reason the standards specify a minimum level of fault tolerance (i.e. redundancy) in the architectural design of the safety functions. The required level of redundancy increases with the risk reduction required.

Designers aim to minimise the level of fault tolerance because the addition of fault tolerance increases the complexity and cost of safety functions. Redundancy also increases the likelihood of inadvertent or spurious action, which in itself may lead to increased risk of hazards.

IEC 61508 provides two strategies for minimising the required hardware fault tolerance:

• Increasing the coverage of automatic and continuous diagnostic functions to reduce the rate of failures that remain undetected ('Route 1H')

• Increasing the confidence level in the measured failure rates to at least 90% ('Route 2H')

A confidence level of 90% effectively means that there is only a 10% chance that the true average failure rate is greater than the estimated value.

IEC 61511 Edition 2 adopts a strategy that is consistent with Route 2H, though it requires a confidence level of only 70%. However, IEC 61511 also requires documentation showing that the failure rates are credible, based on field feedback from a similar operating environment.

Failure rate confidence level

If all of the failures for a given type of equipment are recorded, the failure rate λ can be estimated with any required level of confidence by applying a χ² (chi-squared) distribution. The failure rate estimated with a confidence level of 'a' is designated λa. The confidence level indicates the chance that the actual average failure rate is less than or equal to the estimated rate.
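One common form of this estimate (a sketch assuming a time-terminated observation period, in the style of IEC 60605-6 [2], with n failures recorded over a cumulative time in service t) is:

\[
\lambda_a = \frac{\chi^2_{\,a;\,2n+2}}{2\,t}
\]

where \( \chi^2_{\,a;\,2n+2} \) is the a-quantile of the chi-squared distribution with 2n + 2 degrees of freedom. The same form gives a conservative upper bound even with no recorded failures (n = 0), which is why a rate can be claimed at all from failure-free service.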

The ratio of λ90 to λ70 depends only on the number of failures recorded. It does not depend directly on the failure rate itself or on the population size. The width of the uncertainty band becomes narrower with each recorded failure. A good estimate for λ can be obtained with as few as 3 failures. If 10 or more failures have been recorded, the overall confidence is increased: λ90 will be no more than about 20% higher than λ70.
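The narrowing of the band can be illustrated with a short calculation (an illustrative sketch only, using the time-terminated form of the estimate given above; the cumulative time in service is an assumed figure):

    # Illustrative sketch: how the ratio of lambda_90 to lambda_70 shrinks
    # as more failures are recorded (time-terminated chi-squared estimate).
    from scipy.stats import chi2

    def lambda_upper(confidence, n_failures, time_in_service_hours):
        # One-sided upper estimate of the failure rate at the given confidence level.
        return chi2.ppf(confidence, 2 * n_failures + 2) / (2.0 * time_in_service_hours)

    t = 1.0e6  # assumed cumulative time in service, hours
    for n in (1, 3, 10, 30):
        l70 = lambda_upper(0.70, n, t)
        l90 = lambda_upper(0.90, n, t)
        print(f"n={n:3d}  lambda70={l70:.2e}/h  lambda90={l90:.2e}/h  ratio={l90 / l70:.2f}")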

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies. The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady-state operating period of equipment. In general the data exclude infant mortality failures and end-of-life failures.

The failure rate tables published by OREDA [11] show that failure rates recorded by different users typically vary over one or two orders of magnitude. OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure.

The tables also show the upper and lower limits of a 90% uncertainty interval for the reported failure rates. This is the band stretching from the 5% certainty level to the 95% certainty level. The certainty level is not the same as the confidence level relating to a single dataset, but the intent is similar.

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook [8] and the exida failure rate database in the exSILentia software. The exida database is also published in the exida Safety Equipment Reliability Handbook [12]. The failure rates in these references are reasonably consistent with the OREDA data, though they do not give any indication of the wide uncertainty band.

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications. Some users consistently achieve failure rates at least 10 times lower than other users. The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design, operation and maintenance.

One reason for the variability in rates is that these datasets include all failures: systematic failures as well as random failures.

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure. The definitions for random failure vary between the different standards and references, but they are generally consistent with the dictionary definition of the word 'random':

'Made, done, or happening without method or conscious decision; haphazard'

ISO/TR 12489:2013 [5] Annex B explains that both hardware failures and human failures can occur at random. It makes it clear that not all random failures occur at a constant rate. Constant failure rates are typical in electronic components before they reach the end of their useful life. Random failures of mechanical components are caused by deterioration due to age and wear, but the failure rates are not constant.
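This distinction can be expressed with the standard Weibull hazard-rate model (a textbook illustration, not a formula taken from ISO/TR 12489):

\[
h(t) = \frac{\beta_W}{\eta}\left(\frac{t}{\eta}\right)^{\beta_W - 1}
\]

With shape parameter \( \beta_W = 1 \) this reduces to a constant rate \( h(t) = 1/\eta \) (the exponential, 'purely random' case), while \( \beta_W > 1 \) gives a rate that rises with age, typical of wear-related failures of mechanical components. (The symbol \( \beta_W \) is used here only to avoid confusion with the common cause factor β used later in this paper.)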

The definitions of random failure in ISA-TR84.00.02-2015 [6], IEC 61508-4:2010 [3] and IEC 61511-1:2016 [1] are all similar:

'…failure, occurring at a random time, which results from one or more of the possible degradation mechanisms in the hardware'

The mathematical analysis of failure probability is based on the concept of a 'random process' or 'stochastic process'. In this context the usage of the word random is narrower. For the mathematical analysis these standards all assume that:

'failure rates arising from random hardware failures can be predicted with reasonable accuracy'

Failures that occur at a fixed constant rate are purely random, but in practice only a small proportion of random failures are purely random. The following definition of a purely random process is from 'A Dictionary of Statistical Terms' by F.H.C. Marriott [13]:

'The simplest example of a stationary process where, in discrete time, all the random variables z are mutually independent. In continuous time the process is sometimes referred to as "white noise", relating to the energy characteristics of certain physical phenomena.'

The key difference is the requirement for mutual independence. Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random.

By contrast, the standards note that systematic failures cannot be characterised by a fixed rate, though to some extent the probability of systematic faults existing in a system may be estimated. The standards are reasonably consistent in their definition of systematic failure:

'…failure related in a deterministic way to a certain cause or pre-existing fault. Systematic failures can be eliminated after being detected, while random hardware failures cannot.'

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random. Very few failures are purely random.

Failures of mechanical components may seem to be random, but the causes are usually well known and understood, and are partially deterministic. The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed.

Degradation of mechanical components is not a purely random process. Degradation can be monitored.

While it might be theoretically possible to prevent failure from degradation, it is simply not practicable to prevent all failures. Inspection and maintenance can never be perfect. It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance, overhaul and renewal. They are characterised by a constant failure rate, even though that rate is not fixed.

The reasons for the wide variability in reported failure rates seem to be clear:

• Only a small proportion of failures occur at a fixed rate that cannot be changed.

• The failure rates of mechanical components vary widely depending on service conditions and on the effectiveness of maintenance.

• The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime.

• The rates of systematic failures vary widely from user to user, depending on the effectiveness of the quality management practices.

• No clear distinction is made between systematic failure and random failure in the reported failures.

Common cause failure

Where hardware fault tolerance is provided, the common cause failures of redundant devices are modelled assuming a common cause factor β. This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time.

Common cause failures are never purely random because they are not independent events. These failures are largely systematic and preventable.

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements. This is because the rates of undetected failure of final elements are usually an order of magnitude higher than the rates of undetected failure of sensors. With fault tolerance in final elements, the PFD is strongly dominated by common cause failures.

As an example, consider a safety function using actuated valves as final elements in a 1oo2 (i.e. 1 out of 2) redundant architecture. A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 0.02 failures per annum (approximately 1 failure in 50 years, or around 2.3 failures per 10⁶ hours) [8] [11] [12]. The average PFD of the valve subsystem may be approximated by:

\[
PFD_{avg} \approx (1-\beta)\,\frac{(\lambda_{DU}\,T)^2}{3} + \beta\,\frac{\lambda_{DU}\,T}{2}
\]

The last term in this equation represents the contribution of common cause failures. It will dominate the PFD unless the following is true:

\[
\beta \ll \lambda_{DU}\,T
\]

If the test interval T is 1 year and λDU = 0.02 pa, then the common cause failure term will be greater than the first term unless β < 2%, and such a low value is difficult to achieve.

For the common cause failure term to be negligible, the test interval T would usually have to be significantly longer than 1 year and/or the β-factor would have to be much less than 2%.

The SINTEF Report A26922 [10] suggests that in practice the common cause failure fraction can be expected to be greater than 10%. Typical values achieved are in the range 12% to 15%.
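A short worked calculation (an illustrative sketch using the approximation above, with an assumed β of 12% taken from the range quoted from SINTEF) shows how strongly the common cause term dominates:

    # Illustrative sketch: 1oo2 valve subsystem PFD using the approximation above.
    # Assumed values: lambda_DU = 0.02 per annum, proof-test interval T = 1 year,
    # beta = 0.12 (within the 12-15% range quoted from SINTEF A26922).
    lambda_du = 0.02   # undetected dangerous failures per annum
    T = 1.0            # proof-test interval, years
    beta = 0.12        # common cause failure fraction

    independent_term = (1 - beta) * (lambda_du * T) ** 2 / 3
    common_cause_term = beta * lambda_du * T / 2
    pfd_avg = independent_term + common_cause_term

    print(f"independent term  = {independent_term:.1e}")   # ~1e-4
    print(f"common cause term = {common_cause_term:.1e}")  # ~1e-3, dominates
    print(f"PFDavg ~ {pfd_avg:.0e}, RRF ~ {1 / pfd_avg:.0f}")

With these assumed values the common cause term is roughly ten times the independent term, and the subsystem falls in the SIL 2 range (an RRF of a few hundred) despite the redundancy.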

Clearly the PFD for undetected failures is directly proportional to β, λDU and T.

For detected failures, the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR).
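In the usual constant-rate approximation this detected-failure contribution can be sketched as:

\[
PFD_{DD} \approx \lambda_{DD} \times MTTR
\]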

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George E.P. Box:

'Essentially, all models are wrong, but some are useful.'

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β. Similarly, we can conclude that although the models used for calculating PFD are wrong, they are still useful.

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant, and the band of uncertainty spans more than one order of magnitude. The statistics suggest that the failure rates of sensors are also not constant.

The published failure rates are useful as an indicator of failure rates that may be achieved in practice.

The PFD cannot be calculated with precision because the failure rates are not constant. It is misleading to report calculated PFD with more than one significant figure of precision.

Application of Markov models, Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision. The results are much more precise but no more accurate.

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate. This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision. That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible.
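For reference, the categorisation step can be sketched as a simple lookup against the IEC 61511 low-demand-mode bands (the band limits below are the standard ones; the function itself is only an illustration):

    # Illustrative sketch: map an order-of-magnitude PFDavg estimate to the
    # IEC 61511 low-demand-mode SIL bands (RRF = 1 / PFDavg).
    def sil_band(pfd_avg: float) -> str:
        if 1e-5 <= pfd_avg < 1e-4:
            return "SIL 4"
        if 1e-4 <= pfd_avg < 1e-3:
            return "SIL 3"
        if 1e-3 <= pfd_avg < 1e-2:
            return "SIL 2"
        if 1e-2 <= pfd_avg < 1e-1:
            return "SIL 1"
        return "outside the SIL 1-4 low-demand bands"

    print(sil_band(1.3e-3))  # SIL 2 for the 1oo2 valve example above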

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation.

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction.

In the process sector the final elements are usually actuated valves, though some safety functions may be able to use electrical contactors or circuit breakers as final elements.

The example value of 0.02 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves. For contactors or circuit breakers it is feasible to achieve failure rates lower than 0.01 pa.

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (i.e. 1oo1 architecture), using either a valve or a contactor.

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture, but the PFD may be marginal, particularly if a valve is used as the final element. If actuated valves are used, attention will need to be given to minimising λDU. Alternatively, the PFD may be reduced by reducing the interval between proof tests T. If these parameters cannot be minimised, it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2.
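The marginality is easy to see from the single-element approximation given earlier (a worked illustration using the example values already quoted):

\[
PFD_{avg} \approx \frac{\lambda_{DU}\,T}{2} = \frac{0.02 \times 1}{2} = 0.01 \quad\Rightarrow\quad RRF \approx 100
\]

which sits right at the bottom of the SIL 2 range, leaving no margin for common cause effects or imperfect proof testing.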

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance. If the final elements are actuated valves, then the β-factor and/or λDU will need to be minimised to achieve even the minimum risk reduction of 1,000.

This will result in design requirements that improve independence (reducing β) and facilitate inspection, testing and maintenance (reducing λDU and improving test coverage). If reducing β and λDU is not sufficient, it may also be necessary to shorten the test interval T. It is usually impractical to implement automatic continuous diagnostics on final elements.

The end result of the PFD calculations is a set of operational performance targets for β, λDU, T and MTTR. These factors are not fixed constants, and they are all under the control of the designers and the operations and maintenance team.

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability. Evidence is required to show that appropriate techniques and measures have been applied to prevent systematic failures. This must include showing that the systems are suitable for the intended service and environment. The level of effectiveness of these techniques and measures must be shown to be sufficient for the intended SIL.

Independent certification provides good evidence of systematic capability in the design and construction of the equipment.

Evidence of prior use

Systematic capability is not limited to design. Evidence is also needed to show that operation, inspection, testing and maintenance of the systems will be effective in achieving the target failure performance.

Operators need to plan operation, inspection, testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable.

Measuring performance against benchmarks

IEC 61511-1 §5.2.5.3 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design. §16.2.9 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour.

The SINTEF Report A8788, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase' [9], provides useful guidance on the analysis of failures recorded during operation of a plant. It suggests setting target values for the expected number of failures, based on the failure rates assumed in the design. If the actual measured failure rates are higher than the target, it is necessary to analyse the causes of the failures. Compensating measures to reduce the number of future failures must be considered. The need for an increased frequency of inspection and testing should also be considered, but it is not sufficient to rely on increased frequency of testing alone.
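One simple way to turn the design assumptions into such a target (an illustrative scheme, not the method prescribed by SINTEF A8788 or IEC 61511) is to compare the observed number of failures in a group of similar devices against the number expected from the assumed rate, and flag the group for investigation when the observation exceeds a plausible statistical bound:

    # Illustrative sketch: compare observed failures against the design assumption.
    # Assumed inputs: a fleet of similar devices, the lambda_DU assumed in the
    # PFD calculation, and the accumulated service time being reviewed.
    from scipy.stats import poisson

    def review_failure_count(observed, n_devices, years_in_service, lambda_du_pa, bound=0.95):
        expected = lambda_du_pa * n_devices * years_in_service
        threshold = poisson.ppf(bound, expected)  # upper bound if the target rate held
        flagged = observed > threshold
        return expected, threshold, flagged

    expected, threshold, flagged = review_failure_count(
        observed=5, n_devices=40, years_in_service=2.0, lambda_du_pa=0.02)
    print(f"expected ~{expected:.1f}, review threshold {threshold:.0f}, investigate: {flagged}")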

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment. If the hardware failure rates are relatively low and the population of devices is small, there may be too few failures to allow a rate to be measured.

Detailed inspection, testing and condition monitoring can then be used to provide leading indicators of incipient failure. Most of the modes of failure should be predictable. It should be feasible to prevent failures through condition-based maintenance, with overhaul, renewal or replacement when required.

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.

Designing for testability and maintainability

A common problem for plant operators is that plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But, by definition, systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate the prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and the resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing lsquorandomrsquo failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions: Wide variation in failure data

Designers of safety functions estimate the probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice failures are not completely preventable, because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.

Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction, and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.

References

[1] 'Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements', IEC 61511-1:2016

[2] 'Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity', IEC 60605-6:2007

[3] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations', IEC 61508-4:2010

[4] 'Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3', IEC 61508-6:2010

[5] 'Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems', ISO/TR 12489:2013

[6] 'Safety Integrity Level (SIL) Verification of Safety Instrumented Functions', ISA-TR84.00.02-2015

[7] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook', SINTEF, Trondheim, Norway, Report A24442, 2013

[8] 'Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook', SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, 'Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase', SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., 'Common Cause Failures in Safety Instrumented Systems', SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, 'A Dictionary of Statistical Terms', 5th Edition, International Statistical Institute, Longman Scientific and Technical, 1990

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • l90 compared with l70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported l
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for 120582 119863119880
  • Typical feasible values for 120573
  • Typical values for 119879 depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paperpdf
    • PREVENT PREVENTABLE FAILURES
Page 43: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 4

The ratio of λ90 to λ70 depends only on the number of failures recorded It does not depend directly on the failure rate itself or on the population size The width of the uncertainty band becomes narrower with each recorded failure A good estimate for λ can be obtained with as few as 3 failures If 10 or more failures have been recorded the overall confidence is increased λ90 will be no more than about 20 higher than λ70

Failure rate sources

The Offshore and Onshore Reliability Data (OREDA) project provides a useful source of failure rate information gathered over many years by a consortium of oil and gas companies The preface to the OREDA handbooks clarifies that the failures considered are from the normal steady state operating period of equipment In general the data exclude infant mortality failures and end-of-life failures

The failure rate tables published by OREDA[11] show that failure rates recorded by different users typically vary over one or two orders of magnitude OREDA fits the reported failure rates into probability distributions to estimate the overall mean failure rate and standard deviation for each type of equipment and type of failure

The tables also show the upper and lower limits of a 90 uncertainty interval for the reported failure rates This is the band stretching from the 5 certainty level to the 95 certainty level The certainty level is not the same as the confidence level relating to a single dataset but the intent is similar

Two other widely used sources of failure rate data are the SINTEF PDS Data Handbook[8] and the exida failure rate database in exSILentia software The exida database is also published in the exida Safety Equipment Reliability Handbook[12] The failure rates in these references are reasonably consistent with the OREDA data though they do not give any indication of the wide uncertainty band

Variability in failure rates

It is evident from the OREDA tables that the failure rates are not constant across different users and different applications Some users consistently achieve failure rates at least 10 times lower than other users The implication here is that it may be feasible for other users to minimise their failure rates through best practice in design operation and maintenance

One reason for the variability in rates is that these datasets include all failures systematic failures as well as random failures

Random failure and systematic failure

It is important to understand the distinction between random failure and systematic failure The definitions for random failure vary between the different standards and references but they are generally consistent with the dictionary definitions of the word lsquorandomrsquo

lsquoMade done or happening without method or conscious decision haphazardrsquo

ISOTR 124892013[5] Annex B explains that both hardware failures and human failures can occur at random It makes it clear that not all random failures occur at a constant rate Constant failure rates are typical in electronic components before they reach the end of their useful life Random failures of mechanical components are caused by deterioration due to age and wear but the failure rates are not constant

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 5

The definitions of random failure in ISA TR840002-2015[6] IEC 61508-42010[3] and IEC 61511-12016[1] are all similar

lsquohellipfailure occurring at a random time which results from one or more of the possible degradation mechanisms in the hardwarersquo

The mathematical analysis of failure probability is based on the concept of a lsquorandom processrsquo or lsquostochastic processrsquo In this context the usage of the word random is narrower For the mathematical analysis these standards all assume that

lsquofailure rates arising from random hardware failures can be predicted with reasonable accuracyrsquo

Failures that occur at a fixed constant rate are purely random but in practice only a small proportion of random failures are purely random The following definition of a purely random process is from lsquoA Dictionary of Statistical Termsrsquo by FHC Marriott[13]

lsquoThe simplest example of a stationary process where in discrete time all the random variables z are mutually independent In continuous time the process is sometimes referred to as ldquowhite noiserdquo relating to the energy characteristics of certain physical phenomenarsquo

The key difference is the requirement for mutual independence Failures due to damage or deterioration from wear and age are not mutually independent and are not purely random

By contrast the standards note that systematic failures cannot be characterised by a fixed rate though to some extent the probability of systematic faults existing in a system may be estimated The standards are reasonably consistent in their definition of systematic failure

lsquohellipfailure related in a deterministic way to a certain cause or pre-existing fault Systematic failures can be eliminated after being detected while random hardware failures cannotrsquo

The distinction between random and systematic failures may be difficult to make for random failures that are not purely random Very few failures are purely random

Failures of mechanical components may seem to be random but the causes are usually well known and understood and are partially deterministic The failures can be prevented to some extent if the degradation is monitored and corrective maintenance can be completed when it is needed

Degradation of mechanical components is not a purely random process Degradation can be monitored

While it might be theoretically possible to prevent failure from degradation it is simply not practicable to prevent all failures Inspection and maintenance can never be perfect It is common practice to treat these failures as quasi-random to the extent that they are not eliminated through maintenance overhaul and renewal They are characterised by a constant failure rate even though that rate is not fixed

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 6

The reasons for the wide variability in reported failure rates seem to be clear

bull Only a small proportion of failures occur at a fixed rate that cannot be changed

bull The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

bull The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

bull The rates of systematic failures vary widely from user to user depending on the effectiveness of the quality management practices

bull No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided the common cause failures of redundant devices are modelled assuming a common cause factor β This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time

Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors With fault tolerance in final elements the PFD is strongly dominated by common cause failures

As an example consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 002 failures per annum (approximately 1 failure in 50 years or around 23 failures per 106 hours) [8] [11] [12] The average PFD of the valve subsystem may be approximated by

119875119875119875119875119875119875119860119860119860119860119860119860 asymp (1 minus 120573120573)(120582120582119863119863119863119863119879119879)2

3+ 120573120573

1205821205821198631198631198631198631198791198792

The last term in this equation represents the contribution of common cause failures It will dominate the PFD unless the following is true

120573120573 ≪ 120582120582119863119863119863119863119879119879

If the test interval T is 1 year and λDU = 002 pa then the common cause failure term will be greater than the first term unless β lt 2 and such a low value is difficult to achieve

For the common cause failure term to be negligible the test interval T would usually have to be significantly longer than 1 year andor the β-factor would have to be much less than 2

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10 Typical values achieved are in the range 12 to 15

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7

Clearly the PFD for undetected failures is directly proportional to β λDU and T

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George EP Box

lsquoEssentially all models are wrong but some are usefulrsquo

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β Similarly we can conclude that although the models used for calculating PFD are wrong they are still useful

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant and the band of uncertainty spans more than one order of magnitude The statistics suggest that the failure rates of sensors are also not constant

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant It is misleading to report calculated PFD with more than one significant figure of precision

Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements

The example value of 002 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves For contactors or circuit breakers it is feasible to achieve failure rates lower than 001 pa

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (ie 1oo1 architecture) using either a valve or contactor

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture but the PFD may be marginal particularly if a valve is used as the final element If actuated valves are used attention will need to be given to minimising λDU Alternatively the PFD may be reduced by reducing the interval

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 8

between proof tests T If these parameters cannot be minimised it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance If the final elements are actuated valves then the β-factor andor λDU will need to be minimised to achieve even the minimum risk reduction of 1000

This will result in design requirements that improve independence (reducing β) and facilitate inspection testing and maintenance (reducing λDU and improving test coverage) If reducing β and λDU is not sufficient it may also be necessary to shorten the test interval T It is usually impractical to implement automatic continuous diagnostics on final elements

The end result of the PFD calculations is a set of operational performance targets for β λDU T and MTTR These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability Evidence is required to show that appropriate techniques and measure have been applied to prevent systematic failures This must include showing that the systems are suitable for the intended service and environment The level of effectiveness in these techniques and measures must be shown to be sufficient for the intended SIL

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design Evidence is also needed to show that operation inspection testing and maintenance of the systems will be effective in achieving the target failure performance

Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable

Measuring performance against benchmarks

IEC 61511-1 sect5253 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design sect1629 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour

The SINTEF Report A8788 Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase[9] provides useful guidance on the analysis of failures recorded during operation of a plant It suggests setting target values for the expected number of failures based on the failure rates assumed in the design If the actual measured failure rates are higher than the target it is necessary to analyse the causes of the failures Compensating measures to reduce the number of future failures must be considered The need for increased frequency of inspection and testing should also be considered but it is not sufficient to rely on increased frequency of testing alone

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 9

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment If the hardware failure rates are relatively low and the population of devices is small there may be too few failures to allow a rate to be measured

Detailed inspection testing and condition monitoring can then be used to provide leading indicators of incipient failure Most of the modes of failure should be predictable It should be feasible to prevent failures through condition based maintenance with overhaul renewal or replacement when required

A FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration The likelihood of failure depends on the condition of the components

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost The equipment is not designed to facilitate accessibility for testing or for maintenance

For example on LNG compression trains access to the equipment is often constrained Some critical final elements can only be taken out of service at intervals of 5 years or more Even if the designers have provided facilities to enable on-line condition monitoring and testing the opportunities for corrective maintenance are severely constrained by the need to maintain production Deteriorating equipment must remain in service until the next planned shutdown

It is common practice to install dutystandby pairs of pumps and motors where it is critical to maintain production The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose It facilitates on-line testing of sensors It is not common practice to provide dutystandby pairing for safety function final elements but it is possible Dutystandby service can be achieved at by using 2 x 1oo1 or 2oo4 architectures The justification for the additional cost depends on the value of process downtime that can be avoided

If dutystandby pairing is not provided for critical final elements then accessibility for on-line inspection testing and maintenance must be considered in the design A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration MTTR)

Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation

Preventing systematic failures

Systematic failures cannot be characterised by failure rate The probability of systematic faults existing within a system cannot be quantified with precision But by definition systematic failures can be eliminated after being detected The implication of this is that systematic failures can be prevented

Due diligence must be demonstrated in preventing systematic failures as far as is practicable in proportion to the target level of risk reduction Plant owners need to satisfy themselves that

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 10

appropriate processes techniques methods and measures have been applied with sufficient effectiveness to eliminate systematic failures The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures and must measure and monitor the effectiveness of those steps

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures The standards describe processes techniques methods and measures to prevent avoid and detect systematic faults and resulting failures

Activities techniques measures and procedures can be selected to detect or to prevent faults and failures IEC 61511-1 sect623 requires planning of activities criteria techniques measures and procedures throughout the safety system lifecycle The rationale needs to be recorded

Preventing lsquorandomrsquo failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random Most failures that are usually classed as random are actually preventable to some extent This includes all common cause failures

Conclusions Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates The precision in the estimates is limited by the uncertainty in the failure rates OREDA statistics clearly show that the failure rates vary over at least an order of magnitude The rates are not fixed and constant The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation inspection maintenance and renewal

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures Most failures fall somewhere in the middle between the two extremes of purely random and purely deterministic Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure In practice failures are not completely preventable because access to the equipment and resources are limited Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant These assumptions enable the PFD and RRF to be estimated within an order of magnitude given reasonably effective quality control in design manufacture operation and maintenance The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 11

Strategies for improving risk reduction

1 The first priority in functional safety is to eliminate prevent avoid or control systematic failures throughout the entire system lifecycle Failures are minimised by applying conventional quality management and project management practices This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function It includes the achievement of an appropriate level of systematic capability

The level of attention to detail and the effectiveness of the processes techniques and measures must be in proportion to the target level of risk reduction SIL 3 functions need much stricter quality control than SIL 1 functions

2 The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics accessibility inspection testing maintenance and renewal

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction The requirements also depend on the cost of downtime To achieve SIL 3 safety functions will always need to be designed to enable ready access for inspection testing and maintenance

The planning for inspection and testing should be in proportion to the target level of risk reduction Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment

3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions Common cause failures dominate the PFD in all voted architectures of sensors and of final elements

4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future

If the measured failure rates are higher than the target benchmark then the reasons need to be understood and remedial action taken Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 12

References

[1] ‘Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements’, IEC 61511-1:2016.

[2] ‘Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity’, IEC 60605-6:2007.

[3] ‘Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations’, IEC 61508-4:2010.

[4] ‘Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 6: Guidelines on the application of IEC 61508-2 and IEC 61508-3’, IEC 61508-6:2010.

[5] ‘Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems’, ISO/TR 12489.

[6] ‘Safety Integrity Level (SIL) Verification of Safety Instrumented Functions’, ISA-TR84.00.02-2015.

[7] ‘Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook’, SINTEF, Trondheim, Norway, Report A24442, 2013.

[8] ‘Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook’, SINTEF, Trondheim, Norway, Report A24443, 2013.

[9] S. Hauge and M. A. Lundteigen, ‘Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase’, SINTEF, Trondheim, Norway, Report A8788, 2008.

[10] S. Hauge et al., ‘Common Cause Failures in Safety Instrumented Systems’, SINTEF, Trondheim, Norway, Report A26922, 2015.

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015.

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015.

[13] F.H.C. Marriott, ‘A Dictionary of Statistical Terms’, 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990.

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • l90 compared with l70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported l
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for 120582 119863119880
  • Typical feasible values for 120573
  • Typical values for 119879 depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paperpdf
    • PREVENT PREVENTABLE FAILURES
Page 45: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 6

The reasons for the wide variability in reported failure rates seem to be clear

bull Only a small proportion of failures occur at a fixed rate that cannot be changed

bull The failure rates of mechanical components vary widely depending on service conditions and on effectiveness of maintenance

bull The failure rates of mechanical components can only be predicted with any reasonable accuracy within a given environment and maintenance regime

bull The rates of systematic failures vary widely from user to user depending on the effectiveness of the quality management practices

bull No clear distinction is made between systematic failure and random failure in the reported failures

Common cause failure

Where hardware fault tolerance is provided the common cause failures of redundant devices are modelled assuming a common cause factor β This is a fixed constant factor representing the fraction of failures with a common cause that will affect all of the devices at about the same time

Common cause failures are never purely random because they are not independent events These failures are largely systematic and preventable

Factors that dominate PFD

In the process sector the PFD of a safety function is usually strongly dominated by the PFD of the final elements This is because the rates of undetected failure of final elements are usually an order of magnitude higher than rates of undetected failure of sensors With fault tolerance in final elements the PFD is strongly dominated by common cause failures

As an example consider a safety function using actuated valves as final elements in a 1oo2 (ie 1 out of 2) redundant architecture A typical rate of undetected dangerous failures λDU in an actuated valve assembly is around 002 failures per annum (approximately 1 failure in 50 years or around 23 failures per 106 hours) [8] [11] [12] The average PFD of the valve subsystem may be approximated by

119875119875119875119875119875119875119860119860119860119860119860119860 asymp (1 minus 120573120573)(120582120582119863119863119863119863119879119879)2

3+ 120573120573

1205821205821198631198631198631198631198791198792

The last term in this equation represents the contribution of common cause failures It will dominate the PFD unless the following is true

120573120573 ≪ 120582120582119863119863119863119863119879119879

If the test interval T is 1 year and λDU = 002 pa then the common cause failure term will be greater than the first term unless β lt 2 and such a low value is difficult to achieve

For the common cause failure term to be negligible the test interval T would usually have to be significantly longer than 1 year andor the β-factor would have to be much less than 2

The SINTEF Report A26922[10] suggests that in practice the common cause failure fraction can be expected to be greater than 10 Typical values achieved are in the range 12 to 15

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7

Clearly the PFD for undetected failures is directly proportional to β λDU and T

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George EP Box

lsquoEssentially all models are wrong but some are usefulrsquo

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β Similarly we can conclude that although the models used for calculating PFD are wrong they are still useful

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant and the band of uncertainty spans more than one order of magnitude The statistics suggest that the failure rates of sensors are also not constant

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant It is misleading to report calculated PFD with more than one significant figure of precision

Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements

The example value of 002 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves For contactors or circuit breakers it is feasible to achieve failure rates lower than 001 pa

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (ie 1oo1 architecture) using either a valve or contactor

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture but the PFD may be marginal particularly if a valve is used as the final element If actuated valves are used attention will need to be given to minimising λDU Alternatively the PFD may be reduced by reducing the interval

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 8

between proof tests T If these parameters cannot be minimised it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance If the final elements are actuated valves then the β-factor andor λDU will need to be minimised to achieve even the minimum risk reduction of 1000

This will result in design requirements that improve independence (reducing β) and facilitate inspection testing and maintenance (reducing λDU and improving test coverage) If reducing β and λDU is not sufficient it may also be necessary to shorten the test interval T It is usually impractical to implement automatic continuous diagnostics on final elements

The end result of the PFD calculations is a set of operational performance targets for β λDU T and MTTR These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability Evidence is required to show that appropriate techniques and measure have been applied to prevent systematic failures This must include showing that the systems are suitable for the intended service and environment The level of effectiveness in these techniques and measures must be shown to be sufficient for the intended SIL

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design Evidence is also needed to show that operation inspection testing and maintenance of the systems will be effective in achieving the target failure performance

Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable

Measuring performance against benchmarks

IEC 61511-1 sect5253 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design sect1629 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour

The SINTEF Report A8788 Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase[9] provides useful guidance on the analysis of failures recorded during operation of a plant It suggests setting target values for the expected number of failures based on the failure rates assumed in the design If the actual measured failure rates are higher than the target it is necessary to analyse the causes of the failures Compensating measures to reduce the number of future failures must be considered The need for increased frequency of inspection and testing should also be considered but it is not sufficient to rely on increased frequency of testing alone

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 9

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment If the hardware failure rates are relatively low and the population of devices is small there may be too few failures to allow a rate to be measured

Detailed inspection testing and condition monitoring can then be used to provide leading indicators of incipient failure Most of the modes of failure should be predictable It should be feasible to prevent failures through condition based maintenance with overhaul renewal or replacement when required

A FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration The likelihood of failure depends on the condition of the components

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost The equipment is not designed to facilitate accessibility for testing or for maintenance

For example on LNG compression trains access to the equipment is often constrained Some critical final elements can only be taken out of service at intervals of 5 years or more Even if the designers have provided facilities to enable on-line condition monitoring and testing the opportunities for corrective maintenance are severely constrained by the need to maintain production Deteriorating equipment must remain in service until the next planned shutdown

It is common practice to install dutystandby pairs of pumps and motors where it is critical to maintain production The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose It facilitates on-line testing of sensors It is not common practice to provide dutystandby pairing for safety function final elements but it is possible Dutystandby service can be achieved at by using 2 x 1oo1 or 2oo4 architectures The justification for the additional cost depends on the value of process downtime that can be avoided

If dutystandby pairing is not provided for critical final elements then accessibility for on-line inspection testing and maintenance must be considered in the design A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration MTTR)

Designing the system to facilitate inspection testing and maintenance enables both the λDU and the MTTR to be minimised in operation

Preventing systematic failures

Systematic failures cannot be characterised by failure rate The probability of systematic faults existing within a system cannot be quantified with precision But by definition systematic failures can be eliminated after being detected The implication of this is that systematic failures can be prevented

Due diligence must be demonstrated in preventing systematic failures as far as is practicable in proportion to the target level of risk reduction Plant owners need to satisfy themselves that

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 10

appropriate processes techniques methods and measures have been applied with sufficient effectiveness to eliminate systematic failures The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures and must measure and monitor the effectiveness of those steps

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures The standards describe processes techniques methods and measures to prevent avoid and detect systematic faults and resulting failures

Activities techniques measures and procedures can be selected to detect or to prevent faults and failures IEC 61511-1 sect623 requires planning of activities criteria techniques measures and procedures throughout the safety system lifecycle The rationale needs to be recorded

Preventing lsquorandomrsquo failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random Most failures that are usually classed as random are actually preventable to some extent This includes all common cause failures

Conclusions Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed constant failure rates The precision in the estimates is limited by the uncertainty in the failure rates OREDA statistics clearly show that the failure rates vary over at least an order of magnitude The rates are not fixed and constant The reported failure rates serve as an indication of the failure rates that can be feasibly achieved with established practices for operation inspection maintenance and renewal

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures Most failures fall somewhere in the middle between the two extremes of purely random and purely deterministic Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure In practice failures are not completely preventable because access to the equipment and resources are limited Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures

PFD Calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant These assumptions enable the PFD and RRF to be estimated within an order of magnitude given reasonably effective quality control in design manufacture operation and maintenance The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 11

Strategies for improving risk reduction

1 The first priority in functional safety is to eliminate prevent avoid or control systematic failures throughout the entire system lifecycle Failures are minimised by applying conventional quality management and project management practices This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function It includes the achievement of an appropriate level of systematic capability

The level of attention to detail and the effectiveness of the processes techniques and measures must be in proportion to the target level of risk reduction SIL 3 functions need much stricter quality control than SIL 1 functions

2 The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics accessibility inspection testing maintenance and renewal

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction The requirements also depend on the cost of downtime To achieve SIL 3 safety functions will always need to be designed to enable ready access for inspection testing and maintenance

The planning for inspection and testing should be in proportion to the target level of risk reduction Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment

3 Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions Common cause failures dominate the PFD in all voted architectures of sensors and of final elements

4 In the operations phase measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design inspection testing maintenance and renewal The measured failure rates should be compared with the rates assumed in the PFD calculations

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future

If the measured failure rates are higher than the target benchmark then the reasons need to be understood and remedial action taken Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 12

References

[1] lsquoFunctional safety ndash Safety instrumented systems for the process industry sector ndash Part 1 Framework definitions system hardware and application programming requirementsrsquo IEC 61511-12016

[2] lsquoEquipment reliability testing Part 6 Tests for the validity and estimation of the constant failure rate and constant failure intensityrsquo IEC 60605-62007

[3] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems ndash Part 4 Definitions and abbreviationsrsquo IEC 61508-42010

[4] lsquoFunctional safety of electricalelectronicprogrammable electronic safety-related systems - Guidelines on the application of IEC 61508-2 and IEC 61508-3rsquo IEC 61508-62010

[5] lsquoPetroleum petrochemical and natural gas industries mdash Reliability modelling and calculation of safety systemsrsquo ISOTR 12489

[6] lsquoSafety Integrity Level (SIL) Verification of Safety Instrumented Functionsrsquo ISA-TR840002-2015

[7] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Method Handbookrsquo SINTEF Trondheim Norway Report A24442 2013

[8] lsquoReliability Prediction Method for Safety Instrumented Systems ndash PDS Data Handbookrsquo SINTEF Trondheim Norway Report A24443 2013

[9] S Hauge and M A Lundteigen lsquoGuidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phasersquo SINTEF Trondheim Norway Report A8788 2008

[10] S Hauge et al lsquoCommon Cause Failures in Safety Instrumented Systemsrsquo SINTEF Trondheim Norway Report A26922 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook Vol 1 6th ed SINTEF Technology and Society Department of Safety Research Trondheim Norway 2015

[12] Safety Equipment Reliability Handbook 4th ed exidacom LLC Sellersville PA 2015

[13] FHC Marriott lsquoA Dictionary of Statistical Termsrsquo 5th Edition International Statistical Institute Longman Scientific and Technical 1990

  • Preventing preventable failures
  • The need for credible failure rates
  • lsquoWrong but usefulrsquo
  • We need a different emphasis
  • Hardware fault tolerance
  • HFT in IEC 61511 Edition 2
  • HFT in IEC 61511 Edition 2
  • So what is confidence level
  • l90 compared with l70
  • Can we be confident
  • Where can we find credible data
  • Combining data from multiple sources
  • Wide variation in reported l
  • A long-standing debate
  • The intent of the standards
  • What is random
  • What is random
  • Guidance from ISOTR 12489
  • Guidance from ISOTR 12489
  • Purely random failure
  • Definition in IEC 61511 and IEC 61508
  • Random or systematic failures
  • Quasi-random hardware failures
  • Failure probability depends on failure rate
  • The basic assumption in failure probability
  • PFD for 1oo2 final elements
  • PFD for 1oo2 final elements
  • Dependence on failure rate
  • Managing and reducing failures
  • Typical feasible values for 120582 119863119880
  • Typical feasible values for 120573
  • Typical values for 119879 depend on application
  • What architecture is typically needed
  • Design assumptions set performance benchmarks
  • Monitoring performance is essential
  • Design for maintainability to enable performance
  • Suitability of components
  • Emphasis on prevention instead of calculation
  • Slide Number 39
  • Prevent Preventable Failures - paperpdf
    • PREVENT PREVENTABLE FAILURES
Page 46: Preventing preventable failures - I&E Systems€¦ · exSILentia. software, and the SERH ‘Safety Equipment Reliability Handbook’ • FARADIP database • Users’ own failure

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 7

Clearly the PFD for undetected failures is directly proportional to β λDU and T

For detected failures the PFD is directly proportional to the rate of detected failures λDD and the mean time to restoration (MTTR)

Wrong but useful

The SINTEF Report A26922 includes the pertinent quote from George EP Box

lsquoEssentially all models are wrong but some are usefulrsquo

The quote is in the context of a discussion regarding different models that may be used to estimate the common cause factor β Similarly we can conclude that although the models used for calculating PFD are wrong they are still useful

The OREDA failure statistics show that failure rates of mechanical components are not fixed and constant and the band of uncertainty spans more than one order of magnitude The statistics suggest that the failure rates of sensors are also not constant

The published failure rates are useful as an indicator of failure rates that may be achieved in practice

The PFD cannot be calculated with precision because the failure rates are not constant It is misleading to report calculated PFD with more than one significant figure of precision

Application of Markov models Petri nets and Monte Carlo simulations leads to an unfounded expectation of precision The results are much more precise but no more accurate

The PFD calculated for any safety function is necessarily limited to an order-of-magnitude estimate This is sufficiently precise to estimate the risk reduction factor with at best one significant figure of precision That precision is enough to categorise the function by the safety integrity level (SIL) that is feasible

The failure rate that is assumed in the calculation can then be used to set a benchmark for the failure rate to be achieved in operation

Setting feasible targets

During the architectural design of safety functions the PFD is calculated to show that it will be feasible to achieve and maintain the required risk reduction

In the process sector the final elements are usually actuated valves though some safety functions may be able to use electrical contactors or circuit breakers as final elements

The example value of 002 pa quoted above for λDU is a typical failure rate that is feasible to achieve for infrequently operated actuated valves For contactors or circuit breakers it is feasible to achieve failure rates lower than 001 pa

The SIL 1 range of risk reduction can be achieved without hardware fault tolerance (ie 1oo1 architecture) using either a valve or contactor

It is feasible to achieve the SIL 2 range of risk reduction with 1oo1 architecture but the PFD may be marginal particularly if a valve is used as the final element If actuated valves are used attention will need to be given to minimising λDU Alternatively the PFD may be reduced by reducing the interval

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 8

between proof tests T If these parameters cannot be minimised it may be necessary to use a 1oo2 architecture for the final elements in order to achieve SIL 2

For SIL 3 risk reduction it will always be necessary to use at least a 1oo2 architecture because of the IEC 61511 requirement for hardware fault tolerance If the final elements are actuated valves then the β-factor andor λDU will need to be minimised to achieve even the minimum risk reduction of 1000

This will result in design requirements that improve independence (reducing β) and facilitate inspection testing and maintenance (reducing λDU and improving test coverage) If reducing β and λDU is not sufficient it may also be necessary to shorten the test interval T It is usually impractical to implement automatic continuous diagnostics on final elements

The end result of the PFD calculations is a set of operational performance targets for β λDU T and MTTR These factors are not fixed constants and are all under the control of the designers and the operations and maintenance team

Independent certification of systematic capability

The system designers need to demonstrate appropriate systematic capability Evidence is required to show that appropriate techniques and measure have been applied to prevent systematic failures This must include showing that the systems are suitable for the intended service and environment The level of effectiveness in these techniques and measures must be shown to be sufficient for the intended SIL

Independent certification provides good evidence of systematic capability in the design and construction of the equipment

Evidence of prior use

Systematic capability is not limited to design Evidence is also needed to show that operation inspection testing and maintenance of the systems will be effective in achieving the target failure performance

Operators need to plan operation inspection testing and maintenance based on a volume of prior operating experience that is sufficient to show that the target failure rates are achievable

Measuring performance against benchmarks

IEC 61511-1 sect5253 requires operators to monitor and assess whether reliability parameters of the safety instrumented systems (SIS) are in accordance with those assumed during the design sect1629 requires operators to monitor the failures and failure modes of equipment forming part of the SIS and to analyse discrepancies between expected behaviour and actual behaviour

The SINTEF Report A8788 Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase[9] provides useful guidance on the analysis of failures recorded during operation of a plant It suggests setting target values for the expected number of failures based on the failure rates assumed in the design If the actual measured failure rates are higher than the target it is necessary to analyse the causes of the failures Compensating measures to reduce the number of future failures must be considered The need for increased frequency of inspection and testing should also be considered but it is not sufficient to rely on increased frequency of testing alone

Presented at the 13th International TUumlV Rheinland Symposium - May 15-16 2018 Cologne Germany 9

Anticipating rather than measuring failures

It may be difficult to measure meaningful failure rates on some types of critical equipment If the hardware failure rates are relatively low and the population of devices is small there may be too few failures to allow a rate to be measured

Detailed inspection testing and condition monitoring can then be used to provide leading indicators of incipient failure Most of the modes of failure should be predictable It should be feasible to prevent failures through condition based maintenance with overhaul renewal or replacement when required

An FMEDA study can identify the key parameters that need to be monitored to detect deterioration and incipient failure. The uncertainty in failure rate can be mitigated through a better understanding of the likely failure modes of components and of the measurable conditions that are symptomatic of component deterioration. The likelihood of failure depends on the condition of the components.
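
The sketch below shows the kind of simple leading-indicator check that an FMEDA might motivate for an actuated valve: a measurable symptom of deterioration (here, hypothetical valve closing times recorded during proof tests or partial-stroke tests) is trended against a baseline and flagged before the valve fails to operate on demand. The baseline, threshold and data are assumptions for illustration only.

```python
# Minimal sketch, assuming closing-time trending as a leading indicator.
# Baseline, threshold and history values are hypothetical.

from statistics import mean

BASELINE_S = 2.0        # assumed as-new closing time, seconds
ALARM_FACTOR = 1.5      # assumed threshold: 50 % slower than baseline

def incipient_failure(closing_times, window=3):
    """True if the average of the most recent tests breaches the threshold."""
    return mean(closing_times[-window:]) > ALARM_FACTOR * BASELINE_S

history = [2.1, 2.2, 2.4, 2.9, 3.2, 3.4]    # hypothetical trend, seconds
if incipient_failure(history):
    print("Leading indicator breached: schedule overhaul or replacement.")
```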

Designing for testability and maintainability

A common problem for plant operators is that the plants are designed to minimise initial construction cost. The equipment is not designed to facilitate accessibility for testing or for maintenance.

For example, on LNG compression trains access to the equipment is often constrained. Some critical final elements can only be taken out of service at intervals of 5 years or more. Even if the designers have provided facilities to enable on-line condition monitoring and testing, the opportunities for corrective maintenance are severely constrained by the need to maintain production. Deteriorating equipment must remain in service until the next planned shutdown.

It is common practice to install duty/standby pairs of pumps and motors where it is critical to maintain production. The 2oo3 voting architecture used for safety function sensors fulfils a similar purpose: it facilitates on-line testing of sensors. It is not common practice to provide duty/standby pairing for safety function final elements, but it is possible. Duty/standby service can be achieved by using 2 × 1oo1 or 2oo4 architectures. The justification for the additional cost depends on the value of process downtime that can be avoided.

If duty/standby pairing is not provided for critical final elements, then accessibility for on-line inspection, testing and maintenance must be considered in the design. A safety function cannot provide any risk reduction while it is bypassed or taken out of service during normal plant operation. The probability of failure on demand is directly proportional to the time that the safety function is out of service (characterised as mean time to restoration, MTTR).
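
A rough worked example of that proportionality, using assumed figures, is sketched below.

```python
# Rough illustration with assumed figures: the average PFD contribution of time
# spent bypassed or out of service is simply the out-of-service fraction.

bypass_hours_per_year = 24.0                    # hypothetical: one day of bypass per year
pfd_from_bypass = bypass_hours_per_year / 8760.0
print(f"PFD contribution from bypass alone ~ {pfd_from_bypass:.1e}")
# ~2.7e-3, already above the 1e-3 upper bound of the SIL 3 band (RRF 1,000)
```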

Designing the system to facilitate inspection, testing and maintenance enables both the λDU and the MTTR to be minimised in operation.

Preventing systematic failures

Systematic failures cannot be characterised by a failure rate. The probability of systematic faults existing within a system cannot be quantified with precision. But by definition systematic failures can be eliminated after being detected. The implication of this is that systematic failures can be prevented.

Due diligence must be demonstrated in preventing systematic failures, as far as is practicable, in proportion to the target level of risk reduction. Plant owners need to satisfy themselves that appropriate processes, techniques, methods and measures have been applied with sufficient effectiveness to eliminate systematic failures. The attention given to SIL 3 safety functions needs to be proportionately higher than for SIL 1 functions. Owners and operators need to be able to demonstrate that reasonable steps have been taken to prevent failures, and must measure and monitor the effectiveness of those steps.

The main purpose of IEC 61508 and IEC 61511 is to provide management frameworks that facilitate prevention of preventable failures. The standards describe processes, techniques, methods and measures to prevent, avoid and detect systematic faults and resulting failures.

Activities, techniques, measures and procedures can be selected to detect or to prevent faults and failures. IEC 61511-1 §6.2.3 requires planning of activities, criteria, techniques, measures and procedures throughout the safety system lifecycle. The rationale needs to be recorded.

Preventing ‘random’ failures

This same approach of active prevention should be extended to include the management of the random failures that are not purely random. Most failures that are usually classed as random are actually preventable to some extent. This includes all common cause failures.

Conclusions

Wide variation in failure data

Designers of safety functions estimate probability of failure by assuming that failures occur randomly and with fixed, constant failure rates. The precision in the estimates is limited by the uncertainty in the failure rates. OREDA statistics clearly show that the failure rates vary over at least an order of magnitude. The rates are not fixed and constant. The reported failure rates serve as an indication of the failure rates that can feasibly be achieved with established practices for operation, inspection, maintenance and renewal.

Confusion between random and systematic

There is no clear and consistent definition to distinguish random failures from systematic failures. Most failures fall somewhere in the middle, between the two extremes of purely random and purely deterministic. Most failures are preventable if the failure mode can be anticipated and inspections and tests can be designed to detect incipient failure. In practice, failures are not completely preventable because access to the equipment and resources are limited. Failures are treated as quasi-random to the extent that it is not practicable to prevent the failures.

PFD calculations set feasible performance benchmarks

PFD calculations are based on the coarse assumptions that failure rates and the proportion of common cause failures are fixed and constant. These assumptions enable the PFD and RRF to be estimated within an order of magnitude, given reasonably effective quality control in design, manufacture, operation and maintenance. The onus is then on the operators to demonstrate that the failure rates of the equipment in operation are no greater than the failure rates that were assumed in the PFD calculation.

Strategies for improving risk reduction

1. The first priority in functional safety is to eliminate, prevent, avoid or control systematic failures throughout the entire system lifecycle. Failures are minimised by applying conventional quality management and project management practices. This includes designing and specifying the equipment to be suitable for the intended service conditions and the intended function. It includes the achievement of an appropriate level of systematic capability.

The level of attention to detail and the effectiveness of the processes, techniques and measures must be in proportion to the target level of risk reduction. SIL 3 functions need much stricter quality control than SIL 1 functions.

2. The second priority in functional safety is to enable early detection and effective treatment of the deterioration that cannot be prevented. The design of the safety functions needs to take into account the expected failure modes and to include requirements for diagnostics, accessibility, inspection, testing, maintenance and renewal.

The requirements for accessibility depend on the target failure rates that need to be achieved for the target risk reduction. The requirements also depend on the cost of downtime. To achieve SIL 3, safety functions will always need to be designed to enable ready access for inspection, testing and maintenance.

The planning for inspection and testing should be in proportion to the target level of risk reduction. Planning for maintenance and renewal should also be in proportion to the target level of risk reduction and should be based on the measured condition of the equipment.

3. Avoidance and prevention of common cause failures is of primary importance in the design and operation of safety functions. Common cause failures dominate the PFD in all voted architectures of sensors and of final elements.

4. In the operations phase, measurements of failure rates and of equipment deterioration provide essential feedback on the effectiveness of the design, inspection, testing, maintenance and renewal. The measured failure rates should be compared with the rates assumed in the PFD calculations.

Root cause analysis of all failures is necessary to identify common cause failures and to identify strategies for preventing similar failures in the future.

If the measured failure rates are higher than the target benchmark, then the reasons need to be understood and remedial action taken. Leading indicators of failure can be developed based on the measurement of deterioration and the anticipation of incipient failure.

References

[1] ‘Functional safety – Safety instrumented systems for the process industry sector – Part 1: Framework, definitions, system, hardware and application programming requirements’, IEC 61511-1:2016

[2] ‘Equipment reliability testing – Part 6: Tests for the validity and estimation of the constant failure rate and constant failure intensity’, IEC 60605-6:2007

[3] ‘Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 4: Definitions and abbreviations’, IEC 61508-4:2010

[4] ‘Functional safety of electrical/electronic/programmable electronic safety-related systems – Guidelines on the application of IEC 61508-2 and IEC 61508-3’, IEC 61508-6:2010

[5] ‘Petroleum, petrochemical and natural gas industries – Reliability modelling and calculation of safety systems’, ISO/TR 12489

[6] ‘Safety Integrity Level (SIL) Verification of Safety Instrumented Functions’, ISA-TR84.00.02-2015

[7] ‘Reliability Prediction Method for Safety Instrumented Systems – PDS Method Handbook’, SINTEF, Trondheim, Norway, Report A24442, 2013

[8] ‘Reliability Prediction Method for Safety Instrumented Systems – PDS Data Handbook’, SINTEF, Trondheim, Norway, Report A24443, 2013

[9] S. Hauge and M. A. Lundteigen, ‘Guidelines for follow-up of Safety Instrumented Systems (SIS) in the operating phase’, SINTEF, Trondheim, Norway, Report A8788, 2008

[10] S. Hauge et al., ‘Common Cause Failures in Safety Instrumented Systems’, SINTEF, Trondheim, Norway, Report A26922, 2015

[11] OREDA Offshore and Onshore Reliability Data Handbook, Vol. 1, 6th ed., SINTEF Technology and Society, Department of Safety Research, Trondheim, Norway, 2015

[12] Safety Equipment Reliability Handbook, 4th ed., exida.com LLC, Sellersville, PA, 2015

[13] F.H.C. Marriott, ‘A Dictionary of Statistical Terms’, 5th ed., International Statistical Institute, Longman Scientific and Technical, 1990
