100331 PLC DCS Maintenance Plan and Reliability

Automation systems (PLC, DCS) maintenance plan and reliability

By Carlo Lebrun (Ecisgroup Spa)

Some definitions first

Maintenance could be defined as:

“All actions performed with the objective to keep an item in its full functional state.”

In general maintenance can be further defined according to the following types:

1. Breakdown maintenance: An item gets repaired when it breaks or fails.

2. Preventive maintenance: regular activity (cleaning, lubricating, replacing, etc.) is performed in order to preserve the functional condition. It can be further split into periodic maintenance (based on elapsed calendar time) and predictive maintenance (based on working conditions, e.g. 2000 hrs running at 90% of load).

3. Corrective maintenance: It improves equipment with design issues. Weak design requires improvement to keep or to improve functionality. Note: the term is often but improperly used instead of “breakdown maintenance”.

4. Maintenance prevention: design actions are performed to reduce maintenance requirements, based on the analysis of past experience with similar equipment (lessons learned).

And to keep our installed equipment in good shape we should perform at least the first two:

‐ restore the full functionality every time a certain type of failure is detected by the system itself

‐ regularly inspect the system, and re‐establish the full functionality in case the inspection shows some performance decrease

Today the trend is toward Reliability Centered Maintenance (RCM), which seeks the optimal balance between low maintenance cost and high equipment reliability.

The technical standard SAE JA 1011 “Evaluation Criteria for RCM Processes” sets the criteria to implement Reliability Centered Maintenance. It is based on the seven questions below, to be answered in the given sequence:

1. What is the item supposed to do, and what are its associated performance standards?

2. In what ways can it fail to provide the required functions?

3. What are the events that cause each failure?

4. What happens when each failure occurs?

5. In what way does each failure matter?

6. What systematic task can be performed proactively to prevent, or to decrease the consequences of the failure?

7. What should be done if a suitable proactive task cannot be found?

Maintenance improves Reliability

Automation systems like PLC, DCS, ESD, HIPPS logic solvers, etc. require maintenance just as any other equipment. Regular preventive maintenance helps to keep the desired functionality and minimize the risk of failures.

Although they are apparently static and reliable devices, they are subject to changes in their working conditions which may degrade or compromise their functionality.

When the automation equipment is also in charge of a protective function (like a HIPPS logic solver), we call it a Safety Instrumented System (SIS), and we enter the field of functional safety (the portion of safety that depends on the correct functioning of equipment).

As a consequence, if we deal with functional safety requirements (often expressed as a SIL classification), we are obliged to follow and document a precise maintenance plan, together with a periodic total or partial verification of the equipment functionality at the “proof test interval”.

The declared reliability of a PLC (or other “logic solver”) is not an intrinsic and stable feature. It is a dynamic condition, which can change due to several causes, including whether the planned maintenance is actually carried out. The usual metric is the Probability of Failure on Demand (PFD). The PFD is normally considered to increase with time since the last test. A 100% proof test of functionality makes the PFD restart from its lowest level. The SIL classification is based on the average PFD over the safety system lifecycle.

Maintenance will ensure that any known possible degradation of functionality is prevented or detected on time.
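As a rough illustration of the behaviour described above, the following Python sketch models the PFD of a single, non‐redundant component that grows with the time since the last perfect proof test, and averages it over the proof test interval T1. The exponential model, the dangerous undetected failure rate of 2 failures / 10^6 hours and the function names are assumptions made for this example only. The SIL bands are the low‐demand average PFD ranges of IEC 61508.

```python
import math

# Low-demand SIL bands for the average PFD (lower bound inclusive, upper exclusive).
SIL_BANDS = {4: (1e-5, 1e-4), 3: (1e-4, 1e-3), 2: (1e-3, 1e-2), 1: (1e-2, 1e-1)}


def pfd_since_test(lambda_du, hours_since_test):
    """PFD of one component, assuming a constant dangerous undetected failure rate."""
    return -math.expm1(-lambda_du * hours_since_test)


def pfd_avg(lambda_du, t1_hours):
    """Average of pfd_since_test over one proof test interval [0, T1].

    For small lambda_du * T1 this is close to the usual lambda_du * T1 / 2.
    """
    x = lambda_du * t1_hours
    return 1.0 + math.expm1(-x) / x


def sil_band(pfd):
    """Return the SIL whose band contains the given average PFD, or None."""
    for sil, (low, high) in SIL_BANDS.items():
        if low <= pfd < high:
            return sil
    return None


LAMBDA_DU = 2e-6  # assumed value, failures per hour (not taken from any data sheet)

for years in (1, 2):
    average = pfd_avg(LAMBDA_DU, years * 8760)
    print(f"T1 = {years} year(s): average PFD = {average:.5f}, SIL band = {sil_band(average)}")
```

With these assumed numbers, a yearly proof test keeps the average PFD in the SIL 2 band, while stretching the interval to two years lets it drift into the SIL 1 band. A real SIL verification covers the whole safety function (sensors, logic solver, final elements), not the logic solver alone.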

The value of declared failure rates

We often hear that electronic equipment has its own failure rate. It is not always easy to understand if a failure of an electronic module is due to an intrinsic feature or to some external cause. But indeed we have to admit that we are not really able to track all possible causes. So let’s accept there is an intrinsic failure rate for each piece of equipment.

When we read a manufacturer’s IEC 61508 certification for a component, we should read the declared data as the intrinsic failure rates, assuming perfect design, installation, environmental conditions, etc.

And if the manufacturer declares only the SIL or the PFD values, that’s even worse: these should be taken as the most optimistic data, assuming some hidden proof test interval, some hidden percentage of common cause failures, etc.

We should carefully evaluate the applicability of all these assumptions.

Do you know any PLC/DCS manufacturer who declared their equipment failure rates instead of just the PFD values in their IEC61508 certification? I don’t.

Do you know any PLC/DCS manufacturer who declared their assumptions in their IEC61508 certification? I don’t.

Negative answers, but it doesn’t really matter too much: their evaluations are usually theoretical calculations based on the detailed list of electronic subcomponents, each of them with a specific failure rate.

Can you believe the result? I don’t. But you can use it as a starting value …

Therefore the IEC61508 manufacturer declaration should allow us to estimate the “ideal” failure rate of the equipment in use.

As an alternative we can use a failure rate database issued by some organization or authority working with functional safety (Exida, OREDA, OLF, etc.). At least these guys clearly declare their assumptions.

Here is a portion of Table A.3 from “Application of IEC 61508 and IEC 61511 in the Norwegian Petroleum Industry”, Rev. 2, by OLF (also called the “OLF‐070 guidelines”, see www.itk.ntnu.no/sil )

Possible causes of failures

As we said, the declared failure rates mostly assume that everything is perfect outside the boundaries of our component, which is not always true.

In practice several other causes may affect the frequency of failures.

We recall here several possible issues that we are aware of (there are surely others!):

‐ Ambient (and/or cabinets) temperature above tolerated limits: overheating can degrade components, and even cause fires. Normally different conditions are specified for operation and for storage.

‐ Ambient (and/or cabinets) temperature oscillations: thermal expansion could cause materials degradation or repeated movement, up to loss of electrical contact.

‐ Moisture / humidity above tolerated limits: Humidity can provoke short circuits, and facilitate corrosion or materials degradation.

‐ Dust: deposits can limit the thermal exchange and promote overheating. Since our equipment is normally installed in ventilated cabinets, dust can also plug the filters and reduce the ventilation effect. In some cases dust can be conductive (from metals or from coal) and can provoke short circuits.

‐ Corrosion: exposure to corrosive environments can cause materials degradation. Corrosion may often be caused or enhanced by dust and by humidity (even when below the tolerated limits!).

‐ Electromagnetic interference: radio waves can affect electrical equipment. Nowadays technology improvements have significantly reduced the importance of this issue.

‐ Power Supply stability: fluctuations beyond the specified limits can affect PLC/DCS functionality. The use of separate power supplies for separate functions is generally recommended. Avoid using the automation equipment power supply for other purposes, e.g. soldering, welding or brewing coffee.

‐ Vibration: vibration could provoke physical breaks and components or cables disconnection, with loss of electrical contact. Vibration could be caused by earthquakes, but also by machinery in operation and by human activities during construction.

‐ Age: it may degrade some materials, e.g. the filling and insulation (dielectric materials) of capacitors and transformers. This effect is normally countered by replacing some components (typically power supply units) at a specified time interval.

‐ Grounding / Earthing: bad grounding can cause power supply instability and also network communication problems. Complete isolation from the rest of the plant is essential. The neutral‐to‐ground voltage should be less than 1 V. All metal parts of the cabinet and in the cabinet shall be connected to ground.

‐ Induced currents: incorrect routing of power cables close to automation system cables (e.g. I/O wirings) can cause induced currents which may lead to signal spikes or even I/O modules failure. Shielded and grounded cables for power and control circuits are usually recommended.

‐ Welding and other high‐current activities: do not allow them near the cabinet, and do not power them from the cabinet.

‐ I/O short circuits: I/O cards should be isolated from external shorts.

‐ Wildlife: ants, rats, squirrels and bats can damage equipment and gnaw cables. Sometimes animals do not cause direct damage but can hinder maintenance (wasps, etc.).

‐ Design problems: let’s not forget they can exist.

‐ Maintenance: mistakes, distractions, underestimated impact on other equipment, missing or wrong documentation, etc. can also cause actions on the wrong piece of equipment.

‐ Unauthorized human access: deliberate sabotage or accidental damage should also be considered when developing access procedures and security for control rooms and cabinets.

‐ Housekeeping: beware of foreign objects, waste and swarf in, around or above the cabinets. Some may tolerate environmental conditions differently from the actual equipment. Some materials could be ignited by heat. Some could obstruct ventilation. Some could be conductive and provoke short circuits. Also make sure that doors can open correctly, and keep the working spaces around the cabinets tidy.

If the frequency of failures in our plant is clearly higher than the theoretical failure rate, it is worth investigating the topics listed above.

Do you suffer too many DCS failures? A shortcut assessment

A standard industrial PLC is considered to have a failure rate of 5 failures / 10^6 hours (as per the table included above). This number is given for a complete system composed of 1 CPU + 2 I/O cards.

This gives 5/3 failures / 10^6 hours for each card, which makes 0.0146 failures/year for each hardware module. If your complete system is composed of 100 modules you should then expect about 1.5 failures/year.

Too rough? It is. But you should consider the result purely qualitatively, and look at the order of magnitude only.

If you experience 5 failures/year you are still in the same order of magnitude.

If you experience e.g. 1 failure/month you should then look for some external failure cause.
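The shortcut arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function and parameter names are invented for the example, and the default values are the ones quoted in the text (5 failures / 10^6 hours for a reference system of 3 modules, 100 installed modules).

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def expected_failures_per_year(
    system_failures_per_1e6_h=5.0,   # declared rate for the reference system
    modules_in_reference_system=3,   # 1 CPU + 2 I/O cards
    installed_modules=100,           # modules in the plant being assessed
):
    """Rough expected number of module failures per year for the whole system."""
    per_module_per_hour = system_failures_per_1e6_h / modules_in_reference_system / 1e6
    return per_module_per_hour * HOURS_PER_YEAR * installed_modules


expected = expected_failures_per_year()
for observed in (5, 12):  # 5 failures/year and 1 failure/month
    ratio = observed / expected
    print(f"{observed} failures/year observed, {expected:.2f} expected, ratio {ratio:.1f}")
```

With the quoted figures the expected value is about 1.46 failures/year, so 5 failures/year is roughly 3.4 times the expectation and 1 failure/month roughly 8.2 times. Where to draw the line between "same order of magnitude" and "investigate" remains a judgement call.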

Planning preventive PLC / DCS maintenance

For an optimal planning of PLC and DCS preventive maintenance we should also balance the value of any check and inspection against the risk associated with it. In fact, some tests require temporarily forcing the system into a degraded functionality, and this may not be acceptable: e.g. we do not expect you to test the CPU backup switch‐over while the plant is in operation.

We could split our preventive maintenance activities like this, assuming we already completed FAT, SAT and commissioning:

‐ Periodical maintenance during plant operation: it will include all checks that can be safely done during plant operation, and all checks which MUST be done only during operation.

‐ Periodical maintenance during plant shut down

‐ Periodical Proof test: it is normally required for SIS protective equipment, with a specific minimum frequency (with interval T1) to maintain the desired reliability target. It can be perfect (test of 100% of functionality) or non‐perfect (test of less than 100% of functionality). In both cases it is a key parameter in the average PFD estimation during the system lifecycle. The Proof Test procedure should clearly define the associated plant conditions: in operation or shut‐down.
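To see why the perfect / non‐perfect distinction matters for the average PFD, here is a minimal Python sketch of a commonly used simplified approximation for a single component: the share of failures covered by the proof test is averaged over T1, and the uncovered share over the whole mission time. The formula choice, the function name and the numbers (failure rate, 90% coverage, 15‐year mission time) are assumptions for illustration, not values from a standard or a data sheet.

```python
def pfd_avg_non_perfect(lambda_du, t1_hours, coverage, mission_hours):
    """Simplified average PFD with a proof test that reveals only part of the failures.

    coverage is the fraction of dangerous undetected failures found by the
    proof test (1.0 means a perfect, 100% proof test).
    """
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    tested_part = coverage * lambda_du * t1_hours / 2.0
    untested_part = (1.0 - coverage) * lambda_du * mission_hours / 2.0
    return tested_part + untested_part


LAMBDA_DU = 2e-6          # assumed failures per hour
T1 = 8760                 # yearly proof test, in hours
MISSION = 15 * 8760       # assumed 15-year lifecycle, in hours

perfect = pfd_avg_non_perfect(LAMBDA_DU, T1, 1.0, MISSION)
partial = pfd_avg_non_perfect(LAMBDA_DU, T1, 0.9, MISSION)
print(f"perfect proof test:     {perfect:.5f}")
print(f"non-perfect (90%) test: {partial:.5f}")
```

Under these assumptions, the 10% of failures never revealed by the test contributes more to the average PFD than the 90% that is tested every year, which is why the proof test coverage should be stated together with T1.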

Here is a suggested preventive maintenance plan for a BPCS (basic process control system), without need of SIL certification:

| Activity | Plant in Operation | Shut‐Down Plant | Proof Test at T1 interval |
|---|---|---|---|
| Visual inspection of cabinet external and internal conditions (housekeeping, signs of deterioration, correct opening and access space, etc.) | Every 6 months | At every planned Shut‐Down | Not applicable |
| Visual inspection of system composition and layout (compared to specs) | | | |
| Verify environmental conditions | AT LEAST every 6 months or continuously by data logging | At every planned Shut‐Down or continuously by data logging | Not applicable |
| Cleaning of mechanical devices (peripherals, fans, filters, etc.) | Every 6 months (actual period can be “tuned” based on past experience and control room cleanliness) | At every planned Shut‐Down | Not applicable |
| Check of power supply backup swap | NO, unless strictly necessary (*) | At every planned Shut‐Down | Not applicable |
| Check quality of power supply and grounding (voltage, oscillations, backup batteries, etc.) | | | |
| Periodical replacement of power supply transformer etc. | | According to manufacturer schedule | Not applicable |
| Check redundancy and swap of CPU, I/O, communication cards, etc. | | At every planned Shut‐Down if Proof Test is not planned | Not applicable |
| Backup copy of application software | At every modification (for SIS) or as per backup schedule (DCS, etc.) | NO | Not applicable |
| Firmware upgrade | NO, unless strictly necessary (*) | If required | Not applicable |
| Application software and hardware documentation update | At every modification | At every modification | Not applicable |
| Maintenance activity log | At every action, planned or exceptional | At every action | Not applicable |
| Verify functionality of the engineering station | | At every Plant Shutdown | Not applicable |
| Verify electrical components (fuses, relays, terminals, etc.) | | | Not applicable |
| Other checks and improvements analysis | In case system or conditions changes are promoted or detected, and possibly before every planned plant shutdown. | Planned plant shutdowns are a good opportunity for changes and improvements | Not applicable |
| Review of maintenance and security procedures | AT LEAST in case of accident | Planned plant shutdowns are a good opportunity for changes and improvements | Not applicable |
| SW / HW Functionality of applications | | Allowed. Not essential. | Not applicable |
| SW / HW Functionality of HMI (keyboards, push buttons, graphic displays, etc.) | NO, unless strictly necessary (*). An exception: the lamps test is normally allowed and harmless. | Allowed. Not essential. | Not applicable |
| System Performances (CPU load, system memory, etc.) | Some features are accessible only during plant operation. Log if possible. | Some features are accessible only during plant shutdown | Not applicable |

Note (*): actions which are not recommended as preventive actions during plant operation may nevertheless be required in case of breakdown maintenance.
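The periodic part of such a plan is easy to track in software. The sketch below, in Python, takes three of the "every 6 months" activities from the plan above and lists those whose next due date has passed. The dictionary, the shortened activity names and the dates are hypothetical and only serve the example; a real plant would normally rely on its maintenance management system.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` after `start`, clamping the day to the month length."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Hypothetical extract of the plan: activity -> interval in months (plant in operation).
PLAN_IN_OPERATION = {
    "Visual inspection of cabinet": 6,
    "Verify environmental conditions": 6,
    "Cleaning of mechanical devices": 6,
}


def overdue_activities(last_done, today):
    """Return the sorted activities whose next due date is on or before `today`."""
    return sorted(
        activity
        for activity, months in PLAN_IN_OPERATION.items()
        if add_months(last_done[activity], months) <= today
    )


# Example dates, invented for illustration.
last_done = {
    "Visual inspection of cabinet": date(2024, 1, 10),
    "Verify environmental conditions": date(2024, 6, 1),
    "Cleaning of mechanical devices": date(2024, 3, 15),
}
print(overdue_activities(last_done, date(2024, 9, 20)))
```

Keeping the interval per activity in one place also makes it simple to "tune" the period based on past experience, as suggested for the cleaning activity.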

Here is the equivalent preventive maintenance plan for a SIL‐rated PLC protection system (HIPPS, ESD, etc.):


| Activity | Plant in Operation | Shut‐Down Plant | Proof Test at T1 interval |
|---|---|---|---|
| Visual inspection of cabinet external and internal conditions (housekeeping, signs of deterioration, correct opening and access space, etc.) | Every 6 months | At every planned Shut‐Down | Yes (**) |
| Visual inspection of system composition and layout (compared to specs) | Every 6 months | At every planned Shut‐Down | Yes (**) |
| Verify environmental conditions | AT LEAST every 6 months or continuously by data logging | At every planned Shut‐Down or continuously by data logging | Yes |
| Cleaning of mechanical devices (peripherals, fans, filters, etc.) | Every 6 months (actual period can be “tuned” based on past experience and control room cleanliness) | At every planned Shut‐Down | Yes |
| Check of power supply backup swap | | At every planned Shut‐Down | Yes (**) |
| Check quality of power supply and grounding (voltage, oscillations, backup batteries, etc.) | Every 6 months | At every planned Shut‐Down | Yes |
| Periodical replacement of power supply transformer etc. | | | Yes |
| Check redundancy and swap of CPU, I/O, communication cards, etc. | | At every planned Shut‐Down if Proof Test is not planned | Yes (**) |
| Backup copy of application software | At every modification (for SIS) | NO | Not essential. Changes should be managed properly (reviewing SIL impact). |
| Firmware upgrade | NO, unless strictly necessary (*) | If required | Not essential. Changes should be managed properly (reviewing SIL impact). |
| Application software and hardware documentation update | At every modification | At every modification | Yes. Changes should be managed properly (reviewing SIL impact). |
| Maintenance activity log | At every action, planned or exceptional | At every action | Yes. For SIL certified systems this is mandatory. |
| Verify functionality of the engineering station | | Allowed | Not essential. Changes should be managed properly (reviewing SIL impact). |
| Verify electrical components (fuses, relays, terminals, etc.) | | | Yes (**) |
| Other checks and improvements analysis | In case system or conditions changes are promoted or detected, and possibly before every planned plant shutdown. | Planned plant shutdowns are a good opportunity for changes and improvements | AT LEAST in case of accident. Changes should be managed properly (reviewing SIL impact). |
| Review of maintenance and security procedures | AT LEAST in case of accident | Planned plant shutdowns are a good opportunity for changes and improvements | AT LEAST in case of accident |
| SW / HW Functionality of applications | | Allowed | Yes (**) |
| SW / HW Functionality of HMI (keyboards, push buttons, graphic displays, etc.) | NO, unless strictly necessary (*). An exception: the lamps test is normally allowed and harmless. | Allowed. | Yes (**) |
| System Performances (CPU load, system memory, etc.) | Some features are accessible only during plant operation. Log if possible. | Some features are accessible only during plant shutdown | Yes (**) |

Note (*): actions which are not recommended as preventive actions during plant operation may nevertheless be required in case of breakdown maintenance.

Note (**): detailed procedures can refer to (or even fully match) the Factory Acceptance / Site Acceptance Test procedures. These should in fact be considered the correct procedures to test functionality at 100%. A proof test can also deliberately shorten some test procedures and target less than 100% (a so‐called non‐perfect proof test).

Picture 1: US Navy Construction Electrician checks dirty and corroded wires and circuit breakers from a generator damaged by water during the Tsunami that hit Indonesia on Dec. 26, 2004 (Public Domain photo by United States Navy, ID 050125‐N‐9712C‐001).

Picture 2: A clean cabinet filter (left) against a dirty cabinet filter (right). (photo Ecisgroup SpA)
