
Relay Performance Index for a

Sustainable Relay Replacement Program

Aaron Feathers, Abesh Mubaraki, Ana Nungo, Nai Paz

Pacific Gas & Electric Company

Abstract - Many utilities now face a complex tangle of

challenges in managing the performance and reliability of

multiple generations of relays in service, as they formulate the

best strategy for designing and sequencing replacements.

The paper explores the challenges of creating a sustainable

relay upgrade and replacement program for a large utility with

a mixed fleet of new and old relay installations. PG&E has a

fleet of 35,000 relays from 12kV through 500kV. The relay fleet

is comprised largely of electromechanical and microprocessor

based relay types, with a small number of solid state. Budget

and resource limitations prohibit a strictly age based relay

replacement strategy. The paper focuses on using data analytics

to assess the risk of each relay in the fleet, based on a relay

health score and a criticality score, and on how these scores are

used as a basis for replacement prioritization.

Various relay risk factors are analyzed and weighted

including failure rate, misoperation rate, age, relay class,

scheme type, bus configuration, and customer count. The paper

will detail how each factor is weighted and the basis for the weighting.

The characteristics of different relay types or generations,

including longevity and failure modes, will be discussed along with

their impact on the replacement strategy.

Tools for managing the protective relay fleet (asset,

maintenance, and configuration databases) will be described, along

with analysis of the data they capture.

The paper will show PG&E relay fleet data for the last 6

years and how the relay replacement strategy has affected relay

fleet statistics and performance.

Protective system asset owners can benefit from the fleet

management strategies presented to meet the demands of

today’s operating and replacement pressures, including long

range replacement plans and asset end-of-life decisions.

I. BACKGROUND

Pacific Gas and Electric Company (PG&E) is one of the

largest combination electric and gas utilities in the United

States. It serves about 15 million customers in northern and

central California. Approximately 20,000 employees serve its

70,000 square mile territory. It is a vertically integrated

utility with Generation, Transmission and Distribution assets.

Its transmission system is made up of approximately 18,300

miles of 500, 230, 115, 70 and 60 kV lines. Total number of

substations is approximately 860, with 35,000 relays.

Approximately 50 percent of the relays are microprocessor

type and 50 percent are electromechanical type.

The PG&E System Protection department started monitoring

relay performance (failures) in 2008, spurred by

quality issues with newer relays from a specific

manufacturer. Shortly thereafter discussions began taking

place regarding the aging PG&E relay fleet and concerns

over large numbers of relays reaching the end of expected life

for both first generation microprocessor relays and also a

large number of very old electromechanical relays. In the last

6 years PG&E has been accumulating relay performance and

relay fleet data and refining how this data is analyzed to

create a sustainable relay replacement strategy, as one leg of

a sustainable relay asset strategy.

PG&E and Quanta Technology co-authored a 2012

WPRC paper titled “Creating a Sustainable Protective Relay

Asset Strategy” which outlined the many facets of such a

strategy [1]. This paper is an extension of that one focusing

specifically on the relay asset and performance data, and how

PG&E has been analyzing and using this data over the last

two years to create a sustainable relay replacement strategy.

II. INTRODUCTION

Many utilities now face challenges in managing the

performance and reliability of multiple generations of

protective relays. Developing a sustainable relay replacement

strategy is necessary to maintain reliability of these devices.

This can be a simple or complex task depending on how a

utility approaches this challenge. A simple age based relay

replacement strategy may be effective, but may also be the

most costly. Competing budget priorities for limited funding,

resource limitations, and the large numbers of relays reaching

end of life may prohibit a strictly age based replacement

program, which was the case for PG&E.

To develop a sustainable relay replacement strategy, the

responsible team needs to:

Understand the differences in relay types or

generations

Assemble and track relay asset, maintenance, and

failure/performance data

Develop tools to analyze the data

Develop and deploy a strategy

The following sections address each of these topics. The

paper demonstrates how the PG&E team developed a relay

performance index and a sustainability model that forecasts

expected failures, life expectancy, survivability and average

age over time to help determine the replacement strategy.

III. CHARACTERISTICS OF RELAY GENERATIONS

Protection systems have evolved from assemblies of

single function electro-mechanical relays to complex

multifunctional microprocessor relays. Relay technology has

migrated from electromechanical to solid-state and then

microprocessor based devices. Each of these relay classes

has different characteristics which must be recognized for

creating an effective relay asset strategy. Expected life or

practical life of the devices, failure modes, and maintenance

requirements are among the differences to understand.


A. Electromechanical (EM) Relays

Electromechanical relays are the roots of system

protection. Typical life expectancy of these devices can be

up to 40 plus years and there are still many that are in service,

some up to 70 years old at PG&E. The oldest of these are

simple overcurrent relays that continue to perform. More

complex relays like distance relays with metal-can bathtub

capacitors have a shorter life as these components wear out.

One characteristic of electromechanical relays is that

they fail silently. An electromechanical relay failure is not

evident until it is discovered either through routine

maintenance, or following an operation inquiry due to relay

misoperation. The EM relays don’t typically fail but rather

drift and require calibration during routine maintenance. As

parts wear out in the relay it may not be able to be calibrated

within acceptable parameters and is considered failed.

Panel design for EM relays has a large number of

discrete relays, one per phase or zone. This yields some

inherent redundancy in EM relay schemes for relay failure.

B. Solid-State (SS) Relays

Solid state relays are the bridge between the

electromechanical relay generation and current

microprocessor relay generation. Solid state relays have an

expected life of about 20 years. Most of the solid state relay

generation are at the end of their expected life and will be

replaced prior to many of their electromechanical counterparts,

which have a longer service life. For this reason solid

state relays have been called a lost generation, while large

numbers of electromechanical and microprocessor relays

remain.

Solid state relays are typically set with dip switches and

dials, some more advanced types had menu driven settings,

while a few had software to interface with the device.

Software associated with these devices is typically not

supported by current computer operating systems. Solid state

relays at end of life may exhibit high failure and misoperation

rates due to electronic component failures. These failures

may not be evident until a routine test reveals the problem.

C. Microprocessor (MP) Relays

Microprocessor relays are the current generation of

protective relays. They began to be commonly applied in the

1990s. First generation microprocessor relays are reaching

end of life, which has spurred industry concern over their

replacement. Expected useful life of MP relays is about 15 to

20 years. MP relays are highly reliable until certain

components begin to reach end of life, such as electrolytic

capacitors in the power supplies. MP relay useful life may

also be limited by the capability to support the device;

configuration software for older relays may no longer operate

on newer computers. Useful life may also be impacted by

compliance driven requirements, such as NERC CIP, which

could drive replacement.

MP relays have vast arrays of functions and capabilities.

A single MP relay can replace an entire panel of EM relays.

To achieve redundancy a second MP relay is installed to

cover the single failure criterion. MP relays include self-

diagnostic capabilities and relay failure alarming. The relay

will disable itself and alarm for the majority of hardware or

software failures. MP relays do not require calibration like

an EM relay so routine maintenance is less labor intensive.

IV. PG&E RELAY FLEET

In 2008 PG&E System Protection department began

work on developing a sustainable relay asset management

strategy. This emphasis followed being given responsibility

as the relay asset owner. Concerns were emerging over the

aging MP relays in the PG&E fleet approaching end of life

with no long range plans in place to address replacement.

Figure 1 below shows the initial assessment of the PG&E

relay fleet in 2008. The number of relays installed each year

is shown, broken down by relay class. This was a light bulb

moment for PG&E, realizing large numbers of MP relays

were approaching 15 years in age which may be near end-of-

life for these devices. A large population of EM relays was

also approaching 40 years in age. Several questions arose.

How would these relays perform as they continued to age?

What is an appropriate service life for a MP relay, SS relay,

and EM relay? How many relays do we need to replace each

year to maintain reliability? Do we have enough resources

and budget available? These questions could not be

immediately answered.

Figure 1 Age distribution of PG&E relay fleet in 2008

PG&E’s current relay fleet is comprised of

approximately 35,000 relays. Broken down by relay class,

48% are MP, 41% EM, and 11% are SS. The EM and SS

relays as a whole are at their expected life, as shown by

the statistics in Figure 2. The fleet is slowly transitioning

from EM to MP relays, but large populations of both exist.

The MP relay fleet on average is young, which is bolstered by

a ramp up in newly installed MP relays in the last six years.


Figure 2 Summary of PG&E's relay fleet

If we look at the age distribution for the current relay

fleet (data through 2013) in Figure 3, you can see the ramp up

in newly installed MP relays. Contrasting with Figure 1, you

can also see the reduction in EM relays since 2008 by the

lower peaks.

Figure 3 Age distribution of relays in 2013

It should be noted that the large increase in newly

installed MP relays was not driven by a relay asset strategy.

Nearly all of the newly installed relays had other drivers,

such as capacity, reliability, third party interconnections,

SCADA/Automation, protection deficiencies, or compliance

requirements. Many of those projects included drop-in-place control

buildings, which replaced all relays in the existing control

rooms, regardless of age. This can be seen in Figure 4,

below. Relays that were replaced were not targeted by age

and had an even distribution across the entire relay fleet.

Figure 4 Relays replaced from 2008 to 2013

The effect of PG&E’s year by year relay replacements

on the fleet can be seen in the following figures. The relay

fleet is transitioning to MP type relays, with reduction in EM

and SS relays as shown in Figure 5. You can also see that the

overall number of relays in the fleet is being reduced, due to

the increase in multi-function MP relays that replace many

discrete EM relays.

Figure 5 Relay fleet profile by year

Figure 6 shows the average age of the relay fleet by relay

class. The EM relay average age is increasing each year

since minimal new EM relays are installed and the existing

EM fleet continues to age with whatever relays that remain.

The MP relay fleet average age is not increasing, but this is

skewed by the large increase in newly installed MP relays.

The older, first generation, MP relays continue to age and the

numbers of MP relays beyond 15 years in age continues to

increase as shown in Figure 7. The percent of relays beyond

expected life span is shown in Figure 8. Comparing Figure 7

and Figure 8 you can see that even though the number of EM

relays beyond 40 years is decreasing, the percent of relays

beyond 40 years is increasing for those relays that are left.

Key Questions:

Are the right relays being replaced?

What are the effects of non-targeted relay replacements?

Figure 6 Age of relay fleet trend

Figure 7 Number of relays beyond expected life span

Figure 8 Percent of relays beyond expected life span
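
The difference between the number and the percent of relays beyond expected life can be illustrated with a short calculation. The following is a minimal Python sketch and not PG&E's tooling: the function name `beyond_life_stats`, the record layout, and both sample fleets are hypothetical. The thresholds assume 40 years for EM and 15 years for MP as used above, and 20 years for SS as described in Section III.

```python
from collections import Counter

# Assumed expected life per relay class, in years.
EXPECTED_LIFE = {"EM": 40, "SS": 20, "MP": 15}

def beyond_life_stats(relays, as_of_year):
    """Return {relay_class: (count_beyond, percent_beyond)}.

    relays is an iterable of (relay_class, install_year) pairs.
    """
    total = Counter()
    beyond = Counter()
    for relay_class, install_year in relays:
        total[relay_class] += 1
        if as_of_year - install_year > EXPECTED_LIFE[relay_class]:
            beyond[relay_class] += 1
    return {c: (beyond[c], 100.0 * beyond[c] / total[c]) for c in total}

# Hypothetical fleets: old EM relays are removed between 2008 and 2013
# without regard to age, and new MP relays are added.
fleet_2008 = [("EM", 1960), ("EM", 1962), ("EM", 1965), ("EM", 1980),
              ("EM", 1985), ("EM", 1990), ("MP", 1995), ("MP", 2005)]
fleet_2013 = [("EM", 1960), ("EM", 1965), ("EM", 1990),
              ("MP", 1995), ("MP", 2005), ("MP", 2012)]

stats_2008 = beyond_life_stats(fleet_2008, 2008)
stats_2013 = beyond_life_stats(fleet_2013, 2013)
print(stats_2008["EM"], stats_2013["EM"])
```

In this made-up example the count of EM relays beyond 40 years falls from three to two, while the share of the remaining EM population beyond 40 years rises, which is the pattern described for Figures 7 and 8.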

V. PROTECTION SYSTEM PERFORMANCE DATA

To analyze relay performance, PG&E currently tracks

relay failures, and classifies them as a trip or “safe-mode”

failure depending on whether the failure resulted in a

misoperation or a nonoperation. Trips may or may not result

in an outage that affects customers. While undesirable, a

failure resulting in a nonoperation is tolerable because the

multiple levels of redundancy designed into the system ensure

that the impact to equipment and safety is still minimal. In

contrast, a failure that causes a relay to misoperate is not

tolerable since it could cause an outage, affecting customers

and possibly system stability.

Over the course of six years (2008-2013) approximately

one percent of PG&E’s relay fleet has failed as shown in

Figure 9. The number of failures that resulted in trips is about

one tenth of one percent, yet relay failures contribute about

10% annually (5 year average) to the overall Substation

System Average Interruption Frequency Index (SAIFI) and

System Average Interruption Duration Index (SAIDI) indices

as shown in Table 1. In 2013, two major outages in the

PG&E system were a direct result of a misoperation caused

by relay failure. The graph shows that only a very small

percentage of relays fail, and of those that fail a high

percentage fail in a safe mode without any relay action or

resulting trip; the failure of about ten relays per year results in

a trip. However, for the small percentage of relays that fail

insecurely and cause a trip, the consequences can range from minimal

to severe depending on the protection scheme type and

location. If a method could be developed to specifically target

and replace these high risk relays, the benefits would be

substantial.

Figure 9 Relay failure performance

Table 1 Summary of relay failures that contributed to SAIFI and SAIDI

Year                                                  2009    2010    2011    2012    2013    5 yr. Avg.
Relay failures contribution to substation SAIFI (%)   11%     5.4%    8.9%    8.6%    18.1%   10.3%
Relay failures contribution to substation SAIDI (%)   5.1%    2.7%    9.6%    6.5%    33.3%   10%

Figure 10 below shows PG&E relay failures from 2008

to 2013 broken down by relay class. It can be seen that the

vast majority of reported relay failures are MP relay types.

EM relay failures are silent and remain unknown until found during

maintenance or an operation inquiry, so they are underreported. SS

relay failures reported are few, but significant considering the

small population base. A large percentage of the SS relay

failures are reported because of misoperations caused by aging

components in devices near end of life.

PG&E has seen a decreasing trend in reported relay

failures. Some of this has been achieved by working closely

with relay manufacturers to address manufacturing defects,

relay firmware issues, and service advisories, such as the

proactive replacement of relay power supplies. However, the

large decrease cannot be entirely accounted for, and part of it may

be due to underreporting of relay failures. An uptick has

been seen in 2014 data, which is not shown here.

Figure 10 Relay failures 2008 to 2013

If any lesson can be learned from tracking relay failures

and the consequences, it is that not all MP relays will fail

securely, some will trip. MP relays have self-diagnostics that

will disable the relay for the majority of relay failures, but

some failure modes cannot be detected and may fool the relay

and cause a trip, such as an Analog-to-Digital module failure.

A large outage for PG&E in 2013 was caused by a single

MP relay failure in a low-impedance bus differential scheme.

The bus differential scheme had a separate MP relay for A, B

and C phases where each relay made independent bus

differential trip decisions. Following the outage, a corrective

measure was to add undervoltage supervision between the

bus differential relays to prevent a single relay failure from

causing a similar outage. For example, the A phase bus

differential relay must receive an A phase undervoltage

condition being monitored by the B or C phase relay. Each

relay monitors all three phases of the bus potential.

Figure 11 below shows how many PG&E relay failures had a

resulting relay trip (misoperation) and the number of

outages affecting customers that resulted.

Figure 11 Relay failure caused trips and resulting outages

Relay failure data can be looked at in finer granularity by

age of the devices, specific manufacturer or even model. The

confidence in the results of analysis will depend on the

amount and quality of available data. Figure 12 below shows

MP relay failures by the age of the relays for a single (most

applied) manufacturer in the PG&E system. The graph

shows increasing MP relay failure rates after 15 years.

Figure 12 MP relay failure rate
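
A failure rate by age, of the kind plotted in Figure 12, can be derived by dividing the failures recorded at each age by the number of relays in service at that age. The sketch below is a minimal Python illustration under that assumption; the function name `failure_rate_by_age` and all numbers are hypothetical and are not PG&E data.

```python
from collections import Counter

def failure_rate_by_age(population_by_age, failure_ages):
    """Return {age: failures per 100 relays in service at that age}.

    population_by_age maps relay age in years to the number of relays of
    that age; failure_ages lists the age of each failed relay.
    """
    failures = Counter(failure_ages)
    return {
        age: 100.0 * failures[age] / count
        for age, count in sorted(population_by_age.items())
        if count > 0
    }

# Hypothetical population and failure records for one manufacturer.
population = {5: 1000, 10: 800, 16: 400}
failure_ages = [5] * 5 + [10] * 4 + [16] * 8

rates = failure_rate_by_age(population, failure_ages)
for age, rate in rates.items():
    print(f"age {age:>2} years: {rate:.1f} failures per 100 relays")
```

Normalizing by the in-service population matters because the older age groups are usually smaller, so a raw failure count alone would understate how quickly the rate climbs with age.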

VI. ASSET AND FLEET MANAGEMENT STRATEGIES

Asset management strategies cover the entire spectrum

from simple to complex depending on the type of asset, and

nearly every strategy requires, to varying degrees, some form

of inspection, maintenance, and recordkeeping. One simple

strategy would be a “run to failure” (reactive) approach that

requires virtually no recordkeeping, and maintenance usually

happens after a failure is discovered; while this approach

would not be acceptable for protective devices, it

would be a reasonable way to manage office furniture or

lighting.

A simple and satisfactory relay fleet management

strategy would consist of age-based replacement. Other than

the recordkeeping required by regulations, this strategy

mainly depends on device installation data. However, as the

fleet size increases and/or funds are limited, asset managers

seek more proactive approaches. This increases the need for

proper inspection and monitoring, and requires adequate and

accurate recordkeeping of factors like performance, failure

rates, failure modes (through root-cause analysis), and

various other aspects of the asset.

Not all microprocessor relays will fail securely; some will

trip. This must be accounted for in the design of critical

schemes.

VII. RELAY ASSET DATABASE AND CHALLENGES

You cannot have an effective relay asset strategy without

data to determine and monitor the characteristics of the relay

fleet and its performance.

In today’s world Big Data is one of the latest buzzwords.

Executives want and expect to have more data and statistics

to gain insight into their business and to drive business

planning, especially asset strategy decisions which have a

large financial impact. Unfortunately, utilities are not the

best record keepers. The problem is the opposite of Big Data;

it is missing data, no data, or fragmented data dispersed in

multiple databases. There are no quick fixes or silver bullets

for correcting this deficiency. It is an area to strive for

continuous improvement that will require effort and diligence

to correct, likely over a long period of time.

When you have poor data quality, the data must be

scrubbed to analyze it, which is very inefficient. Multiple

databases with the same information introduce discrepancies

between different data sets. Missing information must be

gathered, or assumptions made to cover gaps, or useful data

extrapolated over the entire asset to cover gaps.

For PG&E, one thing that has improved the quality of

relay asset data is that the management of that data is now

under one organization. System Protection was assigned

responsibility for NERC PRC-005 compliance for relay

maintenance, and following that change the employees

responsible for SAP data entry for relays and their associated

maintenance plans were moved under the System Protection

organization. This provided direct control for data that is

critical to System Protection for both relay asset management

and relay maintenance. Prior to this the SAP relay data was

managed by the substation asset management organization

responsible for high voltage equipment, and there was a

disconnect and little interaction with System Protection

personnel who did not use SAP. Control of the relay asset

data has allowed the proper focus to be put on the importance

of this data and the ability to clean up and restructure the data

as needed.

Presently, PG&E relay asset data is dispersed in the

following databases.

SAP – Official company asset record and used for

work management, such as relay maintenance

triggers.

ASPEN Relay Database – System Protection relay

setting database.

Material Problem Report (MPR) Database – Used

for tracking relay failures. Entered by relay

technicians. Managed by Sourcing under System

Protection guidance.

Powerbase/RTS – Used for relay maintenance

records. Capturing the as-found condition of relays

when performing maintenance is especially

important as a data point for electromechanical relay

health and performance.

System Protection Deficiencies Database –

Nomination for relay replacements by Protection

Engineers.

Event Reporting System – Company outage

database. Contains detailed information for relay

misoperations including customer impact and

corrective actions.

Future plans are to consolidate databases where possible

and link information that is shared between databases to

eliminate duplication, improve efficiency, and minimize

discrepancies.

Figure 13 Database optimization
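
Until databases are consolidated, discrepancies between them can be surfaced by comparing the fields they share. The following Python sketch is illustrative only: it assumes both data sets can be keyed by relay serial number, and the function name `find_discrepancies`, the field names, and the sample records are hypothetical.

```python
def find_discrepancies(primary, secondary, shared_fields):
    """Compare two {serial_number: record} mappings field by field.

    Returns a list of (serial_number, field, primary_value, secondary_value)
    tuples; a record present in only one mapping is reported with the
    field name "missing record".
    """
    issues = []
    for serial in sorted(set(primary) | set(secondary)):
        if serial not in primary or serial not in secondary:
            issues.append((serial, "missing record", None, None))
            continue
        for field in shared_fields:
            a = primary[serial].get(field)
            b = secondary[serial].get(field)
            if a != b:
                issues.append((serial, field, a, b))
    return issues

# Hypothetical extracts from an asset record and a settings database.
asset_db = {
    "SN001": {"model": "Model X", "location": "Sub A"},
    "SN002": {"model": "Model Y", "location": "Sub B"},
}
settings_db = {
    "SN001": {"model": "Model X", "location": "Sub A"},
    "SN002": {"model": "Model Z", "location": "Sub B"},
    "SN003": {"model": "Model X", "location": "Sub C"},
}

issues = find_discrepancies(asset_db, settings_db, ["model", "location"])
for issue in issues:
    print(issue)
```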

A. Relay Asset Data

In order to analyze the relay asset data you need

consistently entered data. One method of ensuring

consistency for certain fields is to use pick lists. Consolidate

databases where possible or link databases to share data for

duplicate fields to eliminate discrepancies.

Create pick lists for key relay fields such as:

o Manufacturer

o Model (main type, not style or order code)

o Scheme or Function

o Relay Class

Other important relay data fields include:

o Location

o Element Being Protected

o Relay Style/Order Code

o Serial Number

o Install Date

o Manufactured Date

o Firmware Version

o Service Advisories/Status

o Removal/Retired Date
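
A pick list can be enforced at data entry so that free-text variants never reach the database. The sketch below shows one possible way to do this in Python; the class `RelayRecord`, the manufacturer names, and the subset of fields shown are hypothetical and are not PG&E's SAP schema.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class RelayClass(Enum):
    """Pick list for the Relay Class field."""
    EM = "Electromechanical"
    SS = "Solid State"
    MP = "Microprocessor"

# Hypothetical pick list; a real one would hold the approved manufacturers.
MANUFACTURERS = frozenset({"Vendor A", "Vendor B"})

@dataclass
class RelayRecord:
    location: str
    manufacturer: str
    model: str
    relay_class: RelayClass
    serial_number: str
    install_date: date

    def __post_init__(self):
        if self.manufacturer not in MANUFACTURERS:
            raise ValueError(f"manufacturer not in pick list: {self.manufacturer!r}")
        # Enum lookup raises ValueError for any value outside the pick list.
        self.relay_class = RelayClass(self.relay_class)

record = RelayRecord("Sub A", "Vendor A", "Model X", "Microprocessor",
                     "SN001", date(2010, 5, 1))
print(record.relay_class)
```

Rejecting an entry such as "micro-processor" at the point of entry is cheaper than scrubbing the data later, which is the inefficiency described earlier in this section.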

B. Failure Data

Relay failure data is a key indicator for relay

performance and health. Microprocessor relays alarm when

they fail and this failure data can be captured if processes are

put in place. However there may be little relay failure data

available for electromechanical relays, which drift out of

calibration and fail silently.


Some key fields for relay failure data, in addition to

those already listed above, include:

Failure Date

Complaint/Description of Problem

Cause

Correction

Relay Trip/Misoperation (Yes/No)

Return Material Authorization (RMA) number from

manufacturer

Create pick lists for fields such as:

o Failed Component (power supply, CPU,

etc.)

o How Discovered (E.g., During

Maintenance, Alarm, Installation, Station

Inspection, Operation Inquiry, etc.)

C. Maintenance/Repair Data

Applying analytics to relay asset data will not only help

you to understand the information contained within the data,

but it will also help identify the data that is most important to

the business and future business decisions. It may also

identify data that is missing, which may require update of

processes and procedures to capture this new information.

For PG&E, one of the missing pieces of data that needs to be

captured is the as-found condition of relays when performing

routine relay maintenance. This is needed as a data point for

electromechanical relay health. It was found there was very

little relay failure data for electromechanical relays since they

fail silently, unlike microprocessor relays which alarm when

failed.

NERC PRC-005-2 allows an entity to use Performance

Based Maintenance to extend maintenance intervals as long

as Countable Events are kept within prescribed levels. Even

if an entity only uses Time Based Maintenance, tracking

Countable Events is a valuable data point for relay health [2].

Tracking the as-found condition, whether the relay is

functioning properly and within calibration, provides a good

data point for electromechanical relay performance and

health. Through a utility peer review, it was noted the policy

for one large utility was to replace a relay if it was found out

of calibration on two consecutive performances of routine

maintenance.
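
Once as-found results are recorded, a rule like the one noted in the peer review is simple to apply. The sketch below is a minimal Python illustration of that rule as described above; the function name `flag_for_replacement` and the list-of-booleans history format are assumptions made for the example.

```python
def flag_for_replacement(as_found_history):
    """Return True if a relay was out of calibration on two consecutive
    routine maintenance visits.

    as_found_history lists the as-found results in chronological order,
    with True meaning the relay was found within calibration.
    """
    pairs = zip(as_found_history, as_found_history[1:])
    return any(not first and not second for first, second in pairs)

# Hypothetical maintenance histories for three relays.
print(flag_for_replacement([True, False, False]))   # two consecutive misses
print(flag_for_replacement([False, True, False]))   # misses not consecutive
print(flag_for_replacement([True]))                 # only one visit recorded
```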

VIII. RISK DEFINITION

In order to further improve safety and reliability of the

utility services, regulators are interested in adopting risk

based, or risk informed, decision making for investments.

This would be in addition to various other factors currently

considered in the Rate Case [3]. The goal is to optimize the

cost allocated to safety and maximize the safety benefits / risk

reduction. To achieve this, a method needs to be established

that would allow the analysis and comparison of risk

associated with different asset classes on a common scale.

However, there are several challenges that need to be

overcome.

To better understand the challenges let’s start with

examining risk. One definition of risk according to Merriam-

Webster is “the possibility that something bad or unpleasant

(such as an injury or a loss) will happen” [4]. Based on this

definition it can be concluded that there are two aspects of

risk: (1) the probability (possibility) of a “bad” event

occurring, e.g. a relay failure, and (2) the impact (severity)

caused by the event, e.g. a misoperation leading to an outage.

For the purpose of this paper risk is defined as follows:

The probability of an event occurring can be determined

with relative ease if proper and accurate asset performance

data is available to analyze trends, but the current data quality

is marginally acceptable; however, moving forward steps can

be implemented to improve data quality. On the other hand,

determining an event’s potential impact to the system is not

as straightforward because numerous complex factors need to

be considered, such as: the state of the system, the type and

function of device that failed, the failure mode, etc.

IX. RELAY PERFORMANCE INDICES

In order to identify the worst performing relays in the

most critical areas of the system the following scoring

methodology was developed. Two main indices are used to

assess the entire fleet of 35K relays: Criticality Score and

Health Score. These two indices are decoupled; however, a

final score can be created by a weighted combination.

Criticality Score – represents the potential impact that a

relay failure and/or misoperation can have on the system; a

higher score implies a greater impact. This score is

independent of the condition of the relay; it represents the

residual risk of any device in a particular part of the system

performing a particular function. For example: a bus

differential relay at an important substation protecting a

single bus single breaker would have a higher criticality score

than a line differential relay at a flip-flop station.

Countable Event – A failure of a

component requiring repair or replacement,

any condition discovered during the

maintenance activities in {PRC-005-2}

Tables 1-1 through 1-5 and Table 3 which

requires corrective action, or a

Misoperation attributed to hardware failure

or calibration failure. Misoperations due to

product design errors, software errors,

relay settings different from specified

settings, Protection System component

configuration errors, or Protection System

application errors are not included in

Countable Events.


The Substation Tier is a PG&E internal ranking that

groups substations into six tiers based on various factors such

as load and customers served, electrical location, power flow

path limits, etc. Table 2 lists the weight given to relays

located at a substation of a particular tier. The weights

assigned to protection Scheme and Bus Configuration are

shown in Table 3 and Table 4 respectively.

Table 2 Weight assigned to various Substation Tier

Substation Tier Count Weight

T1 3128 10

T2 873 8

T3 1029 6

T4 3721 4

T5 2004 2

Other 23291 1

Table 3 Weight assigned to various protection Schemes

Scheme Count Weight

A.C. Undervoltage 54 1

Annunciation 52 1

Automatics 4098 4

Breaker BU / Breaker Failure 2394 10

Bus Protection 2324 10

Bus Reactor 62 3

Capacitor Control 5 1

Capacitor Protection - Series 117 5

Capacitor Protection - Shunt 252 2

Condenser Protection 133 4

Current Balance 4 6

D.C. Undervoltage 9 9

Digital Fault Recorder 2 1

Direct Transfer Trip 701 7

Directional Comparison 1650 5

Directional Distance 2410 4

Directional Overcurrent 1362 2

Frequency Load Shedding 229 5

Line Current Differential 263 6

Non-Directional Overcurrent 10419 1

Other or Unknown 9 1

Phase Comparison 23 6

Power Load Shedding 2 2

PW Current Differential 437 6

RAS 778 8

Reactor Protection - Shunt 88 3

Regulator Protection 28 5

SCADA 231 1

Special Protection Scheme 268 7

Transformer Bank - Large 1635 6

Transformer Bank - Medium 3509 5

Transformer Bank - Small 419 5

Voltage Load Shedding 79 7

Table 4 Weight assigned to various Bus Configurations

Bus Config. Count Weight

BAAH 2471 1

DBDB 1869 1

DBSB 6990 10

DBSB & DBDB 67 5

FFLOP 870 6

LOOP 4520 5

M/A 6002 7

M/A & DBDB 178 8

M/A & DBSB 247 9

M/A & SBSB 77 8

N/A 857 1

Ring 1196 1

SBSB 1680 4

Synch Cond 28 1

Tap 956 4

Unknown 6038 2
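
The weights in Tables 2 through 4 can be combined into a single criticality value in several ways. The Python sketch below assumes an equally weighted average of the three factor weights, which is an assumption made only for illustration and is not stated in this text; the function name `criticality_score` is hypothetical, and only a few rows of each table are reproduced.

```python
# Subsets of the weights listed in Tables 2, 3, and 4.
TIER_WEIGHT = {"T1": 10, "T2": 8, "T3": 6, "T4": 4, "T5": 2, "Other": 1}
SCHEME_WEIGHT = {
    "Bus Protection": 10,
    "Line Current Differential": 6,
    "Non-Directional Overcurrent": 1,
}
BUS_CONFIG_WEIGHT = {"DBSB": 10, "FFLOP": 6, "SBSB": 4, "Ring": 1, "BAAH": 1}

def criticality_score(tier, scheme, bus_config,
                      factor_weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine the three factor weights into one score on a 1 to 10 scale.

    The equal factor_weights are an assumption, not the paper's values.
    """
    w_tier, w_scheme, w_bus = factor_weights
    return (w_tier * TIER_WEIGHT[tier]
            + w_scheme * SCHEME_WEIGHT[scheme]
            + w_bus * BUS_CONFIG_WEIGHT[bus_config])

# The comparison given in the text, with both relays assumed to be at T1 stations.
bus_diff = criticality_score("T1", "Bus Protection", "SBSB")
line_diff = criticality_score("T1", "Line Current Differential", "FFLOP")
print(round(bus_diff, 2), round(line_diff, 2))
```

Under these assumed factor weights the bus differential relay on a single bus single breaker arrangement scores higher than the line differential relay at a flip-flop station, consistent with the example given for the Criticality Score.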

Health Score – represents the condition of the relay and

correlates to the probability of failure; a higher score implies

a greater likelihood of failure. The health score is calculated

as follows:

The Performance is tracked on a relay make and model

basis using the historic issues available for the population of a

particular relay make and model. Currently PG&E tracks

different performance issues in different databases; for

example, calibration issues found during maintenance are

tracked in a different database than the in-service failures.

Thus information from various databases had to be combined

to calculate the performance. Some other factors to consider

in improving the Performance metric would be: the

availability of spare parts, lead time for replacement units,

ease of replacement, service advisories etc.

( )

Where:

Each DataBase Entry and Failure was weighted the

same, but due to the unacceptable consequences of failures

that resulted in trips, these were separated and further

weighted by a factor of 10. The Normalizing Factor is used to

compensate for the differences in population among the

various relay make/model, and to ensure that the Performance

metric has a value between 0 and 10.
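
The weighting described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's exact expression: database entries and non-trip failures count once, failures that resulted in trips count ten times, the total is divided by the make/model population, and the result is scaled so the worst make/model scores 10. The per-population division and the scale-to-worst step are assumed forms of the Normalizing Factor, and all names and numbers are hypothetical.

```python
def weighted_issue_rate(db_entries, failures, trip_failures, population):
    """Weighted issues per relay for one make/model.

    failures counts failures that did not result in a trip;
    trip_failures are weighted by a factor of 10.
    """
    return (db_entries + failures + 10 * trip_failures) / population

def performance_scores(models):
    """Map {model: (db_entries, failures, trip_failures, population)}
    to {model: Performance value between 0 and 10}."""
    rates = {name: weighted_issue_rate(*values) for name, values in models.items()}
    worst = max(rates.values())
    if worst == 0:
        return {name: 0.0 for name in rates}
    # Assumed normalization: the worst make/model defines the top of the scale.
    return {name: 10.0 * (rate / worst) for name, rate in rates.items()}

# Hypothetical issue counts for two relay models.
models = {
    "Model X": (20, 5, 1, 100),
    "Model Y": (4, 1, 0, 200),
}
scores = performance_scores(models)
print({name: round(score, 2) for name, score in scores.items()})
```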

The Age information of individual relays is also incorporated into
the Health Score. The goal here is to sort the individual relays
within the population of the same make/model; hence it only makes up
15% of the score. The weight given to the age of electromechanical
relays is different from that given to solid-state and
microprocessor relays, as seen in Table 5 and Table 6.

Table 5 Weight assigned to EM relay age

EM (years) Weight

> 40 10

30 - 40 8

20 - 29 6

10 - 19 3

< 10 1

Unknown 5

Table 6 Weight assigned to MP and SS relay age

MP or SS (years) Weight

> 20 10

15 - 20 8

10 - 14 6

5 - 9 3

< 5 1

Unknown 5
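
The age weighting in Table 5 and Table 6 can be expressed as a small lookup. The following is a minimal sketch, not PG&E's implementation: the function name `age_weight` and the class labels "EM", "MP", and "SS" are illustrative assumptions, and fractional ages that fall between the printed bands (for example 29.5 years) are assigned to the lower band.

```python
from typing import Optional

UNKNOWN_AGE_WEIGHT = 5  # both tables assign 5 when the age is unknown


def age_weight(relay_class: str, age_years: Optional[float]) -> int:
    """Return the age weight from Table 5 (EM) or Table 6 (MP and SS)."""
    if age_years is None:
        return UNKNOWN_AGE_WEIGHT
    if relay_class == "EM":
        # Table 5: > 40 -> 10, 30-40 -> 8, 20-29 -> 6, 10-19 -> 3, < 10 -> 1
        top, bands = 40, [(30, 8), (20, 6), (10, 3)]
    elif relay_class in ("MP", "SS"):
        # Table 6: > 20 -> 10, 15-20 -> 8, 10-14 -> 6, 5-9 -> 3, < 5 -> 1
        top, bands = 20, [(15, 8), (10, 6), (5, 3)]
    else:
        raise ValueError(f"unknown relay class: {relay_class!r}")
    if age_years > top:
        return 10
    for lower_bound, weight in bands:
        if age_years >= lower_bound:
            return weight
    return 1
```

A 40-year-old EM relay falls in the 30 - 40 band and receives a weight of 8, while a 21-year-old MP relay already receives the maximum weight of 10.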

Based on the preceding method the entire PG&E relay

fleet was scored, and the Total Score for the entire fleet

sorted in descending order is shown in Figure 14. The key

take-away is that a very small percent of the population has a

high total score, and this is where the efforts of the targeted

relay replacement program should be focused. The blue curve

in Figure 15 represents the Total Score of the first 516 relays,

and the red data points represent the Total Score of the

corresponding relays if replaced with new relays; essentially

the Health Score is near zero and only the contribution of the

Criticality Score remains. An observation can be made that

there is a divergence between the actual score and the ideal

fleet within approximately the first hundred relays, thus

replacing these relays would provide the most “bang for the

buck”.

Figure 14 Total Score for the entire fleet of relays

Figure 15 Total Score of the poor performers

One drawback of the Total Score is that the values are

not intuitive, especially when trying to compare relays to

other asset classes. Therefore an effort was undertaken to

determine the annual probability of failure. The asset data

was analyzed to determine the annual failure rates based on

relay make and model. It was observed that a few relay types

did not have any records of failure, but were still scored as

poor performers due to many maintenance issues. Given the large
number of maintenance notifications, the assumption was made that
these relays required corrective action (recalibration, repair, or
nomination for replacement); therefore they should also be assigned
a failure rate based on Performance. Another challenge was that the small

population (often less than 20 units) for some relay types

resulted in significant variation of the failure rate compared

to the Performance – some extreme cases were excluded from

the analysis. Figure 16 shows the performance (blue curve)

on the left Y-axis and the failure rate (red points) on the right

Y-axis. The black curve is the logarithmic trendline that was

used to assign failure rates to individual relays.

Figure 16 Comparing the performance to failure rate

Using the annual expected failure rate for the relays in

the entire fleet, a risk heat map was created as shown in

Figure 17. The relays in the highlighted area in upper right

are the high risk relays, and the size of this area can be

adjusted to determine the number of relays to replace based

on constraints such as budget and the company’s risk tolerance.
Note: multiple data points (relays) overlap and appear as one, so in
order to determine the number of relays in the highlighted area, the
dataset needed to be filtered according to the criteria selected.
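
Because overlapping points hide the true count on the plot, the number of relays in the highlighted area has to come from filtering the underlying data. A minimal sketch of such a filter follows. It assumes the heat map axes are the annual failure rate and the Criticality Score; the `Relay` record, the field names, and the threshold values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Relay:
    relay_id: str
    annual_failure_rate: float  # expected failures per year (assumed axis)
    criticality: float          # Criticality Score (assumed axis)


def high_risk(relays, min_failure_rate, min_criticality):
    """Return the relays that fall in the upper-right region of the heat map."""
    return [
        relay for relay in relays
        if relay.annual_failure_rate >= min_failure_rate
        and relay.criticality >= min_criticality
    ]


# Hypothetical data: R1 and R2 share the same coordinates and would
# appear as a single point on the plot.
fleet = [
    Relay("R1", 0.08, 9.0),
    Relay("R2", 0.08, 9.0),
    Relay("R3", 0.01, 9.5),
    Relay("R4", 0.09, 2.0),
]
selected = high_risk(fleet, min_failure_rate=0.05, min_criticality=7.0)
print(len(selected))
```

Raising or lowering the two thresholds corresponds to resizing the highlighted area to match the budget or risk tolerance.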

Figure 17 PG&E relay fleet risk heat map

X. RELAY LIFE CYCLE REPLACEMENT PLAN

A. Sustainability Model Overview

A sustainability model that uses statistics to forecast

expected failures, life expectancy, survivability, and average

age over time has been developed to help determine the

replacement strategy. It is based on the following:

- Three models were developed, one per relay class
  (electromechanical (EM), solid state (SS), and microprocessor
  (MP)). Each model simulates every relay in the fleet for failure
  and replacement for 1000 Monte-Carlo iterations across a span of
  50 years using a custom script in Microsoft Excel.
- The relay model fleet was categorized by age. 18% of the fleet
  (EM: 4272, SS: 1457, MP: 708) did not have age data and are
  excluded from the simulation.
- Inputs to all models require a failure rate curve, replacement
  profile, replacement rate by Other work (work beyond System
  Protection’s replacement list), and a proactive replacement rate.
- All relays will be replaced with MP class relays.
- EM and SS models have no proactive replacement rate since they
  are not being replaced in kind. The failed relays and replaced
  relays are removed from the fleet. These removals are then used as
  inputs (additions) to the MP model.
- The SS and MP models will replace relays at age 20 years up to a
  maximum number defined by the user.
- The MP model simulates for failure and replacement within the
  microprocessor fleet and replaces each failed relay with a new
  unit. This model also takes the simulated removals from the EM and
  SS fleets and replaces them with new units.
- Historical data on relay removals was used to derive the
  replacement profiles by calculating best fit trend lines.
- Failure rate curves for the EM and SS models were estimated. The
  MP curve was calculated based on historical data.
- The algorithm for replacement is based on age and the replacement
  profile defined by the user. The replacement profile is compared
  to a random number that the model produces and the determination
  is made whether the relay will be replaced or not depending on
  where the random number falls.
- The algorithm for failure is based on the failure rate curve input
  defined by the user. It goes through a similar test with a random
  number generator like the replacement profile. However, it is
  unbiased to age.
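
The random-number test described above can be illustrated with a short Monte Carlo loop. This is a simplified sketch in Python rather than the Excel script the authors used. It covers only a single fleet that is replaced in kind (as in the MP model), and it omits the feed of EM and SS removals and the age-20 replacement cap. The function names and the coefficients of the two example curves are placeholder assumptions, not PG&E values.

```python
import math
import random


def simulate_fleet(start_ages, failure_rate, replacement_prob,
                   years=50, iterations=1000, seed=0):
    """Average failures per simulated year for a fleet replaced in kind.

    start_ages: starting age in years of each relay.
    failure_rate, replacement_prob: functions mapping age to an annual probability.
    """
    rng = random.Random(seed)
    failures = [0] * years
    for _ in range(iterations):
        fleet = list(start_ages)
        for year in range(years):
            for i, age in enumerate(fleet):
                if rng.random() < failure_rate(age):
                    failures[year] += 1
                    fleet[i] = 0        # failed relay replaced with a new unit
                elif rng.random() < replacement_prob(age):
                    fleet[i] = 0        # relay replaced by planned work
                else:
                    fleet[i] = age + 1  # relay survives and ages one year
    return [count / iterations for count in failures]


# Hypothetical input curves; the coefficients are placeholders, not PG&E values.
def example_failure_rate(age):
    return min(1.0, 0.005 * math.exp(0.05 * age))


def example_replacement_prob(age):
    return min(1.0, 0.04 + 0.002 * age)


average_failures = simulate_fleet([0, 5, 10, 15, 20] * 20,
                                  example_failure_rate, example_replacement_prob,
                                  years=10, iterations=200)
print([round(value, 2) for value in average_failures])
```

Averaging the failure counts over many iterations gives the expected failures per year, and the spread of the per-iteration counts gives a failure distribution for any single year.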

B. Failure Rates

Ideally, failure rate, λ, curves should come from

historical data the utility observes over a specific time period.

However, PG&E simply does not have sufficient data on

recorded failures for EM and SS relays, as 90% of failures are

from MP relays. Therefore, EM and SS relay failure rate

curves were estimated with conservative assumptions.

Table 7 Failure rate assumptions

The points in Table 7 were plotted and an exponential

best fit failure curve was calculated off the data points. The

estimated failure rate curves for EM and SS relays are plotted

below in blue and red respectively in Figure 18.

Figure 18 Failure curves by relay class
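
An exponential best fit of the kind described above can be obtained with a linear least-squares fit on the logarithm of the failure rate. The sketch below shows one common way to do this and is not necessarily the procedure the authors used; the function name `fit_exponential` and the sample points are hypothetical and are not the Table 7 values.

```python
import math


def fit_exponential(ages, rates):
    """Fit rate = a * exp(b * age) by least squares on ln(rate).

    All rates must be positive so that the logarithm is defined.
    """
    n = len(ages)
    logs = [math.log(rate) for rate in rates]
    mean_x = sum(ages) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Hypothetical (age, annual failure rate) points for illustration only.
sample_ages = [0, 10, 20, 30, 40]
sample_rates = [0.002, 0.003, 0.0045, 0.007, 0.010]
a, b = fit_exponential(sample_ages, sample_rates)
print(f"rate(age) = {a:.5f} * exp({b:.4f} * age)")
```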

The failure rate approximations based on the plotted data are:

PG&E has good and sufficient data on MP relay failures

over a six year time period to derive a failure rate curve. MP

relay failure data from 2008 to 2013 was summarized by age

and failure count. The MP annualized failure rate, λ_MP, is

then calculated as:

Figure 19 Failure rate as a function of age

The failure rate is graphed as a function of age in Figure

19. The blue data points (Series 1) depict the annualized

failure rate of the MP relays. The red curve (Series 2) is the

calculated best fit exponential approximation. The MP failure

rate approximation is:

C. Replacement Profiles

Relay replacement profiles were derived by analyzing

historical relay removals over a six year period from 2008-

2013.

- Relay removals were separated by relay class and then categorized
  by age.
- Removal counts and asset counts were obtained for each age
  category and a moving average fleet was calculated for each age
  category.
- The moving average fleet was calculated by summing the asset
  counts for the following 6 years and dividing by 6.
- The probability for replacement (annualized) was calculated by
  taking the removal count divided by the product of the moving
  average asset count and 6 years.
- The replacement probability was graphed as a function of age.
- A linear or exponential best fit trend line was calculated off the
  data points.
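
The annualized replacement probability for one age category can be worked through as follows. This is a minimal sketch of the arithmetic in the steps above; the function name and the example counts are hypothetical.

```python
def annual_replacement_probability(removal_count, yearly_asset_counts):
    """Annualized replacement probability for one age category.

    removal_count: total removals over the study window.
    yearly_asset_counts: asset count for the category in each year of the window.
    """
    years = len(yearly_asset_counts)
    if years == 0 or sum(yearly_asset_counts) == 0:
        raise ValueError("asset counts must cover at least one non-empty year")
    moving_average = sum(yearly_asset_counts) / years
    return removal_count / (moving_average * years)


# Hypothetical example: 36 removals over 6 years from a steady fleet of 100.
example = annual_replacement_probability(36, [100, 100, 100, 100, 100, 100])
print(example)
```

With 36 removals and an average fleet of 100 relays over 6 years, the result is 36 / (100 x 6) = 0.06, or a 6% chance of replacement per year for that age category.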

Figure 20 Replacement probability for EM relays

Figure 21 Replacement probability for SS relays

According to Figure 20 and Figure 21, the replacement trend for EM
and SS relays follows a somewhat linear pattern. The linear
replacement probability approximation results are:

Where:

( )

Figure 22 Replacement probability for MP relays

MP relays follow an exponential behavior as shown in Figure

22. The approximation calculated was:

D. Simulation Results from Sustainability Model

Electromechanical (EM) Model

The EM model results discussed in this paper are

simulated with a 4% annual removal rate by others and a zero

proactive replacement rate since EM relays are not replaced

in kind. 4% was selected because this ratio is closest to

historical PG&E replacement rates.

Figure 23 Fleet forecast for EM relays

Figure 24 Failure forecast for EM relays

The EM Fleet Waterfall Profile, Figure 23, and the EM

Failures and Forceouts, Figure 24, show that

electromechanical relays are a sustainable fleet. The blue

curve on the waterfall profile shows the fleet in year 1 and the

red curve shows the fleet in 30 years. There is no bubble in

the fleet profile and average EM relay failures are expected to

decrease over time (from 35 to less than 20 over 50 years) as

more relays are removed, aging units fail, and the population

decreases.

The number of expected failures forecasted by the model

is significantly greater than the reported failures PG&E

observes annually. Over the last six years, average reported

EM relay failures are about 2-3 a year and the model shows

approximately 35 a year. One can infer that EM relay failures

at PG&E are significantly under-reported. This is expected

as EM relays fail silently and do not have self-alarming

capabilities.

Figure 25 below shows a combination of two graphs: the

asset count histogram and the average age for the EM fleet.

This shows that the EM relays continue to age while the

population will drop significantly (red histogram in yr 30)

due to continued removals. No new relays are added, hence

the decline.

Figure 25 Asset count and average age forecast for EM relays

Solid State (SS) Model

The SS model results discussed in this paper are

simulated with a 4% annual removal rate by others, a zero

proactive replacement rate, and a maximum annual removal

rate of 100 relays at age 20 years or higher.

Figure 26 Fleet forecast for SS relays

Figure 27 Failure forecast for SS relays

The SS Fleet Waterfall Profile, Figure 26, and the SS

Failures and Forceouts, Figure 27, show that solid state relays

are a diminishing class of relays. In fact Figure 27 shows that

SS relays become extinct around year 20. Like EM relays, the

failures decrease over time but do so at a much sharper rate.

One can also conclude that there is some under-reporting of

SS relay failures. Over the last six years, average reported SS

relay failures are about 4-5 a year and the model shows there

are around 12 failures a year.

The combination graphs below, Figure 28, show that SS

relays are non-existent in 30 years. There is no red histogram

chart and the average age drops at year 20 and is 0 around

year 22. This aligns with the Failures and Force-outs graph.

The SS fleet is expected to die out some time between years

20-22.

Figure 28 Asset count and average age forecast for SS relays

Microprocessor (MP) Model

The MP model results discussed in this paper are

simulated with a 4% annual replacement rate by others, a

proactive annual replacement rate of 50 relays per year, and a

maximum annual replacement rate of 1500 relays at age 20

years or higher.

Figure 29 Fleet forecast for MP relays

Figure 30 Failure forecast for MP relays

The Fleet Waterfall Profile, Figure 29, shows that

microprocessor relays are a sustainable fleet. The fleet profile

in 30 years (red curve) closely follows the current fleet

profile (blue curve). MP relay failures, Figure 30, show an

increasing trend, from an average of 45 failures in

year 1 to 60 failures in year 50. This failure trend is expected

as the waterfall curve shows an increase in the population at

year 30, by more than one-third (38%).

The asset histogram portion of the combination graph, Figure 31,
depicts a fairly even distribution in the younger age ranges between
0-10 years in the MP fleet in year 30 (red histogram). There is an
increase in the overall population, and no relays older than 20
years exist in year 30. This is expected, as the model applies a
separate replacement policy to replace relays age 20 or older. The
number of relays in the fleet decreases as the units age. At around
year 13, there is a decline in each subsequent year. Ideally, a
fleet with fewer older relays and more new relays is sustainable and
desired.

The average age section of Figure 31 shows that there is an overall
slight increase in the average age (up 1 year over 50 years) with a
few dips along the way. The valleys in the curve can be attributed
to the separate replacement policy of removing relays 20 years or
older from the fleet. Replacing a large number of older relays with
new relays brings down the average age substantially.

Figure 31 Asset count and average age forecast for MP relays

Total Expected Failures Histogram (All Relay Classes)

The total probability for failures across all three classes

of relays in year 1 is shown in Figure 32. The highest

likelihood is that approximately 87 failures (3.7%

probability) will occur in year 1 with the chances decreasing

the further you move away from this peak.

Figure 32 Distribution of failure probability for all relay classes

Figure 33 Total failure forecast for all relay classes

Total Failures and Forceouts (All Relay Classes)

The overall number of failures shown in Figure 33, for

the entire relay fleet across all three classes of relays is

expected to decrease from approximately 90 in year 1 to less

than 80 in year 50. There are a couple of periods of increase,

which are primarily due to microprocessor relay failures

outnumbering both EM and SS relay failures. However, the

net result is a decrease in failures in year 50.

Life Expectancy and Useful Life

Asset life expectancy is the length of time until the asset

must be retired, replaced, or removed from service.

Determining when an asset reaches the end of its service life

generally entails consideration of the cost and effectiveness

of repair and maintenance actions that might be taken to

further extend the asset’s life expectancy [5]. The life

expectancy predictions from the relay models are listed in

Table 8. These predictions far exceed the life expectancies

typically associated with EM, SS, and MP relays at 40, 20,

and 15 years, respectively. These values are about half of the

model predictions and also exceed the useful life.

Table 8 Life expectancy of relay classes

Relay Class Life Expectancy (years)

Electromechanical 77

Solid State 38

Microprocessor 30

Useful life of an asset is typically shorter than the life

expectancy of an asset. It is the time period the asset is in

service to prevent a run to failure situation or a run to poor

condition that deems the asset non-functional. PG&E has not

yet performed an analysis to determine relay useful life. It is

an item for future strategy consideration. IEEE PSRC

Working Group I22 is currently working on a report titled

“Condition Assessment of P&C Devices” to determine the

end-of-useful-life for protection, control, and monitoring

devices including electromechanical, solid-state, and

microprocessor-based devices [6].

Recommendations based on the Sustainability Model

The results of the model suggest that the strategy should

consist of various factors. The model demonstrates that

PG&E relays are a sustainable asset if the replacement rate

continues at around 4%, a policy is in place to replace SS and

MP relays at age 20, and a proactive replacement program is

in place to target high failure rate models at critical locations.

XI. CONCLUDING REMARKS – GUIDING PRINCIPLES

The transition to a microprocessor relay fleet has made it

apparent that the need for a sustainable relay strategy is

paramount. MP relays do not outlive other assets (e.g., HV

circuit breakers and transformers) like EM relays do. The

relay strategy should be multi-faceted. It should strive to

improve the life-cycle cost of relays by balancing service life

with cost of failure while addressing safety, reliability, and

compliance.

One facet of the strategy is to continue with the work

plan by Others, i.e., work triggered by other business drivers

outside of System Protection needs such as capacity,

reliability, or modernization. More than 90% of PG&E’s

relay installations over the last five years were driven by

Other work. It is likely that this trend of relays being installed

by Other work will increase in the next five years as the

breaker and transformer replacement plan is expected to

increase. This should be coupled with a targeted replacement

program based on the relay performance index described in

Section IX and a tracking process to link the relays to

projects to provide visibility of future work affecting relays.

Another facet of the strategy is continued tracking of

system failure rates and failure analysis to understand if

standardization and relay vendors selected are effective.

In addition to the facets focused on in this paper, a

successful relay asset strategy should be multi-dimensional

and also include additional facets to support the

implementation of the strategy, such as simple and inclusive

design standards to facilitate relay replacement, a manageable

budget and work plan, and resources to support execution.

XII. REFERENCES

[1] J. Sykes, A. Feathers, E. Udren and B. Gwyn, "CREATING A

SUSTAINABLE PROTECTIVE RELAY ASSET STRATEGY," in Western Protective Relay Conference, Washington State University,

2012.

[2] NERC, "Standard PRC-005-2 - Protection System Maintenance".

[3] CPUC, "CPUC Proceeding to Develop a Risk-Based Decision-Making Framework to Evaluate Safety and Reliability Improvements and Revise the General Rate Case Plan for Energy Utilities," 22 Nov 2013. [Online]. Available: http://docs.cpuc.ca.gov/SearchRes.aspx?DocFormat=ALL&DocID=81856126. [Accessed 14 Aug 2014].

[4] Merriam-Webster, "Risk," [Online]. Available: http://www.merriam-webster.com/dictionary/risk. [Accessed 2 Sep 2014].

[5] Transportation Research Board, "Methodology for Estimating Life

Expectancies of Highway Assets," 2013. [Online]. Available:

http://apps.trb.org/cmsfeed/TRBNetProjectDisplay.asp?ProjectID=2497. [Accessed 04 Sep 2014].

[6] B. Beresh (Chair) and B. Mackie (Vice Chair), "Condition Assessment

of P&C Devices," IEEE PSRC Working Group - I22, Draft #7 (May/2014, post PSRC meeting).

[7] D. Ransom, "UPGRADING RELAY PROTECTION?—BE

PREPARED," in Western Protective Relay Conference, Washington State University, 2013.

XIII. BIOGRAPHY

Aaron Feathers is a Principal Engineer in System Protection at Pacific Gas and Electric Company, where he has been employed since 1992. He has 22

years of experience in the application of protective relaying and control

systems on transmission systems. Aaron's current job responsibilities include

design standards, wide area RAS support, NERC PRC compliance, and relay

asset management support. He has a BSEE degree from California State

Polytechnic University, San Luis Obispo and is a registered Professional Engineer in the State of California. He is also a member of IEEE and is on

the Western Protective Relay Conference planning committee and the NERC Protection System Maintenance Standard Drafting Team developing NERC

Standard PRC-005-X.

Abesh Mubaraki is an entry engineer in the Engineer Rotation and

Development Program at Pacific Gas and Electric Company, where he has

been employed since 2013. Abesh’s current rotation is in Transmission Operations Engineering, and his previous rotations were in System

Protection, and Substation and Transmission Line Asset Strategy. He has a

BSEE and MSEE degrees from California State Polytechnic University, San Luis Obispo, and he is a member of IEEE.

Ana Nungo is a Project Engineer in Substation Design and Engineering at Pacific Gas and Electric Company, where she has been employed since 2011.

Her prior experience includes Transmission Planning and Substation and

Transmission Line Asset Strategy. During her time in Asset Strategy, Ana's

responsibilities included working with System Protection to develop a joint

strategy for protective relays. She has a BSEE degree from University of

Illinois at Chicago and recently received her PMP certification. She is also a member of the Society of Hispanic Professional Engineers (SHPE).

Nai Paz is a Senior Electric Standards Engineer in Substation and Transmission Line Asset Strategy at Pacific Gas and Electric Company,

where she has been employed since 2003. She has 10 years of experience in

Substation Design and Engineering. Nai has been in her current role in Asset Strategy for 10 months. Her current job responsibilities include leading the

development of protective relay and SCADA strategies, performing risk,

data, and failure analyses, and developing risk based ranking methodologies for protective relay and SCADA projects. She has a BSEE degree from

California State University, Sacramento and is a registered Professional

Engineer in the State of California.