Relay Performance Index for a
Sustainable Relay Replacement Program
Aaron Feathers, Abesh Mubaraki, Ana Nungo, Nai Paz
Pacific Gas & Electric Company
Abstract - Many utilities now face a complex tangle of
challenges in managing the performance and reliability of
multiple generations of relays in service, as they formulate the
best strategy for designing and sequencing replacements.
The paper explores the challenges of creating a sustainable
relay upgrade and replacement program for a large utility with
a mixed fleet of new and old relay installations. PG&E has a
fleet of 35,000 relays from 12kV through 500kV. The relay fleet
consists largely of electromechanical and microprocessor
based relay types, with a small number of solid state. Budget
and resource limitations prohibit a strictly age based relay
replacement strategy. The paper focuses on using data
analytics to assess the risk of each relay in the fleet based
on a relay health and criticality score, and on how this is used
as a basis for replacement prioritization.
Various relay risk factors are analyzed and weighted
including failure rate, misoperation rate, age, relay class,
scheme type, bus configuration, and customer count. The paper
details how each factor is weighted and the basis for the
weighting. The longevity and failure modes of different relay
types or generations are discussed, along with their impact on
the replacement strategy.
Tools for managing the protective relay fleet (asset,
maintenance, and configuration databases) are described, along
with analysis of the data they capture.
The paper will show PG&E relay fleet data for the last 6
years and how the relay replacement strategy has affected relay
fleet statistics and performance.
Protective system asset owners can benefit from the fleet
management strategies presented to meet the demands of
today’s operating and replacement pressures, including long
range replacement plans and asset end-of-life decisions.
I. BACKGROUND
Pacific Gas and Electric Company (PG&E) is one of the
largest combination electric and gas utilities in the United
States. It serves about 15 million customers in northern and
central California. Approximately 20,000 employees serve its
70,000 square mile territory. It is a vertically integrated
utility with Generation, Transmission and Distribution assets.
Its transmission system is made up of approximately 18,300
miles of 500, 230, 115, 70 and 60 kV lines. Total number of
substations is approximately 860, with 35,000 relays.
Approximately 50% of the relays are microprocessor
type and 50% electromechanical type.
The PG&E System Protection department started monitoring
relay performance (failures) in 2008, spurred by
quality issues with newer relays from a specific
manufacturer. Shortly thereafter discussions began taking
place regarding the aging PG&E relay fleet and concerns
over large numbers of relays reaching the end of expected life
for both first generation microprocessor relays and also a
large number of very old electromechanical relays. In the last
6 years PG&E has been accumulating relay performance and
relay fleet data and refining how this data is analyzed to
create a sustainable relay replacement strategy, as one leg of
a sustainable relay asset strategy.
PG&E and Quanta Technology co-authored a 2012
WPRC paper titled “Creating a Sustainable Protective Relay
Asset Strategy” which outlined the many facets of such a
strategy [1]. This paper is an extension of that one focusing
specifically on the relay asset and performance data, and how
PG&E has been analyzing and using this data over the last
two years to create a sustainable relay replacement strategy.
II. INTRODUCTION
Many utilities now face challenges in managing the
performance and reliability of multiple generations of
protective relays. Developing a sustainable relay replacement
strategy is necessary to maintain reliability of these devices.
This can be a simple or complex task depending on how a
utility approaches this challenge. A simple age based relay
replacement strategy may be effective, but may also be the
most costly. Competing budget priorities for limited funding,
resource limitations, and the large numbers of relays reaching
end of life may prohibit a strictly age based replacement
program, which was the case for PG&E.
To develop a sustainable relay replacement strategy, the
responsible team needs to:
Understand the differences in relay types or
generations
Assemble and track relay asset, maintenance, and
failure/performance data
Develop tools to analyze the data
Develop and deploy a strategy
The following sections address each of these topics. The
paper demonstrates how the PG&E team developed a relay
performance index and a sustainability model that forecasts
expected failures, life expectancy, survivability and average
age over time to help determine the replacement strategy.
III. CHARACTERISTICS OF RELAY GENERATIONS
Protection systems have evolved from assemblies of
single function electro-mechanical relays to complex
multifunctional microprocessor relays. Relay technology has
migrated from electromechanical to solid-state and then
microprocessor based devices. Each of these relay classes
has different characteristics which must be recognized for
creating an effective relay asset strategy. Expected or
practical life of the devices, failure modes, and maintenance
requirements are among the differences to understand.
A. Electromechanical (EM) Relays
Electromechanical relays are the roots of system
protection. Typical life expectancy of these devices can be
up to 40 plus years and there are still many that are in service,
some up to 70 years old at PG&E. The oldest of these are
simple overcurrent relays that continue to perform. More
complex relays like distance relays with metal-can bathtub
capacitors have a shorter life as these components wear out.
One characteristic of electromechanical relays is that
they fail silently. An electromechanical relay failure is not
evident until it is discovered either through routine
maintenance, or following an operation inquiry due to relay
misoperation. The EM relays don’t typically fail but rather
drift and require calibration during routine maintenance. As
parts wear out in the relay it may not be able to be calibrated
within acceptable parameters and is considered failed.
Panel design for EM relays has a large number of
discrete relays, one per phase or zone. This yields some
inherent redundancy in EM relay schemes for relay failure.
B. Solid-State (SS) Relays
Solid state relays are the bridge between the
electromechanical relay generation and current
microprocessor relay generation. Solid state relays have an
expected life of about 20 years. Most relays of the solid state
generation are at the end of their expected life and will be
replaced before many of their electromechanical counterparts,
which have a longer service life. For this reason solid
state relays have been called a lost generation, while large
numbers of electromechanical and microprocessor relays
remain.
Solid state relays are typically set with dip switches and
dials, some more advanced types had menu driven settings,
while a few had software to interface with the device.
Software associated with these devices is typically not
supported by current computer operating systems. Solid state
relays at end of life may exhibit high failure and misoperation
rates due to electronic component failures. These failures
may not be evident until a routine test reveals the problem.
C. Microprocessor (MP) Relays
Microprocessor relays are the current generation of
protective relays. They began to be commonly applied in the
1990s. First generation microprocessor relays are reaching
end of life, which has spurred industry concern over their
replacement. Expected useful life of MP relays is about 15 to
20 years. MP relays are highly reliable until certain
components begin to reach end of life, such as electrolytic
capacitors in the power supplies. MP relay useful life may
also be limited by the capability to support the device;
configuration software for older relays may no longer operate
on newer computers. Useful life may also be impacted by
compliance driven requirements, such as NERC CIP, which
could drive replacement.
MP relays have vast arrays of functions and capabilities.
A single MP relay can replace an entire panel of EM relays.
To achieve redundancy a second MP relay is installed to
cover the single failure criterion. MP relays include self-
diagnostic capabilities and relay failure alarming. The relay
will disable itself and alarm for the majority of hardware or
software failures. MP relays do not require calibration like
an EM relay so routine maintenance is less labor intensive.
IV. PG&E RELAY FLEET
In 2008 PG&E System Protection department began
work on developing a sustainable relay asset management
strategy. This emphasis followed being given responsibility
as the relay asset owner. Concerns were emerging over the
aging MP relays in the PG&E fleet approaching end of life
with no long range plans in place to address replacement.
Figure 1 below shows the initial assessment of the PG&E
relay fleet in 2008. The number of relays installed each year
is shown, broken down by relay class. This was a light bulb
moment for PG&E, realizing large numbers of MP relays
were approaching 15 years in age which may be near end-of-
life for these devices. A large population of EM relays was
also approaching 40 years in age. Several questions arose.
How would these relays perform as they continued to age?
What is an appropriate service life for a MP relay, SS relay,
and EM relay? How many relays do we need to replace each
year to maintain reliability? Do we have enough resources
and budget available? These questions could not be
immediately answered.
Figure 1 Age distribution of PG&E relay fleet in 2008
PG&E’s current relay fleet consists of
approximately 35,000 relays. Broken down by relay class,
48% are MP, 41% EM, and 11% are SS. The EM and SS
relays as a whole are at their expected service life, as shown by
the statistics in Figure 2. The fleet is slowly transitioning
from EM to MP relays, but large populations of both exist.
The MP relay fleet on average is young, which is bolstered by
a ramp up in newly installed MP relays in the last six years.
Figure 2 Summary of PG&E's relay fleet
If we look at the age distribution for the current relay
fleet (data through 2013) in Figure 3, you can see the ramp up
in newly installed MP relays. Contrasting with Figure 1, you
can also see the reduction in EM relays since 2008 by the
lower peaks.
Figure 3 Age distribution of relays in 2013
It should be noted that the large increase in newly
installed MP relays was not driven by a relay asset strategy.
Nearly all of the newly installed relays had other drivers,
such as capacity, reliability, third party interconnections,
SCADA/Automation, protection deficiencies, or compliance
driven. Many of those projects included drop in place control
buildings, which replaced all relays in the existing control
rooms, regardless of age. This can be seen in Figure 4,
below. Relays that were replaced were not targeted by age
and had an even distribution across the entire relay fleet.
Figure 4 Relays replaced from 2008 to 2013
The effect of PG&E’s year by year relay replacements
on the fleet can be seen in the following figures. The relay
fleet is transitioning to MP type relays, with reduction in EM
and SS relays as shown in Figure 5. You can also see that the
overall number of relays in the fleet is being reduced, due to
the increase in multi-function MP relays that replace many
discrete EM relays.
Figure 5 Relay fleet profile by year
Figure 6 shows the average age of the relay fleet by relay
class. The EM relay average age is increasing each year
since few new EM relays are installed and the EM relays
that remain continue to age.
The MP relay fleet average age is not increasing, but this is
skewed by the large increase in newly installed MP relays.
The older, first generation, MP relays continue to age and the
number of MP relays beyond 15 years in age continues to
increase as shown in Figure 7. The percent of relays beyond
expected life span is shown in Figure 8. Comparing Figure 7
and Figure 8 you can see that even though the number of EM
relays beyond 40 years is decreasing, the percent of relays
beyond 40 years is increasing for those relays that are left.
Key Questions:
Are the right relays being replaced?
What are the effects of non-targeted relay replacements?
Figure 6 Age of relay fleet trend
Figure 7 Number of relays beyond expected life span
Figure 8 Percent of relays beyond expected life span
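The relationship between Figure 7 and Figure 8 can be illustrated with a minimal sketch. It assumes expected-life thresholds of 40 years for EM, 20 years for SS and 15 years for MP relays, taken from the figures quoted in the text; the Relay record, the function name and the toy fleet are hypothetical and are not PG&E data.

```python
from dataclasses import dataclass

# Expected-life thresholds in years (EM and MP as discussed for Figures 7 and 8,
# SS from Section III). Treating them as hard cut-offs is an assumption.
EXPECTED_LIFE = {"EM": 40, "SS": 20, "MP": 15}


@dataclass(frozen=True)
class Relay:
    relay_class: str   # "EM", "SS" or "MP"
    install_year: int


def beyond_life_stats(fleet, as_of_year):
    """Return {class: (count_beyond, percent_beyond)} for one snapshot year."""
    stats = {}
    for cls, limit in EXPECTED_LIFE.items():
        ages = [as_of_year - r.install_year for r in fleet if r.relay_class == cls]
        beyond = sum(1 for age in ages if age > limit)
        percent = 100.0 * beyond / len(ages) if ages else 0.0
        stats[cls] = (beyond, percent)
    return stats


# Hypothetical toy fleet: non-targeted replacements remove EM relays of all ages.
fleet_2008 = [Relay("EM", 1960)] * 4 + [Relay("EM", 1990)] * 6
fleet_2013 = [Relay("EM", 1960)] * 3 + [Relay("EM", 1990)] * 2

stats_2008 = beyond_life_stats(fleet_2008, 2008)
stats_2013 = beyond_life_stats(fleet_2013, 2013)
```

In this toy fleet the count of EM relays beyond 40 years falls from 4 to 3 while the percentage rises from 40% to 60%, which is the same pattern the two figures show.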
V. PROTECTION SYSTEM PERFORMANCE DATA
To analyze relay performance, PG&E currently tracks
relay failures, and classifies them as a trip or “safe-mode”
failure depending on whether the failure resulted in a
misoperation or a nonoperation. Trips may or may not result
in an outage that affects customers. While undesirable, a
failure resulting in a nonoperation is tolerable because the
multiple levels of redundancy designed into the system ensure
that the impact on equipment and safety is still minimal. In
contrast, a failure that causes a relay to misoperate is not
tolerable since it could cause an outage, affecting customers
and possibly system stability.
Over the course of six years (2008-2013) approximately
one percent of PG&E’s relay fleet has failed as shown in
Figure 9. The number of failures that resulted in trips is about
one tenth of one percent, yet relay failures contribute about
10% annually (5 year average) to the overall Substation
System Average Interruption Frequency Index (SAIFI) and
System Average Interruption Duration Index (SAIDI) indices
as shown in Table 1. In 2013, two major outages in the
PG&E system were a direct result of a misoperation caused
by relay failure. The graph shows that only a very small
percentage of relays fail, and of those that fail a high
percentage fail in a safe mode without any relay action or
resulting trip; about ten relay failures per year result in
a trip. However, for the small percentage of relays that fail
insecurely and cause a trip, the consequences can range from
minimal to severe depending on the protection scheme type and
location. If a method could be developed to specifically target
and replace these high risk relays, the benefits would be
substantial.
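The trip versus safe-mode split described above can be sketched as a simple tally. The record layout (a dict with a boolean "tripped" key) and the sample numbers are hypothetical; they only illustrate the bookkeeping, not PG&E's actual failure database.

```python
from collections import Counter


def summarize_failures(records, fleet_size):
    """Split failure records into trips and safe-mode failures.

    Each record is a dict with a boolean 'tripped' key (hypothetical schema).
    """
    counts = Counter("trip" if r["tripped"] else "safe_mode" for r in records)
    total = counts["trip"] + counts["safe_mode"]
    return {
        "failures": total,
        "trips": counts["trip"],
        "safe_mode": counts["safe_mode"],
        "failure_pct_of_fleet": 100.0 * total / fleet_size,
        "trip_pct_of_failures": 100.0 * counts["trip"] / total if total else 0.0,
    }


# Hypothetical sample: ten failures in a fleet of 1,000 relays, one of them a trip.
records = [{"tripped": False}] * 9 + [{"tripped": True}]
summary = summarize_failures(records, fleet_size=1000)
```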
Figure 9 Relay failure performance
Table 1 Summary of relay failures that contributed to SAIFI and SAIDI
Year                                                  2009   2010   2011   2012   2013    5 yr. Avg.
Relay failures contribution to substation SAIFI (%)   11%    5.4%   8.9%   8.6%   18.1%   10.3%
Relay failures contribution to substation SAIDI (%)   5.1%   2.7%   9.6%   6.5%   33.3%   10%
Figure 10 below shows PG&E relay failures from 2008
to 2013 broken down by relay class. The vast majority of
reported relay failures are MP relay types.
EM relays fail silently; their failures remain unknown until
found during maintenance or an operation inquiry, and so are
underreported. Reported SS relay failures are few, but
significant considering the small population base. A large
percentage of the SS relay failures were reported following
misoperations caused by aging components in devices near end
of life.
PG&E has seen a decreasing trend in reported relay
failures. Some of this has been achieved through close work
with relay manufacturers to address manufacturing defects,
relay firmware issues, and service advisories, such as
proactive replacement of relay power supplies. However, the
large decrease cannot be entirely accounted for, and part may
be due to underreporting of relay failures. An uptick has
been seen in 2014 data, which is not shown here.
Figure 10 Relay failures 2008 to 2013
If any lesson can be learned from tracking relay failures
and the consequences, it is that not all MP relays will fail
securely; some will trip. MP relays have self-diagnostics that
will disable the relay for the majority of relay failures, but
some failure modes cannot be detected and may fool the relay
and cause a trip, such as an Analog-to-Digital module failure.
A large outage for PG&E in 2013 was caused by a single
MP relay failure in a low-impedance bus differential scheme.
The bus differential scheme had a separate MP relay for A, B
and C phases where each relay made independent bus
differential trip decisions. Following the outage, a corrective
measure was to add undervoltage supervision between the
bus differential relays to prevent a single relay failure from
causing a similar outage. For example, before the A phase bus
differential relay can trip, it must receive confirmation of
an A phase undervoltage condition from the B or C phase relay.
Each relay monitors all three phases of the bus potential.
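The undervoltage supervision described above can be sketched as a small piece of boolean logic. This is an illustrative reading of the text, not the actual relay logic: the function name and the dictionary layout are hypothetical, and real schemes would also involve timing, communications and pickup settings that are not modeled here.

```python
PHASES = ("A", "B", "C")


def supervised_trip(phase, diff_pickup, undervoltage_seen):
    """Return True if the `phase` relay's differential trip is permitted.

    diff_pickup: True if this relay's own differential element has operated.
    undervoltage_seen: dict mapping each relay's phase to the set of bus
        phases on which that relay currently measures undervoltage.
    """
    others = [p for p in PHASES if p != phase]
    # Confirmation must come from one of the OTHER two relays, so that a
    # single failed relay cannot both assert the trip and confirm it.
    confirmed = any(phase in undervoltage_seen.get(o, set()) for o in others)
    return bool(diff_pickup and confirmed)


# Failed A phase relay asserts a differential trip, but the bus is healthy.
healthy_bus = {"A": set(), "B": set(), "C": set()}
# Genuine A phase bus fault: all three relays see A phase undervoltage.
a_phase_fault = {"A": {"A"}, "B": {"A"}, "C": {"A"}}

blocked = supervised_trip("A", True, healthy_bus)      # trip is blocked
allowed = supervised_trip("A", True, a_phase_fault)    # trip is permitted
```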
Figure 11 below shows how many PG&E relay failures had a
resulting relay trip (misoperation) and the number of
outages affecting customers that resulted.
Figure 11 Relay failure caused trips and resulting outages
Relay failure data can be looked at in finer granularity by
age of the devices, specific manufacturer or even model. The
confidence in the results of analysis will depend on the
amount and quality of available data. Figure 12 below shows
MP relay failures by the age of the relays for a single (most
applied) manufacturer in the PG&E system. The graph
shows increasing MP relay failure rates after 15 years.
Figure 12 MP relay failure rate
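A failure rate by age, as plotted in Figure 12, can be estimated by dividing the failures observed in each age band by the relay-years of exposure in that band. The sketch below assumes 5-year bins and a hypothetical input layout; it is one plausible way to do the calculation, not necessarily the method used for the figure.

```python
from collections import defaultdict


def failure_rate_by_age(in_service_ages, failure_ages, bin_width=5):
    """Failures per relay-year, grouped into age bins.

    in_service_ages: one entry per relay-year of exposure (the age of the
        relay during that year).
    failure_ages: age of the relay at each recorded failure.
    Returns {bin_start_age: failures / relay_years}.
    """
    exposure = defaultdict(int)
    failures = defaultdict(int)
    for age in in_service_ages:
        exposure[age // bin_width * bin_width] += 1
    for age in failure_ages:
        failures[age // bin_width * bin_width] += 1
    # Only bins with exposure are reported, which avoids division by zero.
    return {b: failures[b] / exposure[b] for b in sorted(exposure)}


# Hypothetical data: 100 relay-years at age 2, 50 relay-years at age 17.
rates = failure_rate_by_age([2] * 100 + [17] * 50, [3, 16, 18, 19])
```

With small populations the per-bin rates become noisy, which is why the confidence in such results depends on the amount and quality of the available data.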
VI. ASSET AND FLEET MANAGEMENT STRATEGIES
Asset management strategies cover the entire spectrum
from simple to complex depending on the type of asset, and
nearly every strategy requires, to varying degrees, some form
of inspection, maintenance, and recordkeeping. One simple
strategy would be a “run to failure” (reactive) approach that
requires virtually no recordkeeping, and maintenance usually
happens after a failure is discovered; while this approach
would not be acceptable for protective devices, it
would be a reasonable way to manage office furniture or
lighting.
A simple and satisfactory relay fleet management
strategy would consist of age-based replacement. Other than
the recordkeeping required by regulations, this strategy
mainly depends on device installation data. However, as the
fleet size increases and/or funds are limited, asset managers
seek more proactive approaches. This increases the need for
proper inspection and monitoring, and requires adequate and
accurate recordkeeping of factors like performance, failure
rates, failure modes (through root-cause-analysis), and
various other aspects of the asset.
Not all microprocessor relays will fail securely, some will
trip. This must be accounted for in design of critical
schemes.
VII. RELAY ASSET DATABASE AND CHALLENGES
You cannot have an effective relay asset strategy without
data to determine and monitor the characteristics of the relay
fleet and its performance.
In today’s world Big Data is one of the latest buzzwords.
Executives want and expect to have more data and statistics
to gain insight into their business and to drive business
planning, especially asset strategy decisions which have a
large financial impact. Unfortunately, utilities are not the
best record keepers. The problem is the opposite of Big Data;
it is missing data, no data, or fragmented data dispersed in
multiple databases. There are no quick fixes or silver bullets
for correcting this deficiency. It is an area to strive for
continuous improvement that will require effort and diligence
to correct, likely over a long period of time.
When you have poor data quality, the data must be
scrubbed to analyze it, which is very inefficient. Multiple
databases with the same information introduce discrepancies
between different data sets. Missing information must be
gathered, or assumptions made to cover gaps, or useful data
extrapolated over the entire asset to cover gaps.
For PG&E, one thing that has improved the quality of
relay asset data is that the management of that data is now
under one organization. System Protection was assigned
responsibility for NERC PRC-005 compliance for relay
maintenance, and following that change the employees
responsible for SAP data entry for relays and their associated
maintenance plans were moved under the System Protection
organization. This provided direct control for data that is
critical to System Protection for both relay asset management
and relay maintenance. Prior to this the SAP relay data was
managed by the substation asset management organization
responsible for high voltage equipment, and there was a
disconnect and little interaction with System Protection
personnel who did not use SAP. Control of the relay asset
data has allowed the proper focus to be put on the importance
of this data and the ability to clean up and restructure the data
as needed.
Presently, PG&E relay asset data is dispersed in the
following databases.
SAP – Official company asset record and used for work management, such as relay maintenance triggers.
ASPEN Relay Database – System Protection relay setting database.
Material Problem Report (MPR) Database – Used for tracking relay failures. Entered by relay technicians. Managed by Sourcing under System Protection guidance.
Powerbase/RTS – Used for relay maintenance records. Capturing the as-found condition of relays when performing maintenance is especially important as a data point for electromechanical relay health and performance.
System Protection Deficiencies Database – Nomination for relay replacements by Protection Engineers.
Event Reporting System – Company outage database. Contains detailed information for relay misoperations including customer impact and corrective actions.
Future plans are to consolidate databases where possible
and link information that is shared between databases to
eliminate duplication, improve efficiency, and minimize
discrepancies.
Figure 13 Database optimization
A. Relay Asset Data
In order to analyze the relay asset data you need
consistently entered data. One method of ensuring
consistency for certain fields is to use pick lists. Consolidate
databases where possible or link databases to share data for
duplicate fields to eliminate discrepancies.
Create pick lists for key relay fields such as:
o Manufacturer
o Model (main type, not style or order code)
o Scheme or Function
o Relay Class
Other important relay data fields include:
o Location
o Element Being Protected
o Relay Style/Order Code
o Serial Number
o Install Date
o Manufactured Date
o Firmware Version
o Service Advisories/Status
o Removal/Retired Date
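The pick-list idea above can be sketched as a record type whose key fields are validated against fixed lists. The field names follow the lists in this section; the pick-list values ("Vendor A", "Vendor B") and the validate function are hypothetical placeholders, and a real deployment would load the lists from the asset database.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class RelayClass(Enum):
    EM = "Electromechanical"
    SS = "Solid-State"
    MP = "Microprocessor"


# Hypothetical pick lists used only for illustration.
MANUFACTURERS = {"Vendor A", "Vendor B"}
SCHEMES = {"Bus Protection", "Line Current Differential"}


@dataclass
class RelayAsset:
    location: str
    manufacturer: str
    model: str
    scheme: str
    relay_class: RelayClass
    install_date: date
    removal_date: Optional[date] = None


def validate(asset: RelayAsset) -> List[str]:
    """Return a list of data-quality problems; an empty list means clean."""
    errors = []
    if asset.manufacturer not in MANUFACTURERS:
        errors.append(f"unknown manufacturer: {asset.manufacturer!r}")
    if asset.scheme not in SCHEMES:
        errors.append(f"unknown scheme: {asset.scheme!r}")
    if not isinstance(asset.relay_class, RelayClass):
        errors.append(f"unknown relay class: {asset.relay_class!r}")
    if asset.removal_date is not None and asset.removal_date < asset.install_date:
        errors.append("removal date precedes install date")
    return errors
```

Free-text variants such as "vendor a" are rejected at entry instead of being scrubbed later during analysis.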
B. Failure Data
Relay failure data is a key indicator for relay
performance and health. Microprocessor relays alarm when
they fail and this failure data can be captured if processes are
put in place. However there may be little relay failure data
available for electromechanical relays, which drift out of
calibration and fail silently.
Some key fields for relay failure data, in addition to
those already listed above, include:
Failure Date
Complaint/Description of Problem
Cause
Correction
Relay Trip/Misoperation (Yes/No)
Return Material Authorization (RMA) number from
manufacturer
Create pick lists for fields such as:
o Failed Component (power supply, CPU, etc.)
o How Discovered (e.g., During Maintenance, Alarm,
Installation, Station Inspection, Operation Inquiry, etc.)
C. Maintenance/Repair Data
Applying analytics to relay asset data will not only help
you to understand the information contained within the data,
but it will also help identify the data that is most important to
the business and future business decisions. It may also
identify data that is missing, which may require update of
processes and procedures to capture this new information.
For PG&E, one of the missing pieces of data that needs to be
captured is the as-found condition of relays when performing
routine relay maintenance. This is needed as a data point for
electromechanical relay health. It was found there was very
little relay failure data for electromechanical relays since they
fail silently, unlike microprocessor relays which alarm when
failed.
NERC PRC-005-2 allows an entity to use Performance
Based Maintenance to extend maintenance intervals as long
as Countable Events are kept within prescribed levels. Even
if an entity only uses Time Based Maintenance, tracking
Countable Events is a valuable data point for relay health [2].
Tracking the as-found condition (whether the relay is
functioning properly and within calibration) provides a good
data point for electromechanical relay performance and
health. Through a utility peer review, it was noted the policy
for one large utility was to replace a relay if it was found out
of calibration on two consecutive performances of routine
maintenance.
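The peer utility's rule mentioned above (replace a relay found out of calibration on two consecutive routine maintenances) can be sketched from as-found records. The input format, a chronological list of booleans per relay, is a hypothetical simplification of what a maintenance database would hold.

```python
def needs_replacement(as_found_in_calibration):
    """Flag a relay found out of calibration on two consecutive maintenances.

    as_found_in_calibration: chronological list of booleans, one per routine
        maintenance, True meaning the relay was found within calibration.
    """
    history = list(as_found_in_calibration)
    # Compare each maintenance result with the one that followed it.
    return any(not first and not second
               for first, second in zip(history, history[1:]))


# Hypothetical histories for two relays.
drifting_relay = needs_replacement([True, False, False])   # flagged
recovered_relay = needs_replacement([False, True, False])  # not flagged
```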
VIII. RISK DEFINITION
In order to further improve safety and reliability of the
utility services, regulators are interested in adopting risk
based, or risk informed, decision making for investments.
This would be in addition to various other factors currently
considered in the Rate Case [3]. The goal is to optimize the
cost allocated to safety and maximize the safety benefits / risk
reduction. To achieve this, a method needs to be established
that would allow the analysis and comparison of risk
associated with different asset classes on a common scale.
However, there are several challenges that need to be
overcome.
To better understand the challenges let’s start with
examining risk. One definition of risk according to Merriam-
Webster is “the possibility that something bad or unpleasant
(such as an injury or a loss) will happen” [4]. Based on this
definition it can be concluded that there are two aspects of
risk: (1) the probability (possibility) of a “bad” event
occurring, e.g. a relay failure, and (2) the impact (severity)
caused by the event, e.g. a misoperation leading to an outage.
For the purpose of this paper risk is defined as follows:
The probability of an event occurring can be determined
with relative ease if proper and accurate asset performance
data is available to analyze trends. The current data quality
is only marginally acceptable, but steps can be implemented
going forward to improve it. On the other hand,
determining an event’s potential impact to the system is not
as straightforward because numerous complex factors need to
be considered, such as: the state of the system, the type and
function of device that failed, the failure mode, etc.
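The two aspects of risk discussed above are commonly combined as a product of probability and impact. The sketch below assumes that form; the product, the function name and the sample numbers are assumptions for illustration and are not taken from the paper's own definition.

```python
def risk(annual_failure_probability, impact):
    """Risk as probability times impact (an assumed, commonly used form)."""
    if not 0.0 <= annual_failure_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if impact < 0:
        raise ValueError("impact must be non-negative")
    return annual_failure_probability * impact


# Hypothetical numbers: an unreliable relay in a low-impact location can carry
# less risk than a fairly reliable relay in a critical location.
low_impact_relay = risk(0.05, impact=1.0)
high_impact_relay = risk(0.01, impact=10.0)
```

Putting both aspects on one scale is what allows relays to be compared with each other and, in principle, with other asset classes.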
IX. RELAY PERFORMANCE INDICES
In order to identify the worst performing relays in the
most critical areas of the system the following scoring
methodology was developed. Two main indices are used to
assess the entire fleet of 35K relays: Criticality Score and
Health Score. These two indices are decoupled; however, a
final score can be created by a weighted combination.
Criticality Score – represents the potential impact that a
relay failure and/or misoperation can have on the system; a
higher score implies a greater impact. This score is
independent of the condition of the relay; it represents the
residual risk of any device in a particular part of the system
performing a particular function. For example: a bus
differential relay at an important substation protecting a
single bus single breaker would have a higher criticality score
than a line differential relay at a flip-flop station.
Countable Event – A failure of a component requiring repair or
replacement, any condition discovered during the maintenance
activities in {PRC-005-2} Tables 1-1 through 1-5 and Table 3
which requires corrective action, or a Misoperation attributed
to hardware failure or calibration failure. Misoperations due to
product design errors, software errors, relay settings different
from specified settings, Protection System component
configuration errors, or Protection System application errors
are not included in Countable Events.
The Substation Tier is a PG&E internal ranking that
groups substations into six tiers based on various factors such
as load and customers served, electrical location, power flow
path limits, etc. Table 2 lists the weight given to relays
located at a substation of a particular tier. The weights
assigned to protection Scheme and Bus Configuration are
shown in Table 3 and Table 4 respectively.
Table 2 Weight assigned to various Substation Tiers
Substation Tier Count Weight
T1 3128 10
T2 873 8
T3 1029 6
T4 3721 4
T5 2004 2
Other 23291 1
Table 3 Weight assigned to various protection Schemes
Scheme Count Weight
A.C. Undervoltage 54 1
Annunciation 52 1
Automatics 4098 4
Breaker BU / Breaker Failure 2394 10
Bus Protection 2324 10
Bus Reactor 62 3
Capacitor Control 5 1
Capacitor Protection - Series 117 5
Capacitor Protection - Shunt 252 2
Condenser Protection 133 4
Current Balance 4 6
D.C. Undervoltage 9 9
Digital Fault Recorder 2 1
Direct Transfer Trip 701 7
Directional Comparison 1650 5
Directional Distance 2410 4
Directional Overcurrent 1362 2
Frequency Load Shedding 229 5
Line Current Differential 263 6
Non-Directional Overcurrent 10419 1
Other or Unknown 9 1
Phase Comparison 23 6
Power Load Shedding 2 2
PW Current Differential 437 6
RAS 778 8
Reactor Protection - Shunt 88 3
Regulator Protection 28 5
SCADA 231 1
Special Protection Scheme 268 7
Transformer Bank - Large 1635 6
Transformer Bank - Medium 3509 5
Transformer Bank - Small 419 5
Voltage Load Shedding 79 7
Table 4 Weight assigned to various Bus Configurations
Bus Config. Count Weight
BAAH 2471 1
DBDB 1869 1
DBSB 6990 10
DBSB & DBDB 67 5
FFLOP 870 6
LOOP 4520 5
M/A 6002 7
M/A & DBDB 178 8
M/A & DBSB 247 9
M/A & SBSB 77 8
N/A 857 1
Ring 1196 1
SBSB 1680 4
Synch Cond 28 1
Tap 956 4
Unknown 6038 2
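A Criticality Score built from the weights in Tables 2 through 4 could be sketched as follows. The weights are copied from the tables (only a few Scheme and Bus Configuration rows are included for brevity), but the way they are combined, a simple average of the three weights, is an assumption made for illustration; the text does not state the combination rule.

```python
# Weights copied from Table 2, and a subset of Table 3 and Table 4.
TIER_WEIGHT = {"T1": 10, "T2": 8, "T3": 6, "T4": 4, "T5": 2, "Other": 1}
SCHEME_WEIGHT = {
    "Bus Protection": 10,
    "Breaker BU / Breaker Failure": 10,
    "Line Current Differential": 6,
    "Non-Directional Overcurrent": 1,
    "Other or Unknown": 1,
}
BUS_WEIGHT = {"DBSB": 10, "FFLOP": 6, "SBSB": 4, "BAAH": 1, "Ring": 1, "Unknown": 2}


def criticality_score(tier, scheme, bus_config):
    """Average of the three table weights (the averaging is an assumption)."""
    tier_w = TIER_WEIGHT.get(tier, TIER_WEIGHT["Other"])
    scheme_w = SCHEME_WEIGHT.get(scheme, SCHEME_WEIGHT["Other or Unknown"])
    bus_w = BUS_WEIGHT.get(bus_config, BUS_WEIGHT["Unknown"])
    return (tier_w + scheme_w + bus_w) / 3


# The comparison used in the text: bus differential on a single bus single
# breaker versus line differential at a flip-flop station, same tier.
bus_diff = criticality_score("T1", "Bus Protection", "SBSB")
line_diff = criticality_score("T1", "Line Current Differential", "FFLOP")
```

Under this assumed combination the bus differential relay scores higher than the line differential relay, consistent with the example given for the Criticality Score.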
Health Score – represents the condition of the relay and
correlates to the probability of failure; a higher score implies
a greater likelihood of failure. The health score is calculated
as follows:
The Performance is tracked on a relay make and model
basis using the historic issues available for the population of a
particular relay make and model. Currently PG&E tracks
different performance issues in different databases; for
example, calibration issues found during maintenance are
tracked in a different database than the in-service failures.
Thus information from various databases had to be combined
to calculate the performance. Some other factors to consider
in improving the Performance metric would be: the
availability of spare parts, lead time for replacement units,
ease of replacement, service advisories etc.
Performance = Normalizing Factor × (DataBase Entries + Failures + 10 × Failures Resulting in Trips)
Where:
Each DataBase Entry and Failure was weighted the
same, but due to the unacceptable consequences of failures
that resulted in trips, these were separated and further
weighted by a factor of 10. The Normalizing Factor is used to
compensate for the differences in population among the
various relay makes/models, and to ensure that the Performance
metric has a value between 0 and 10.
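The weighting described in this paragraph can be sketched in code. The 10x weight on failures that resulted in trips comes from the text; the specific normalization used here (issues per installed relay, scaled so the worst make/model scores 10) is an assumption, as are the function names and sample counts.

```python
def raw_performance(db_entries, failures, trip_failures, population):
    """Weighted issue count per installed relay for one make/model.

    failures counts in-service failures that did not cause a trip;
    trip_failures are counted separately and weighted by 10.
    """
    weighted = db_entries + failures + 10 * trip_failures
    return weighted / population


def performance_scores(models):
    """models: {name: (db_entries, failures, trip_failures, population)}.

    Returns {name: score} scaled to the range 0 to 10 (assumed normalization).
    """
    raw = {name: raw_performance(*values) for name, values in models.items()}
    worst = max(raw.values())
    if worst == 0:
        return {name: 0.0 for name in raw}
    return {name: 10.0 * (value / worst) for name, value in raw.items()}


# Hypothetical counts for two relay models.
scores = performance_scores({
    "Model X": (20, 5, 1, 100),   # 35 weighted issues over 100 relays
    "Model Y": (4, 1, 0, 200),    # 5 weighted issues over 200 relays
})
```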
The Age information of individual relays is also
incorporated into the Health Score. The goal here is to sort
the individual relays within the population of the same
make/model, hence it only makes up 15% of the score. The
weight given to the age of electromechanical relays is
different than that of the solid-state and microprocessor
relays, as seen in Table 5 and Table 6.
Table 5 Weight assigned to EM relay age
EM (years) Weight
> 40 10
30 - 40 8
20 - 29 6
10 - 19 3
< 10 1
Unknown 5
Table 6 Weight assigned to MP and SS relay age
MP or SS (years) Weight
> 20 10
15 - 20 8
10 - 14 6
5 - 9 3
< 5 1
Unknown 5
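The age weights in Table 5 and Table 6 translate directly into a lookup, sketched below. The 15% share for Age comes from the text; treating the remaining 85% as the Performance metric is an assumption, and the function names are hypothetical.

```python
def _band_weight(age, top, lower_bounds):
    """Weight 10 above `top`, then the first band whose lower bound is met."""
    if age > top:
        return 10
    for lower, weight in lower_bounds:
        if age >= lower:
            return weight
    return 1


def age_weight(relay_class, age_years):
    """Age weight per Table 5 (EM) or Table 6 (MP and SS)."""
    if age_years is None:
        return 5                                   # "Unknown" row in both tables
    if relay_class == "EM":
        return _band_weight(age_years, 40, [(30, 8), (20, 6), (10, 3)])
    return _band_weight(age_years, 20, [(15, 8), (10, 6), (5, 3)])


def health_score(performance, relay_class, age_years, age_share=0.15):
    """Combine Performance (0 to 10) with the age weight.

    age_share=0.15 follows the text; assigning the remaining 85% to
    Performance is an assumption.
    """
    return (1.0 - age_share) * performance + age_share * age_weight(relay_class, age_years)
```

For example, two relays of the same make/model share a Performance value, so a 45 year old EM relay (age weight 10) sorts ahead of a 12 year old one (age weight 3).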
Based on the preceding method the entire PG&E relay
fleet was scored, and the Total Score for the entire fleet
sorted in descending order is shown in Figure 14. The key
take-away is that a very small percent of the population has a
high total score, and this is where the efforts of the targeted
relay replacement program should be focused. The blue curve
in Figure 15 represents the Total Score of the first 516 relays,
and the red data points represent the Total Score of the
corresponding relays if replaced with new relays; essentially
the Health Score is near zero and only the contribution of the
Criticality Score remains. The actual score diverges from
that of the ideal fleet within approximately the first
hundred relays, so replacing these relays would provide the
most “bang for the buck”.
Figure 14 Total Score for the entire fleet of relays
Figure 15 Total Score of the poor performers
One drawback of the Total Score is that the values are
not intuitive, especially when trying to compare relays to
other asset classes. Therefore an effort was undertaken to
determine the annual probability of failure. The asset data
was analyzed to determine the annual failure rates based on
relay make and model. It was observed that a few relay types
did not have any records of failure, but were still scored as
poor performers due to many maintenance issues. Because of
the large number of maintenance notifications, the assumption
was made that these relays required corrective actions
(recalibration, repair, or nomination for replacement);
therefore these relays should also be assigned a failure rate
based on Performance. Another challenge was that the small
population (often fewer than 20 units) for some relay types
resulted in significant variation of the failure rate compared
to the Performance; some extreme cases were excluded from
the analysis. Figure 16 shows the performance (blue curve)
on the left Y-axis and the failure rate (red points) on the right
Y-axis. The black curve is the logarithmic trendline that was
used to assign failure rates to individual relays.
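A logarithmic trendline of the kind described above has the form rate = a ln(Performance) + b. The sketch below shows one common way to obtain a and b, by ordinary least squares on ln(Performance); it is an illustration only, and the function names and the sample points are hypothetical rather than PG&E data or the paper's fitted coefficients.

```python
import math
from typing import Sequence, Tuple


def fit_log_trendline(performance: Sequence[float],
                      failure_rate: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of failure_rate = a * ln(performance) + b.

    All performance values must be positive, since ln(0) is undefined.
    """
    xs = [math.log(p) for p in performance]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(failure_rate) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, failure_rate))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def assigned_failure_rate(performance: float, a: float, b: float) -> float:
    """Read an annual failure rate off the fitted trendline (positive input only)."""
    return a * math.log(performance) + b


# Synthetic points generated from a known curve, purely to exercise the fit.
sample_performance = [1.0, 2.0, 4.0, 8.0, 10.0]
sample_rates = [0.005 + 0.02 * math.log(p) for p in sample_performance]
a_fit, b_fit = fit_log_trendline(sample_performance, sample_rates)
```

With real data the points scatter around the curve, which is why relay types with very small populations can distort the fit.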
Figure 16 Comparing the performance to failure rate
Using the annual expected failure rate for the relays in
the entire fleet, a risk heat map was created as shown in
Figure 17. The relays in the highlighted area in upper right
are the high risk relays, and the size of this area can be
adjusted to determine the number of relays to replace based
on constraints such as budget and the company’s risk tolerance.
Note: multiple data points (relays) overlap and appear as one,
so to determine the number of relays in the
highlighted area, the dataset needed to be filtered according
to the criteria selected.
Figure 17 PG&E relay fleet risk heat map
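Because overlapping points hide the true count on the heat map, the number of relays in the highlighted area has to come from the data rather than from the plot. The sketch below illustrates that filtering step under the assumption that the two heat map axes are annual failure rate and criticality score; the record type, field names, and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RelayRisk:
    relay_id: str
    annual_failure_rate: float  # assumed heat map axis
    criticality: float          # assumed heat map axis


def high_risk_relays(fleet: List[RelayRisk],
                     min_failure_rate: float,
                     min_criticality: float) -> List[RelayRisk]:
    """Return every relay inside the upper-right (high risk) region."""
    return [
        relay for relay in fleet
        if relay.annual_failure_rate >= min_failure_rate
        and relay.criticality >= min_criticality
    ]


# Two relays share identical coordinates and would plot as a single point,
# yet both are counted by the filter.
example_fleet = [
    RelayRisk("R1", 0.04, 8.0),
    RelayRisk("R2", 0.04, 8.0),
    RelayRisk("R3", 0.01, 9.0),
    RelayRisk("R4", 0.05, 2.0),
]
selected = high_risk_relays(example_fleet, min_failure_rate=0.03, min_criticality=7.0)
```

Raising or lowering the two thresholds corresponds to resizing the highlighted area to match budget or risk tolerance.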
X. RELAY LIFE CYCLE REPLACEMENT PLAN
A. Sustainability Model Overview
A sustainability model that uses statistics to forecast
expected failures, life expectancy, survivability, and average
age over time has been developed to help determine the
replacement strategy. It is based on the following:
• Three models were developed, one per relay class
  (electromechanical (EM), solid state (SS), and
  microprocessor (MP)). Each model simulates every
  relay in the fleet for failure and replacement for
  1000 Monte Carlo iterations across a span of 50
  years using a custom script in Microsoft Excel.
• The relay fleet in each model was categorized by age.
  18% of the fleet (EM: 4272, SS: 1457, MP: 708) did not
  have age data and is excluded from the simulation.
• All models require as inputs a failure rate curve, a
  replacement profile, a replacement rate by Other work
  (work beyond System Protection’s replacement list),
  and a proactive replacement rate.
• All relays will be replaced with MP class relays.
• EM and SS models have no proactive replacement
  rate since they are not being replaced in kind. The
  failed relays and replaced relays are removed from
  the fleet. These removals are then used as inputs
  (additions) to the MP model.
• The SS and MP models will replace relays at age 20
  years up to a maximum number defined by the user.
• The MP model simulates failure and replacement
  within the microprocessor fleet and replaces each
  failed relay with a new unit. This model also takes
  the simulated removals from the EM and SS fleets
  and replaces them with new units.
• Historical data on relay removals was used to derive
  the replacement profiles by calculating best fit trend
  lines.
• Failure rate curves for the EM and SS models were
  estimated. The MP curve was calculated based on
  historical data.
• The algorithm for replacement is based on age and
  the replacement profile defined by the user. The
  replacement profile is compared to a random
  number that the model produces, and the relay is
  replaced or not depending on where the random
  number falls.
• The algorithm for failure is based on the failure rate
  curve input defined by the user. It applies a similar
  random number test to the one used for the
  replacement profile. However, it is unbiased to age.
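The random number tests described in the list above can be illustrated with a short Monte Carlo loop. This is a simplified sketch in Python, not the Excel script used for the paper: the function name, its parameters, and the order of the two tests are assumptions, and it omits the age-20 replacement cap and the transfer of EM and SS removals into the MP model. The failure rate is passed as a function of age so that either a constant or an age-dependent curve can be supplied.

```python
import random
from typing import Callable, List, Tuple


def simulate_fleet(initial_ages: List[int],
                   failure_rate: Callable[[int], float],
                   replacement_prob: Callable[[int], float],
                   years: int = 50,
                   iterations: int = 1000,
                   replace_in_kind: bool = False,
                   seed: int = 0) -> Tuple[List[float], List[float]]:
    """Return (mean failures per year, mean removals per year)."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    failures = [0] * years
    removals = [0] * years
    for _ in range(iterations):
        ages = list(initial_ages)
        for year in range(years):
            survivors = []
            new_units = 0
            for age in ages:
                if rng.random() < failure_rate(age):
                    failures[year] += 1      # relay fails and leaves the fleet
                    new_units += 1
                elif rng.random() < replacement_prob(age):
                    removals[year] += 1      # relay is removed by replacement work
                    new_units += 1
                else:
                    survivors.append(age + 1)  # relay survives and ages one year
            if replace_in_kind:
                survivors.extend([0] * new_units)  # e.g. MP relays replaced with MP
            ages = survivors
    return ([f / iterations for f in failures],
            [r / iterations for r in removals])
```

Setting `replace_in_kind=False` mimics the EM and SS models, where removed relays leave the class for good; `replace_in_kind=True` mimics the MP model, where each removed relay returns as a new unit of age 0.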
B. Failure Rates
Ideally, failure rate (λ) curves should come from
historical data the utility observes over a specific time period.
However, PG&E does not have sufficient data on
recorded failures for EM and SS relays, as 90% of failures are
from MP relays. Therefore, EM and SS relay failure rate
curves were estimated with conservative assumptions.
Table 7 Failure rate assumptions
The points in Table 7 were plotted and an exponential
best fit failure curve was fitted to the data points. The
estimated failure rate curves for EM and SS relays are plotted
in blue and red, respectively, in Figure 18.
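An exponential best fit of the form rate = A exp(k age) can be obtained by fitting a straight line to the natural logarithm of the rates. The sketch below shows that approach as an illustration; it is not necessarily the exact procedure used for the paper, and the function name and the sample points are hypothetical, not the values of Table 7.

```python
import math
from typing import Sequence, Tuple


def fit_exponential(ages: Sequence[float],
                    rates: Sequence[float]) -> Tuple[float, float]:
    """Fit rate = A * exp(k * age) by least squares on ln(rate).

    All rates must be positive so that their logarithm exists.
    """
    logs = [math.log(rate) for rate in rates]
    n = len(ages)
    mean_age = sum(ages) / n
    mean_log = sum(logs) / n
    k = (sum((t - mean_age) * (v - mean_log) for t, v in zip(ages, logs))
         / sum((t - mean_age) ** 2 for t in ages))
    A = math.exp(mean_log - k * mean_age)
    return A, k


# Synthetic points generated from a known curve, purely to exercise the fit.
sample_ages = [10.0, 20.0, 30.0, 40.0, 50.0]
sample_rates = [0.002 * math.exp(0.05 * t) for t in sample_ages]
A_fit, k_fit = fit_exponential(sample_ages, sample_rates)
```

Because the fit is done on the logarithm of the rates, it weights relative rather than absolute errors, which is worth keeping in mind when only a handful of assumed points are available.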
Figure 18 Failure curves by relay class
The failure rate approximations based on the plotted data are:
PG&E has sufficient data on MP relay failures
over a six-year period to derive a failure rate curve. MP
relay failure data from 2008 to 2013 was summarized by age
and failure count. The MP annualized failure rate, λ_MP, is
then calculated as:
Figure 19 Failure rate as a function of age
The failure rate is graphed as a function of age in Figure
19. The blue data points (Series 1) depict the annualized
failure rate of the MP relays. The red curve (Series 2) is the
calculated best fit exponential approximation. The MP failure
rate approximation is:
C. Replacement Profiles
Relay replacement profiles were derived by analyzing
historical relay removals over a six-year period from 2008 to
2013.
• Relay removals were separated by relay class and
  then categorized by age.
• Removal counts and asset counts were obtained for
  each age category and a moving average fleet was
  calculated for each age category.
• The moving average fleet was calculated by
  summing the asset counts over the 6 years and
  dividing by 6.
• The probability for replacement (annualized) was
  calculated by taking the removal count divided by
  the product of the moving average asset count and 6
  years.
• The replacement probability was graphed as a
  function of age.
• A linear or exponential best fit trend line was
  fitted to the data points.
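The annualized replacement probability described in the steps above reduces to a short calculation per age category. The sketch below works through it; the function name and the sample counts are hypothetical and are not PG&E data.

```python
from typing import Sequence


def annual_replacement_probability(removal_count: int,
                                   yearly_asset_counts: Sequence[int]) -> float:
    """Removals divided by (moving average fleet x number of years)."""
    years = len(yearly_asset_counts)
    moving_average_fleet = sum(yearly_asset_counts) / years
    if moving_average_fleet == 0:
        return 0.0  # no assets in this age category, so nothing to replace
    return removal_count / (moving_average_fleet * years)


# Hypothetical age category: 30 removals over six years, with yearly asset
# counts that average 100 relays.
example_probability = annual_replacement_probability(
    30, [100, 110, 90, 100, 105, 95])
```

Here the moving average fleet is 100 relays, so 30 removals over 6 years gives 30 / (100 x 6) = 0.05, or 5% per year.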
Figure 20 Replacement probability for EM relays
Figure 21 Replacement probability for SS relays
According to Figure 20 and Figure 21, the replacement
trend for EM and SS relays follows a roughly linear pattern.
The linear replacement probability approximation results are:
Where:
( )
Figure 22 Replacement probability for MP relays
MP relays follow an exponential behavior as shown in Figure
22. The approximation calculated was:
D. Simulation Results from Sustainability Model
Electromechanical (EM) Model
The EM model results discussed in this paper are
simulated with a 4% annual removal rate by others and a zero
proactive replacement rate since EM relays are not replaced
in kind. A rate of 4% was selected because it is closest to
historical PG&E replacement rates.
Figure 23 Fleet forecast for EM relays
Figure 24 Failure forecast for EM relays
The EM Fleet Waterfall Profile, Figure 23, and the EM
Failures and Forceouts, Figure 24, show that
electromechanical relays are a sustainable fleet. The blue
curve on the waterfall profile shows the fleet in year 1 and the
red curve shows the fleet in 30 years. There is no bubble in
the fleet profile and average EM relay failures are expected to
decrease over time (from 35 to less than 20 over 50 years) as
more relays are removed, aging units fail, and the population
decreases.
The number of expected failures forecasted by the model
is significantly greater than the reported failures PG&E
observes annually. Over the last six years, reported
EM relay failures have averaged about 2-3 a year, whereas the
model shows approximately 35 a year. One can infer that EM
relay failures at PG&E are significantly under-reported. This
is expected, as EM relays fail silently and do not have
self-alarming capabilities.
Figure 25 below combines two graphs: the asset count
histogram and the average age for the EM fleet.
It shows that the EM relays continue to age while the
population drops significantly (red histogram in year 30)
due to continued removals. No new relays are added, hence
the decline.
Figure 25 Asset count and average age forecast for EM relays
Solid State (SS) Model
The SS model results discussed in this paper are
simulated with a 4% annual removal rate by others, a zero
proactive replacement rate, and a maximum annual removal
rate of 100 relays at age 20 years or higher.
Figure 26 Fleet forecast for SS relays
Figure 27 Failure forecast for SS relays
The SS Fleet Waterfall Profile, Figure 26, and the SS
Failures and Forceouts, Figure 27, show that solid state relays
are a diminishing class of relays. In fact, Figure 27 shows that
SS relays become extinct around year 20. Like EM relays, the
failures decrease over time, but much more sharply.
One can also conclude that there is some under-reporting of
SS relay failures. Over the last six years, reported SS
relay failures have averaged about 4-5 a year, whereas the
model shows around 12 failures a year.
The combination graph below, Figure 28, shows that SS
relays are non-existent in 30 years. There is no red histogram
and the average age drops at year 20 and reaches 0 around
year 22. This aligns with the Failures and Forceouts graph.
The SS fleet is expected to die out between years 20 and
22.
Figure 28 Asset count and average age forecast for SS relays
Microprocessor (MP) Model
The MP model results discussed in this paper are
simulated with a 4% annual replacement rate by others, a
proactive annual replacement rate of 50 relays per year, and a
maximum annual replacement rate of 1500 relays at age 20
years or higher.
Figure 29 Fleet forecast for MP relays
Figure 30 Failure forecast for MP relays
The Fleet Waterfall Profile, Figure 29, shows that
microprocessor relays are a sustainable fleet. The fleet profile
in 30 years (red curve) closely follows the current fleet
profile (blue curve). MP relay failures, Figure 30, show an
increasing trend, from an average of 45 failures in
year 1 to 60 failures in year 50. This failure trend is expected
as the waterfall curve shows an increase in the population at
year 30 of more than one-third (38%).
The asset histogram portion of the combination graph,
Figure 31, depicts a fairly even distribution in the younger age
ranges between 0-10 years in the MP fleet in year 30 (red
histogram). There is an increase in the overall population, and
no relays older than 20 years exist in year 30.
This is expected as the model applies a separate replacement
policy to replace relays age 20 or older. The number of relays
in the fleet decreases as the units age. At around year 13, there
is a decline in each subsequent year. Ideally, a fleet with fewer
older relays and more new relays is sustainable and desired.
The average age section of Figure 31 shows that there is
an overall slight increase in the average age (up 1 year over
50 years) with a few dips along the way. The valleys in the
curve can be attributed to the separate replacement policy of
removing relays 20 years or older from the fleet. As a large
number of older relays are replaced with new relays, the
average age drops substantially.
Figure 31 Asset count and average age forecast for MP relays
Total Expected Failures Histogram (All Relay Classes)
The total probability for failures across all three classes
of relays in year 1 is shown in Figure 32. The most
likely outcome is approximately 87 failures (3.7%
probability) in year 1, with the probability decreasing
the further the failure count moves away from this peak.
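A distribution of this kind can be built from the Monte Carlo output by counting how often each total failure count occurs across iterations. The sketch below assumes that approach; the paper does not state how Figure 32 was produced, and the function names and the sample totals are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def failure_count_distribution(totals: Sequence[int]) -> Dict[int, float]:
    """Map each total failure count to the fraction of iterations producing it."""
    counts = Counter(totals)
    n = len(totals)
    return {count: occurrences / n
            for count, occurrences in sorted(counts.items())}


def most_likely_count(distribution: Dict[int, float]) -> Tuple[int, float]:
    """Return the (failure count, probability) pair at the peak."""
    return max(distribution.items(), key=lambda item: item[1])


# Hypothetical year 1 totals from four iterations, one total per iteration.
example_totals = [86, 87, 87, 88]
example_distribution = failure_count_distribution(example_totals)
peak = most_likely_count(example_distribution)
```

With 1000 iterations the peak probability is naturally small, because the probability mass is spread over many neighbouring failure counts.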
Figure 32 Distribution of failure probability for all relay classes
Figure 33 Total failure forecast for all relay classes
Total Failures and Forceouts (All Relay Classes)
The overall number of failures shown in Figure 33 for
the entire relay fleet across all three classes of relays is
expected to decrease from approximately 90 in year 1 to less
than 80 in year 50. There are a couple of periods of increase,
which are primarily due to microprocessor relay failures
outnumbering both EM and SS relay failures. However, the
net result is a decrease in failures by year 50.
Life Expectancy and Useful Life
Asset life expectancy is the length of time until the asset
must be retired, replaced, or removed from service.
Determining when an asset reaches the end of its service life
generally entails consideration of the cost and effectiveness
of repair and maintenance actions that might be taken to
further extend the asset’s life expectancy [5]. The life
expectancy predictions from the relay models are listed in
Table 8. These predictions far exceed the life expectancies
typically associated with EM, SS, and MP relays of 40, 20,
and 15 years, respectively. The typical values are about half
of the model predictions and also exceed the useful life.
Table 8 Life expectancy of relay classes
Relay class          Life Expectancy (years)
Electromechanical    77
Solid State          38
Microprocessor       30
Useful life of an asset is typically shorter than the life
expectancy of an asset. It is the time period the asset is in
service to prevent a run to failure situation or a run to poor
condition that deems the asset non-functional. PG&E has not
yet performed an analysis to determine relay useful life. It is
an item for future strategy consideration. IEEE PSRC
Working Group I22 is currently working on a report titled
“Condition Assessment of P&C Devices” to determine the
end-of-useful-life for protection, control, and monitoring
devices including electromechanical, solid-state, and
microprocessor-based devices [6].
Recommendations based on the Sustainability Model
The results of the model suggest that the strategy should
consist of various factors. The model demonstrates that
PG&E relays are a sustainable asset if the replacement rate
continues at around 4%, a policy is in place to replace SS and
MP relays at age 20, and a proactive replacement program is
in place to target high failure rate models at critical locations.
XI. CONCLUDING REMARKS – GUIDING PRINCIPLES
The transition to a microprocessor relay fleet has made it
apparent that the need for a sustainable relay strategy is
paramount. MP relays do not outlive other assets (e.g., HV
circuit breakers and transformers) like EM relays do. The
relay strategy should be multi-faceted. It should strive to
improve the life-cycle cost of relays by balancing service life
with cost of failure while addressing safety, reliability, and
compliance.
One facet of the strategy is to continue with the work
plan by Others, i.e., work triggered by other business drivers
outside of System Protection needs such as capacity,
reliability, or modernization. More than 90% of PG&E’s
relay installations over the last five years were driven by
Other work. It is likely that this trend of relays being installed
by Other work will increase in the next five years as the
breaker and transformer replacement plan is expected to
increase. This should be coupled with a targeted replacement
program based on the relay performance index described in
Section IX and a tracking process to link the relays to
projects to provide visibility of future work affecting relays.
Another facet of the strategy is continued tracking of
system failure rates and failure analysis to understand whether
standardization and the relay vendors selected are effective.
In addition to the facets focused on in this paper, a
successful relay asset strategy should be multi-dimensional
and also include additional facets to support the
implementation of the strategy, such as simple and inclusive
design standards to facilitate relay replacement, a manageable
budget and work plan, and resources to support execution.
XII. REFERENCES
[1] J. Sykes, A. Feathers, E. Udren and B. Gwyn, "CREATING A
SUSTAINABLE PROTECTIVE RELAY ASSET STRATEGY," in Western
Protective Relay Conference, Washington State University, 2012.
[2] NERC, "Standard PRC-005-2 - Protection System Maintenance".
[3] CPUC, "CPUC Proceeding to Develop a Risk-Based Decision-Making
Framework to Evaluate Safety and Reliability Improvements and Revise
the General Rate Case Plan for Energy Utilities," 22 Nov 2013.
[Online]. Available:
http://docs.cpuc.ca.gov/SearchRes.aspx?DocFormat=ALL&DocID=81856126.
[Accessed 14 Aug 2014].
[4] Merriam-Webster, "Risk," [Online]. Available:
http://www.merriam-webster.com/dictionary/risk.
[Accessed 2 Sep 2014].
[5] Transportation Research Board, "Methodology for Estimating Life
Expectancies of Highway Assets," 2013. [Online]. Available:
http://apps.trb.org/cmsfeed/TRBNetProjectDisplay.asp?ProjectID=2497.
[Accessed 04 Sep 2014].
[6] B. Beresh (Chair) and B. Mackie (Vice Chair), "Condition Assessment
of P&C Devices," IEEE PSRC Working Group - I22, Draft #7
(May/2014, post PSRC meeting).
[7] D. Ransom, "UPGRADING RELAY PROTECTION?—BE
PREPARED," in Western Protective Relay Conference, Washington
State University, 2013.
XIII. BIOGRAPHY
Aaron Feathers is a Principal Engineer in System Protection at Pacific Gas and Electric Company, where he has been employed since 1992. He has 22
years of experience in the application of protective relaying and control
systems on transmission systems. Aaron's current job responsibilities include
design standards, wide area RAS support, NERC PRC compliance, and relay
asset management support. He has a BSEE degree from California State
Polytechnic University, San Luis Obispo and is a registered Professional Engineer in the State of California. He is also a member of IEEE and is on
the Western Protective Relay Conference planning committee and the NERC Protection System Maintenance Standard Drafting Team developing NERC
Standard PRC-005-X.
Abesh Mubaraki is an entry engineer in the Engineer Rotation and
Development Program at Pacific Gas and Electric Company, where he has
been employed since 2013. Abesh’s current rotation is in Transmission Operations Engineering, and his previous rotations were in System
Protection, and Substation and Transmission Line Asset Strategy. He has
BSEE and MSEE degrees from California State Polytechnic University, San Luis Obispo, and he is a member of IEEE.
Ana Nungo is a Project Engineer in Substation Design and Engineering at Pacific Gas and Electric Company, where she has been employed since 2011.
Her prior experience includes Transmission Planning and Substation and
Transmission Line Asset Strategy. During her time in Asset Strategy, Ana's
responsibilities included working with System Protection to develop a joint
strategy for protective relays. She has a BSEE degree from University of
Illinois at Chicago and recently received her PMP certification. She is also a member of the Society of Hispanic Professional Engineers (SHPE).
Nai Paz is a Senior Electric Standards Engineer in Substation and Transmission Line Asset Strategy at Pacific Gas and Electric Company,
where she has been employed since 2003. She has 10 years of experience in
Substation Design and Engineering. Nai has been in her current role in Asset Strategy for 10 months. Her current job responsibilities include leading the
development of protective relay and SCADA strategies, performing risk,
data, and failure analyses, and developing risk based ranking methodologies for protective relay and SCADA projects. She has a BSEE degree from
California State University, Sacramento and is a registered Professional
Engineer in the State of California.